June 27, 2022

Java Program to Compress File in bzip2 Format in Hadoop

This post shows how to write a Java program to compress a file in HDFS using bzip2 compression. The program takes an input file from the local file system and writes a BZip2 compressed file as output in HDFS.

Java program to compress file in bzip2 format

The Hadoop compression codec that has to be used for bzip2 is org.apache.hadoop.io.compress.BZip2Codec.

To get that codec, the getCodecByClassName() method of the CompressionCodecFactory class is used.

To create a CompressionOutputStream, the createOutputStream(OutputStream out) method of the codec class is used. The CompressionOutputStream is then used to write the file data in compressed form to the stream.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionOutputStream;

public class HDFSCompressWrite {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    InputStream in = null;
    OutputStream out = null;
    try {
      FileSystem fs = FileSystem.get(conf);
      // Input file from local file system
      in = new BufferedInputStream(new FileInputStream("/home/knpcode/Documents/knpcode/Hadoop/Test/data.txt"));
      //Compressed Output file
      Path outFile = new Path("/user/compout/test.bz2");
      // Verification
      if (fs.exists(outFile)) {
        System.out.println("Output file already exists");
        throw new IOException("Output file already exists");
      }
      out = fs.create(outFile);

      // For bzip2 compression
      CompressionCodecFactory factory = new CompressionCodecFactory(conf);
      CompressionCodec codec = factory.getCodecByClassName("org.apache.hadoop.io.compress.BZip2Codec");
      CompressionOutputStream compressionOutputStream = codec.createOutputStream(out);
      
      try {
        IOUtils.copyBytes(in, compressionOutputStream, 4096, false);
        compressionOutputStream.finish();
        
      } finally {
        IOUtils.closeStream(in);
        IOUtils.closeStream(compressionOutputStream);
      }
      
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}

Executing program in Hadoop environment

To execute the above Java program in a Hadoop environment, you will need to add the directory containing the .class file for the Java program to Hadoop's classpath.

export HADOOP_CLASSPATH='/huser/eclipse-workspace/knpcode/bin'

I have my HDFSCompressWrite.class file at /huser/eclipse-workspace/knpcode/bin, so I have exported that path.

Then you can run the program using the following command-

$ hadoop org.knpcode.HDFSCompressWrite

18/03/09 17:10:04 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
18/03/09 17:10:04 INFO compress.CodecPool: Got brand-new compressor [.bz2]

The input file used in the program is large enough that even after compression the file size is more than 128 MB; that way we can ensure it is stored as two separate blocks in HDFS. Since the bzip2 format in Hadoop supports splitting, a MapReduce job that has this compressed file as input should be able to create two separate input splits corresponding to the two blocks.
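If you want to verify programmatically that bzip2 is a splittable codec, one option is to check whether the resolved codec implements the SplittableCompressionCodec interface, which is what input formats such as TextInputFormat consult when deciding whether a compressed file can be split. The following is a minimal sketch; the class name CheckSplittable is just a placeholder, and the path is the output file created above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class CheckSplittable {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    // Resolve the codec from the file extension (.bz2)
    CompressionCodec codec = factory.getCodec(new Path("/user/compout/test.bz2"));
    System.out.println("Codec: " + codec.getClass().getName());
    // BZip2Codec implements SplittableCompressionCodec, so this should print true
    System.out.println("Splittable: " + (codec instanceof SplittableCompressionCodec));
  }
}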

First, check whether the compressed output file in bzip2 format has been created.

$ hdfs dfs -ls /user/compout

Found 1 items
-rw-r--r--   1 knpcode supergroup  228651107 2018-03-09 17:11 /user/compout/test.bz2

You can see that the compressed file size is around 228 MB, so it should be stored as two separate blocks in HDFS.

You can check that using the HDFS fsck command.

$ hdfs fsck /user/compout/test.bz2

 Status: HEALTHY
 Total size:	228651107 B
 Total dirs:	0
 Total files:	1
 Total symlinks:		0
Total blocks (validated):	2 (avg. block size 114325553 B)
 Minimally replicated blocks:	2 (100.0 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	0 (0.0 %)

FSCK ended at Fri Mar 09 17:17:13 IST 2018 in 3 milliseconds

If you give this compressed file as input to a MapReduce job, the job should be able to create two input splits, since the bzip2 format supports splitting. To check that, this file was given as input to a wordcount MapReduce program.

$ hadoop jar /home/knpcode/Documents/knpcode/Hadoop/wordcount.jar org.knpcode.WordCount /user/compout/test.bz2 /user/output2

18/03/11 12:48:28 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/03/11 12:48:29 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
18/03/11 12:48:30 INFO input.FileInputFormat: Total input files to process : 1
18/03/11 12:48:30 INFO mapreduce.JobSubmitter: number of splits:2

As you can see from the statement displayed on the console, "mapreduce.JobSubmitter: number of splits:2", two splits are created for the map tasks.
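The wordcount program itself is not shown in this post. If you need a reference point, the following is a minimal sketch along the lines of the standard Hadoop WordCount example (the class name here is a placeholder, not necessarily identical to the org.knpcode.WordCount used above). The relevant point for this topic is that the job needs no codec-specific configuration for compressed input; the input format detects the .bz2 extension on its own.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // No codec-specific configuration is needed here; the input format resolves
    // the .bz2 extension through CompressionCodecFactory and, because bzip2 is
    // splittable, still creates one input split per HDFS block.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}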

That's all for the topic Java Program to Compress File in bzip2 Format in Hadoop. If something is missing or you have something to share about the topic, please write a comment.

