June 26, 2022

Java Program to Compress File in gzip Format in Hadoop

In this post we'll see a Java program that shows how to compress a file using the gzip format in Hadoop.

The gzip compression format does not support splitting, so a MapReduce job won't be able to create input splits, though the compressed file can still be stored as separate HDFS blocks (128 MB each by default).

Java program to compress file using gzip format

The Hadoop compression codec that has to be used for gzip is org.apache.hadoop.io.compress.GzipCodec.

To get that codec, the getCodecByClassName() method of the CompressionCodecFactory class is used. To create a CompressionOutputStream, the createOutputStream(OutputStream out) method of the codec class is used. The CompressionOutputStream is then used to write the file data in compressed form to the stream.
package org.knpcode;

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionOutputStream;

public class GzipCompress {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    InputStream in = null;
    OutputStream out = null;
    try {
      FileSystem fs = FileSystem.get(conf);
      // Input file from local file system
      in = new BufferedInputStream(new FileInputStream("/home/knpcode/Documents/knpcode/Hadoop/Test/data.txt"));
      //Compressed Output file
      Path outFile = new Path("/user/compout/test.gz");
      // Verification
      if (fs.exists(outFile)) {
        System.out.println("Output file already exists");
        throw new IOException("Output file already exists");
      }
      out = fs.create(outFile);

      // For gzip compression
      CompressionCodecFactory factory = new CompressionCodecFactory(conf);
      CompressionCodec codec = factory.getCodecByClassName("org.apache.hadoop.io.compress.GzipCodec");
      CompressionOutputStream compressionOutputStream = codec.createOutputStream(out);      
      try {
        IOUtils.copyBytes(in, compressionOutputStream, 4096, false);
        compressionOutputStream.finish();
        
      } finally {
        IOUtils.closeStream(in);
        IOUtils.closeStream(compressionOutputStream);
      }
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}
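
To verify the compressed output you can read it back through the matching codec. The following is a minimal sketch (not part of the original program) that assumes the output path /user/compout/test.gz created above; it uses the getCodec(Path) method of CompressionCodecFactory, which picks the codec based on the file name extension, and the createInputStream() method of the codec to decompress the data while reading.

package org.knpcode;

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class GzipDecompress {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Compressed file created by the program above
    Path inFile = new Path("/user/compout/test.gz");
    // Pick the codec from the file extension (.gz -> GzipCodec)
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(inFile);
    if (codec == null) {
      System.err.println("No codec found for " + inFile);
      return;
    }
    // Wrap the HDFS input stream so bytes are decompressed while reading
    InputStream in = codec.createInputStream(fs.open(inFile));
    try {
      // Write the decompressed content to standard output
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}

Running it should print the original content of data.txt to the console.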

Executing the program in Hadoop environment

To execute the above Java program in the Hadoop environment, you will need to add the directory containing the .class file for the Java program to Hadoop's classpath.

export HADOOP_CLASSPATH='/huser/eclipse-workspace/knpcode/bin'

I have my GzipCompress.class file in location /huser/eclipse-workspace/knpcode/bin so I have exported that path.
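
If you still need to compile the program, one option (assuming the source file GzipCompress.java is in the current directory) is to compile it against the Hadoop classpath and place the .class file in that same exported directory-

$ javac -cp $(hadoop classpath) -d /huser/eclipse-workspace/knpcode/bin GzipCompress.java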

Then you can run the program using the following command-

$ hadoop org.knpcode.GzipCompress

18/03/11 12:59:49 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
18/03/11 12:59:49 INFO compress.CodecPool: Got brand-new compressor [.gz]

The input file used in the program is large enough to ensure that even after compression the file size is more than 128 MB; that way we can ensure that it is stored as two separate blocks in HDFS.

You can check that by using the hdfs fsck command.

$ hdfs fsck /user/compout/test.gz

.Status: HEALTHY
 Total size:	233963084 B
 Total dirs:	0
 Total files:	1
 Total symlinks:		0
 Total blocks (validated):	2 (avg. block size 116981542 B)

FSCK ended at Wed Mar 14 21:07:46 IST 2018 in 6 milliseconds

Since gzip doesn't support splitting, using this compressed file as input for a MapReduce job means only one split will be created for the map task.

To test how many input splits are created, this compressed gzip file was given as input to the wordcount MapReduce program.

$ hadoop jar /home/knpcode/Documents/knpcode/Hadoop/wordcount.jar org.knpcode.WordCount /user/compout/test.gz /user/output3

18/03/11 13:09:23 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/03/11 13:09:23 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
18/03/11 13:09:23 INFO input.FileInputFormat: Total input files to process : 1
18/03/11 13:09:24 INFO mapreduce.JobSubmitter: number of splits:1

As you can see from the line displayed on the console, mapreduce.JobSubmitter: number of splits:1, only one input split is created for the MapReduce job even though there are two HDFS blocks, as the gzip compressed file is not splittable.

That's all for the topic Java Program to Compress File in gzip Format in Hadoop. If something is missing or you have something to share about the topic please write a comment.

