How to Compress MapReduce Job Output

June 29, 2022


To compress the output of a MapReduce job in Hadoop, you can either set properties in the job configuration (per job) or set the same properties in mapred-site.xml (for the whole cluster).

Properties for compressing MapReduce job output

mapreduce.output.fileoutputformat.compress - Set to true if job outputs should be compressed. Default is false.
mapreduce.output.fileoutputformat.compress.type - If the job outputs are to be compressed as SequenceFiles, how should they be compressed? Should be one of NONE, RECORD or BLOCK. Default is RECORD.
mapreduce.output.fileoutputformat.compress.codec - If the job outputs are compressed, which codec is to be used. Default is org.apache.hadoop.io.compress.DefaultCodec.
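If the job's driver parses Hadoop's generic options (for example by running through ToolRunner), the same three properties can also be supplied on the command line as -D name=value pairs. The plain-Java sketch below only assembles such an option string and checks the compression type against the allowed values; the class CompressionOptions and its methods are hypothetical names used for illustration, not part of Hadoop.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class CompressionOptions {
    // Hypothetical helper: collects the three output-compression properties in order.
    static Map<String, String> outputCompression(String type, String codecClass) {
        // compress.type only accepts NONE, RECORD or BLOCK
        if (!type.equals("NONE") && !type.equals("RECORD") && !type.equals("BLOCK")) {
            throw new IllegalArgumentException("Unsupported compression type: " + type);
        }
        Map<String, String> props = new LinkedHashMap<>();
        props.put("mapreduce.output.fileoutputformat.compress", "true");
        props.put("mapreduce.output.fileoutputformat.compress.type", type);
        props.put("mapreduce.output.fileoutputformat.compress.codec", codecClass);
        return props;
    }

    // Renders the properties as "-D name=value" generic options separated by spaces.
    static String toGenericOptions(Map<String, String> props) {
        StringJoiner joined = new StringJoiner(" ");
        for (Map.Entry<String, String> entry : props.entrySet()) {
            joined.add("-D " + entry.getKey() + "=" + entry.getValue());
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        Map<String, String> props =
                outputCompression("BLOCK", "org.apache.hadoop.io.compress.GzipCodec");
        System.out.println(toGenericOptions(props));
    }
}
```

The generated options go after the main class name and before the input and output paths when the job is launched. They have no effect if the driver does not parse generic options.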

Making changes in mapred-site.xml

If you want to compress the output of every MapReduce job running on the cluster, add these properties inside the <configuration> element of mapred-site.xml.

<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.type</name>
  <value>RECORD</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
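After editing the file it can be useful to confirm that the properties are well-formed and carry the intended values. The following is a minimal sketch using only the JDK's XML parser; the class SiteXmlCheck and the method readProperties are hypothetical, and the XML is passed as a string here rather than read from a real mapred-site.xml path.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class SiteXmlCheck {
    // Hypothetical helper: returns every <property> as a name -> value entry.
    static Map<String, String> readProperties(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> props = new LinkedHashMap<>();
            NodeList nodes = doc.getElementsByTagName("property");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element property = (Element) nodes.item(i);
                String name = property.getElementsByTagName("name").item(0).getTextContent().trim();
                String value = property.getElementsByTagName("value").item(0).getTextContent().trim();
                props.put(name, value);
            }
            return props;
        } catch (Exception e) {
            // Malformed XML or a parser setup problem
            throw new IllegalArgumentException("Could not parse configuration XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<configuration>"
                + "<property><name>mapreduce.output.fileoutputformat.compress</name>"
                + "<value>true</value></property>"
                + "<property><name>mapreduce.output.fileoutputformat.compress.codec</name>"
                + "<value>org.apache.hadoop.io.compress.GzipCodec</value></property>"
                + "</configuration>";
        readProperties(xml).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

This only checks the file's contents. It does not tell you which value a running job actually picked up, since a job's own configuration can override the cluster-wide setting.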

Making changes in Job configuration

If you want to compress the output of only a specific MapReduce job, set the properties in that job's configuration, for example in its driver class.

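// FileOutputFormat is in org.apache.hadoop.mapreduce.lib.output, GzipCodec in org.apache.hadoop.io.compress
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);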
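With GzipCodec and plain text output, the part files are written with a .gz extension and are ordinary gzip streams, so once a file has been copied to the local file system it can be read with the JDK alone. The sketch below assumes exactly that situation; the class GzipPartReader, the method names and the sample records are hypothetical, and writeSample only creates a stand-in file so the example runs without a cluster.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipPartReader {
    // Reads every line of a gzip-compressed text file from the local file system.
    static List<String> readLines(Path gzFile) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(gzFile)), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    // Stand-in for real job output: writes the given text to a temporary .gz file.
    static Path writeSample(String text) {
        try {
            Path file = Files.createTempFile("part-r-00000", ".gz");
            try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(file))) {
                out.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path sample = writeSample("hadoop\t3\nmapreduce\t5\n");
        readLines(sample).forEach(System.out::println);
    }
}
```

This applies to text output only. SequenceFile output is compressed inside the file's own container format and needs Hadoop's SequenceFile reader instead.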

If you are using the SequenceFile output format, you can set the compression type too.

// CompressionType here is org.apache.hadoop.io.SequenceFile.CompressionType
SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);

That's all for the topic How to Compress MapReduce Job Output. If something is missing or you have something to share about the topic please write a comment.

KnpCode
