June 29, 2022

How to Compress MapReduce Job Output

To compress the output of a MapReduce job in Hadoop, you can either set properties in the job configuration on a per-job basis, or set them in mapred-site.xml so that they apply to the whole cluster.

Properties for compressing MapReduce job output

  • mapreduce.output.fileoutputformat.compress - Set to true if job outputs should be compressed. Default is false.
  • mapreduce.output.fileoutputformat.compress.type - If the job outputs are to be compressed as SequenceFiles, how they should be compressed. Should be one of NONE, RECORD or BLOCK. Default is RECORD.
  • mapreduce.output.fileoutputformat.compress.codec - If the job outputs are compressed, the codec to use. Default is org.apache.hadoop.io.compress.DefaultCodec.
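The names, defaults and allowed values listed above can be captured in a small helper. This is only a sketch: OutputCompressionSettings is a hypothetical class written for illustration, not part of Hadoop, and it only builds name/value pairs that you would then copy into mapred-site.xml or a job configuration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not a Hadoop class) modelling the three output-compression properties. */
public class OutputCompressionSettings {
    // Property names exactly as listed above.
    public static final String COMPRESS = "mapreduce.output.fileoutputformat.compress";
    public static final String COMPRESS_TYPE = "mapreduce.output.fileoutputformat.compress.type";
    public static final String COMPRESS_CODEC = "mapreduce.output.fileoutputformat.compress.codec";

    /** Allowed values for compress.type; only meaningful for SequenceFile output. */
    public enum Type { NONE, RECORD, BLOCK }

    // Defaults as documented above.
    private boolean compress = false;
    private Type type = Type.RECORD;
    private String codec = "org.apache.hadoop.io.compress.DefaultCodec";

    public OutputCompressionSettings compress(boolean value) {
        this.compress = value;
        return this;
    }

    /** Rejects anything other than NONE, RECORD or BLOCK with an IllegalArgumentException. */
    public OutputCompressionSettings type(String value) {
        this.type = Type.valueOf(value);
        return this;
    }

    public OutputCompressionSettings codec(String className) {
        this.codec = className;
        return this;
    }

    /** Returns the settings as ordered name/value pairs. */
    public Map<String, String> toProperties() {
        Map<String, String> props = new LinkedHashMap<>();
        props.put(COMPRESS, Boolean.toString(compress));
        props.put(COMPRESS_TYPE, type.name());
        props.put(COMPRESS_CODEC, codec);
        return props;
    }

    public static void main(String[] args) {
        // Print the settings for gzip-compressed, block-compressed SequenceFile output.
        new OutputCompressionSettings()
                .compress(true)
                .type("BLOCK")
                .codec("org.apache.hadoop.io.compress.GzipCodec")
                .toProperties()
                .forEach((name, value) -> System.out.println(name + "=" + value));
    }
}
```

Validating compress.type up front is a design choice of this sketch; it catches a mistyped value before it ever reaches a configuration file.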

Making changes in mapred-site.xml

If you want to compress the output of every MapReduce job running on the cluster, add these properties inside the <configuration> element of mapred-site.xml.

<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.type</name>
  <value>RECORD</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
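One way to check that the settings above took effect: a job that writes plain text output with GzipCodec produces part files with a .gz extension (for example part-r-00000.gz), and these are ordinary gzip files. The sketch below reads such a file with the JDK's own java.util.zip classes, assuming the file has already been copied to the local file system. GzipPartReader is a hypothetical class name, and the path is whatever you pass on the command line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

/** Hypothetical helper that reads a gzip-compressed text part file from the local file system. */
public class GzipPartReader {

    /** Decompresses the file and returns its lines, decoded as UTF-8. */
    public static List<String> readLines(Path gzFile) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(gzFile)), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: local path of a compressed part file, e.g. a copy of part-r-00000.gz
        for (String line : readLines(Paths.get(args[0]))) {
            System.out.println(line);
        }
    }
}
```

This only applies to text output; SequenceFile output is compressed inside the file's own container format and needs Hadoop's SequenceFile reader instead.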

Making changes in Job configuration

If you want to compress the output of only a specific MapReduce job, add the following calls to that job's driver. They set the corresponding properties in the job configuration.

FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
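For context, here is how those two calls might sit inside a complete driver. This is a minimal sketch, assuming the Hadoop client libraries are on the classpath and the new org.apache.hadoop.mapreduce API is used. The class name CompressedOutputDriver and the job name are made up for illustration, and no mapper or reducer is set, so the job simply passes its input through.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Hypothetical driver: an identity job whose text output is gzip-compressed. */
public class CompressedOutputDriver {

    /** Builds the job without submitting it, so the configuration can be inspected. */
    public static Job configure(Configuration conf, String in, String out) throws IOException {
        Job job = Job.getInstance(conf, "compressed-output-sketch");
        job.setJarByClass(CompressedOutputDriver.class);

        // No mapper or reducer is set, so Hadoop's identity Mapper and Reducer run.
        // With the default TextInputFormat the records are (LongWritable offset, Text line).
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(in));
        FileOutputFormat.setOutputPath(job, new Path(out));

        // The two calls from the article: enable output compression and pick the codec.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        return job;
    }

    public static void main(String[] args) throws Exception {
        // args[0]: input path, args[1]: output path (must not exist yet)
        Job job = configure(new Configuration(), args[0], args[1]);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Because the compression calls are made on this Job instance only, other jobs on the cluster keep whatever mapred-site.xml specifies.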

If you are using the SequenceFile output format, you can set the compression type too.

SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);

That's all for the topic How to Compress MapReduce Job Output. If something is missing or you have something to share about the topic please write a comment.

