June 29, 2022

How to Compress Map Phase Output in Hadoop MapReduce

In a Hadoop MapReduce job you can opt to compress the output of the Map phase. Since the output of a Map task is written to local disk and is also transferred across the network to the reducer nodes, compressing it reduces both disk I/O and network traffic, which should help your MapReduce job run faster.
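
To get a rough feel for how much intermediate data can shrink, here is a small sketch that runs outside Hadoop. It uses the JDK's java.util.zip.Deflater (DEFLATE, the algorithm family behind Hadoop's DefaultCodec) as a stand-in, because Snappy is not part of the JDK. The class name, the method names and the word-count style sample records are all made up for illustration, and real map output will compress differently.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class, not part of Hadoop.
class MapOutputCompressionSketch {

    // Builds fake intermediate records of the form "word<TAB>1\n",
    // similar to what a word-count mapper would emit.
    static byte[] sampleMapOutput(int records) {
        String[] words = {"hadoop", "map", "reduce", "shuffle", "spill"};
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < records; i++) {
            sb.append(words[i % words.length]).append('\t').append(1).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Compresses the bytes with DEFLATE at the fastest level (stand-in for Snappy).
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Restores the original bytes, as the reduce side would after fetching map output.
    static byte[] decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated or unexpected input, stop instead of looping forever
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt compressed data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = sampleMapOutput(10_000);
        byte[] packed = compress(raw);
        System.out.println("raw bytes: " + raw.length + ", compressed bytes: " + packed.length);
        System.out.println("round trip ok: " + Arrays.equals(raw, decompress(packed)));
    }
}
```

Running main prints the raw and compressed sizes for the sample records. Fewer bytes here means fewer bytes spilled to local disk and fetched by the reducers, at the cost of some CPU time on both sides.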

You can use a fast compressor like Snappy or LZ4 for compressing map output. Whether the compressor is splittable or not doesn’t matter for intermediate Map output, since that data is consumed by the reducers and is never divided into input splits.

This tutorial gives the configuration steps for compressing Map output using the Snappy codec.

If you don’t have the native Snappy library installed, you can install it on Ubuntu using the following command. Using native libraries makes compression faster and helps improve the performance of the MapReduce job.

$ sudo apt-get install libsnappy-dev

Required config changes

If you want to compress the output of the map phase with Snappy for the whole cluster, set the properties mapreduce.map.output.compress and mapreduce.map.output.compress.codec in mapred-site.xml.
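
A minimal sketch of the mapred-site.xml entries, assuming the standard Hadoop property layout of name and value pairs, with the values taken from the per-job code in this post. The entries go inside the file's existing configuration element.

```xml
<!-- Turn on compression of intermediate map output -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<!-- Use the Snappy codec for that compression -->
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```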


The properties are described below:

  • mapreduce.map.output.compress: Whether the outputs of the maps should be compressed before being sent across the network. Default is false.
  • mapreduce.map.output.compress.codec: The codec to use if the map outputs are compressed. Default is org.apache.hadoop.io.compress.DefaultCodec.

If you want to compress the map output on a per-job basis instead, add the following lines to your job driver and create the job from this Configuration.

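import org.apache.hadoop.conf.Configuration;

// Enable Snappy compression of map output for this job only
Configuration conf = new Configuration();
conf.setBoolean("mapreduce.map.output.compress", true);
conf.set("mapreduce.map.output.compress.codec", "org.apache.hadoop.io.compress.SnappyCodec");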

That's all for the topic How to Compress Map Phase Output in Hadoop MapReduce. If something is missing or you have something to share about the topic please write a comment.
