June 27, 2022

Data Compression in Hadoop Framework

In the Hadoop framework, where large data sets are stored and processed, you need storage for large files. These files are divided into blocks, and those blocks are stored on different nodes across the cluster, so a lot of I/O and network data transfer is also involved. In order to reduce the storage requirements and the time spent on network transfer, you can have a look at data compression in the Hadoop framework.

What can you compress

Using data compression in Hadoop you can compress files at various steps; at all of these steps it helps to reduce the storage used and the quantity of data transferred.

Compressing input files

You can compress the input file itself. That will help you reduce storage space in HDFS. One thing to consider here is whether the compression format used is splittable or not (see the section Compression and splitting for more details).

If you compress the input files then the files will be decompressed automatically when processed by a MapReduce job. Based on the extension of the file, the appropriate codec will be used.
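
For example, it is Hadoop's CompressionCodecFactory that maps a file extension to a codec. A minimal sketch of doing that lookup yourself, assuming a hypothetical input path, would look like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecLookup {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Factory that maps file extensions (.gz, .bz2, .snappy ...) to codec classes
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    // Hypothetical input path - replace with your own file
    Path inputPath = new Path("/user/input/logs.gz");
    CompressionCodec codec = factory.getCodec(inputPath);
    if (codec == null) {
      System.out.println("No codec found, file will be read as it is");
    } else {
      // For a .gz file this prints org.apache.hadoop.io.compress.GzipCodec
      System.out.println("Codec found: " + codec.getClass().getName());
    }
  }
}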

Compressing intermediate map output

You can compress the intermediate map output. Since map output is written to disk, storage is saved, and since the output from many mappers is sent to reducer nodes, data transfer across the nodes is also reduced.
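
As a rough sketch of the driver-side configuration (the property names are the standard MapReduce ones; SnappyCodec and the job name are just example choices), map output compression can be enabled like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

public class MapOutputCompression {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Turn on compression for the intermediate map output
    conf.setBoolean("mapreduce.map.output.compress", true);
    // Use Snappy for the map output - fast to compress and decompress
    conf.setClass("mapreduce.map.output.compress.codec",
        SnappyCodec.class, CompressionCodec.class);
    Job job = Job.getInstance(conf, "map output compression example");
    // ... set mapper, reducer, input and output paths as usual ...
  }
}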

Compressing output files

You can also configure the output of a MapReduce job to be compressed in Hadoop. That helps in reducing storage space if you are archiving the output or sending it to some other application for further processing.

Refer How to Compress MapReduce Job Output to see how to compress the output of a MapReduce job.
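
As a quick sketch of the driver side (GzipCodec and the output path are just example choices), output compression is set through FileOutputFormat:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobOutputCompression {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "compressed job output example");
    // Compress the final output files written by the job
    FileOutputFormat.setCompressOutput(job, true);
    // Write the output using gzip, files get the .gz extension
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    // Hypothetical output path - replace with your own
    FileOutputFormat.setOutputPath(job, new Path("/user/output/compressed"));
    // ... set mapper, reducer and input path as usual ...
  }
}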

Compression and splitting

When you compress an input file in Hadoop that has to be processed by a MapReduce job, you will also have to consider whether the MapReduce job will be able to read those compressed blocks as separate splits or not.

Generally, when you store a file in HDFS, it is divided into blocks of 128 MB (the default block size) and stored. A MapReduce job which uses this file as input will create as many input splits as there are blocks. These input splits will then be processed by separate map tasks in parallel.

As an example, if you have a 1 GB file it will be stored as 8 data blocks in HDFS. A MapReduce job that uses this file will also create 8 input splits, and these input splits will then be processed by separate map tasks in parallel.

If you have a compressed 1 GB file where the compression format used is not splittable, like gzip, then HDFS still stores the file as 8 separate blocks. But the MapReduce job, at the time of processing these compressed blocks, won't be able to create an input split for each block because it is not possible to start reading at an arbitrary point in a gzipped file.

Since it is not possible to create input splits in this scenario, a single map task will process all the HDFS blocks. The end result is that you lose the advantage of parallel processing, as only one map task is processing all the data, and there is data transfer overhead too, as all the blocks have to be transferred to the node where the map task is running.

That is why it is important to consider, while compressing the input file, whether the compression format used is splittable or not.

Compression formats in Hadoop

There are several compression formats available for use in the Hadoop framework. Some of them compress better (more space saving, better compression ratio) while others compress and decompress faster (though they compress less).

You will also need to consider whether the compression format is splittable or not.

Deflate- It is the compression algorithm used by zlib as well as gzip compression tools. Filename extension is .deflate.

gzip- Gzip provides a high compression ratio but is not as fast as LZO or Snappy. It is not splittable. Filename extension is .gz. Better suited to be used with data that is not accessed frequently.

bzip2- Bzip2 provides a higher compression ratio than gzip but compression and decompression speed is lower. Bzip2 is the only compression format that has splittable support built into Hadoop. In the Hadoop framework there is an interface SplittableCompressionCodec which is meant to be implemented by those compression codecs that are capable of compressing/decompressing a stream starting at any arbitrary position. BZip2Codec is the only implementing class of this interface (see the sketch after this list). Filename extension is .bz2.

Refer Java Program to Compress File in bzip2 Format in Hadoop to see how to use bzip2 compression in Hadoop.

LZO- It is optimized for speed so the compression ratio is lower. It is not splittable by default, but you can index the lzo files to make them splittable in Hadoop. Filename extension is .lzo.

LZ4- It is optimized for speed so the compression ratio is lower. It is not splittable, though there is a library (4MC) that can make lz4 files splittable. Refer https://github.com/carlomedas/4mc. Filename extension is .lz4.

Snappy- It concentrates more on the speed of compression and decompression, so the compression ratio is lower. It is not splittable. Filename extension is .snappy.

Zstandard- Zstandard is a real-time compression algorithm, providing high compression ratios along with high speed. It is not splittable, though the same 4MC library can also make Zstandard files splittable. Refer https://github.com/carlomedas/4mc. Filename extension is .zst.
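
As a small illustration of the splittability point made above for bzip2, you can check whether the codec chosen for a file implements SplittableCompressionCodec (the file names below are hypothetical examples):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittableCheck {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    // Hypothetical files - the .bz2 one should report splittable, the .gz one should not
    for (String name : new String[] {"/user/input/data.bz2", "/user/input/data.gz"}) {
      CompressionCodec codec = factory.getCodec(new Path(name));
      boolean splittable = codec instanceof SplittableCompressionCodec;
      System.out.println(name + " -> codec=" + codec.getClass().getSimpleName()
          + ", splittable=" + splittable);
    }
  }
}

This is essentially the same check that the text-based input formats do in their isSplitable() method when deciding whether a compressed input file can be split.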

Compression Codecs in Hadoop

The Hadoop framework provides implementations of compression-decompression algorithms; there are different codec (compressor/decompressor) classes for different compression formats. When you are doing data compression in Hadoop you will use one of these codecs.

Deflate- org.apache.hadoop.io.compress.DefaultCodec or org.apache.hadoop.io.compress.DeflateCodec (an alias for DefaultCodec). If you look at the code for DefaultCodec, it uses zlib compression.

Gzip- org.apache.hadoop.io.compress.GzipCodec

Bzip2- org.apache.hadoop.io.compress.BZip2Codec

LZ4- org.apache.hadoop.io.compress.Lz4Codec

Snappy- org.apache.hadoop.io.compress.SnappyCodec

Zstandard- org.apache.hadoop.io.compress.ZStandardCodec

LZO- com.hadoop.compression.lzo.LzoCodec, com.hadoop.compression.lzo.LzopCodec (for the lzop tool; this is the one you should be using).

Note that the LZO libraries are licensed differently (GPL), so they are not part of the Hadoop release. You will have to download the Hadoop codec for LZO separately.

Refer How to Use LZO Compression in Hadoop to see the required steps for using LZO compression in Hadoop.
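
As an illustration of how these codec classes are used programmatically, here is a minimal sketch that compresses a file already in HDFS using GzipCodec (the paths are hypothetical examples):

import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CompressWithCodec {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Create the codec through ReflectionUtils so it picks up the configuration
    CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

    FileSystem fs = FileSystem.get(conf);
    // Hypothetical paths - replace with your own
    Path input = new Path("/user/input/data.txt");
    Path output = new Path("/user/output/data.txt" + codec.getDefaultExtension());

    try (InputStream in = fs.open(input);
         OutputStream out = codec.createOutputStream(fs.create(output))) {
      // Copy the uncompressed bytes through the codec's compressing output stream
      IOUtils.copyBytes(in, out, 4096);
    }
  }
}

getDefaultExtension() returns the extension the codec expects (.gz here), which also keeps the file readable by the extension-based codec lookup shown earlier.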

Performance overhead with compression

Data compression in Hadoop does provide benefits in the form of less storage and less data transfer, and in most cases these benefits outweigh the overhead, but test with your own data to see what works best for you.

The overhead with data compression in Hadoop is the added processing involved in compressing the data and then decompressing it when it has to be processed. Take the case of compressing the map output: you save on space and there is less data transfer as the output of the map tasks is sent to reducer nodes, but at the same time processing increases, as there is extra work when the map output is compressed and later decompressed so that the reduce task can process it.

That's all for the topic Data Compression in Hadoop Framework. If something is missing or you have something to share about the topic, please write a comment.

