June 26, 2022

Sequence File Format in Hadoop

Sequence files in Hadoop are flat files that store data as serialized key/value pairs. The sequence file format is one of the binary file formats supported by Hadoop, and it integrates well with MapReduce (as well as Hive and Pig).

Some of the features of sequence files in Hadoop are as follows-

  1. Data is stored in binary form, so sequence files work well when you want to store images in HDFS or model complex data structures as (key, value) pairs.
  2. Sequence files support both compression and splitting. When you compress a sequence file, the whole file is not compressed as a single unit; instead, individual records or blocks of records are compressed within the file. Because of that, a sequence file remains splittable even if the codec used is not itself splittable, such as Snappy, LZ4 or Gzip.
  3. A sequence file can also be used as a container for a large number of small files. Hadoop works best with large files, so packing many small files into one sequence file makes processing more efficient. It also requires less NameNode memory, since the NameNode has to store metadata for one sequence file rather than for many small files.
  4. Since data is stored as (key, value) pairs, SequenceFile is also used internally to store the temporary outputs of maps.

SequenceFile Compression types

For sequence files in Hadoop there are three choices for compression.

  1. NONE- Neither keys nor values are compressed.
  2. RECORD- Only the values are compressed.
  3. BLOCK- Both keys and values are compressed. Keys and values are collected in 'blocks' separately and compressed. The size of the 'block' is configurable by setting the following property in core-site.xml.
    io.seqfile.compress.blocksize- The minimum block size for compression in block-compressed SequenceFiles. Default is 1000000 bytes (1 million bytes).
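
To see why the choice between RECORD and BLOCK matters, here is a minimal sketch that uses the JDK's Deflater in place of a Hadoop codec. It compresses a batch of small values one at a time (as RECORD does) and then all together (loosely what BLOCK does with its buffered values). The class CompressionGranularity and its methods are made up for this illustration and do not reproduce the real on-disk format.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;

// Illustrative only: JDK DEFLATE stands in for a Hadoop CompressionCodec.
final class CompressionGranularity {

    // Compresses one buffer with DEFLATE and returns the compressed size in bytes.
    static int compressedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] chunk = new byte[1024];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(chunk);
        }
        deflater.end();
        return total;
    }

    // RECORD-style: every value is compressed on its own.
    static int perRecordSize(List<byte[]> values) {
        int total = 0;
        for (byte[] value : values) {
            total += compressedSize(value);
        }
        return total;
    }

    // BLOCK-style: values are buffered together and compressed once.
    static int perBlockSize(List<byte[]> values) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        for (byte[] value : values) {
            buffer.write(value, 0, value.length);
        }
        return compressedSize(buffer.toByteArray());
    }

    // Builds some small, similar-looking sample values (hypothetical data).
    static List<byte[]> sampleValues(int count) {
        List<byte[]> values = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            values.add(("status=OK;host=node-" + i).getBytes(StandardCharsets.UTF_8));
        }
        return values;
    }
}
```

With many short, repetitive values, perBlockSize should come out well below perRecordSize, because each separately compressed value pays its own fixed overhead and cannot share redundancy with its neighbours.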

Sync points in sequence file

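In a sequence file, sync markers are recorded every few hundred bytes or so. A reader that starts in the middle of the file can skip ahead to the next sync marker and resume at a record boundary. Because of these sync points a sequence file is splittable and can be used as input to MapReduce.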
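
The idea of resynchronizing at a split boundary can be sketched without Hadoop. The code below models a file as a byte array and the sync marker as a fixed byte pattern, then finds the first marker at or after an arbitrary offset, which is where a reader assigned that split could safely begin. SyncScan and nextSync are hypothetical names, and the real reader works on streams and handles more details than this.

```java
// Illustrative only: a simplified, in-memory model of scanning for a sync marker.
final class SyncScan {

    // Returns the index of the first sync marker starting at or after 'from',
    // or -1 if no complete marker is found.
    static int nextSync(byte[] data, byte[] sync, int from) {
        for (int start = Math.max(from, 0); start + sync.length <= data.length; start++) {
            boolean match = true;
            for (int j = 0; j < sync.length; j++) {
                if (data[start + j] != sync[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return start;
            }
        }
        return -1;
    }
}
```

A split that begins at byte offset N would call nextSync(data, sync, N) and start reading records after the marker it finds; the records before that marker belong to the previous split.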

SequenceFile Formats in Hadoop

There are three different sequence file formats depending on the selected compression type. Note that the header format remains the same across all three.

SequenceFile Header format

  • Version- 3 bytes of magic header SEQ, followed by 1 byte of actual version number (e.g. SEQ4 or SEQ6)
  • KeyClassName- key class
  • ValueClassName- value class
  • Compression- A boolean which specifies if compression is turned on for keys/values in this file.
  • BlockCompression- A boolean which specifies if block-compression is turned on for keys/values in this file.
  • Compression codec- CompressionCodec class which is used for compression of keys and/or values (if compression is enabled).
  • Metadata- SequenceFile.Metadata for this file.
  • Sync- A sync marker to denote end of the header.
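
As a rough illustration of the first header field, the sketch below checks the 4-byte version prefix of a file. It uses only the JDK rather than Hadoop's own reader; the names SeqHeaderCheck and versionOf are made up for this example, and the remaining header fields are not parsed.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Hadoop: inspects only the 4-byte version field.
final class SeqHeaderCheck {

    // The magic bytes a SequenceFile starts with.
    private static final byte[] MAGIC = "SEQ".getBytes(StandardCharsets.US_ASCII);

    // Returns the version byte, or -1 if the data does not start with "SEQ".
    static int versionOf(byte[] fileStart) {
        if (fileStart == null || fileStart.length < 4) {
            return -1;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (fileStart[i] != MAGIC[i]) {
                return -1;
            }
        }
        return fileStart[3] & 0xFF; // the version is a single byte value, not an ASCII digit
    }
}
```

Passing the first four bytes of a file to versionOf gives a quick sanity check that the file is a sequence file before handing it to a real reader.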

Uncompressed SequenceFile Format

  • Header
  • Record
    • Record length
    • Key length
    • Key
    • Value
  • A sync-marker every few 100 bytes or so.
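
The record layout above can be mimicked with a ByteBuffer. This sketch writes the record length, key length, key bytes and value bytes in that order and reads them back. It assumes 4-byte big-endian integers for both lengths and that the record length covers key plus value; the key and value are raw bytes here rather than serialized Writables, and ToyRecord is a made-up class, not Hadoop code.

```java
import java.nio.ByteBuffer;

// Illustrative only: a toy encoding of one uncompressed record.
final class ToyRecord {

    // Lays out one record as [record length][key length][key][value].
    static byte[] encode(byte[] key, byte[] value) {
        ByteBuffer buf = ByteBuffer.allocate(8 + key.length + value.length);
        buf.putInt(key.length + value.length); // assumed: record length = key + value bytes
        buf.putInt(key.length);                // key length
        buf.put(key);
        buf.put(value);
        return buf.array();
    }

    // Reads a record back and returns {key, value}.
    static byte[][] decode(byte[] record) {
        ByteBuffer buf = ByteBuffer.wrap(record);
        int recordLength = buf.getInt();
        int keyLength = buf.getInt();
        byte[] key = new byte[keyLength];
        byte[] value = new byte[recordLength - keyLength]; // value length is implied
        buf.get(key);
        buf.get(value);
        return new byte[][] {key, value};
    }
}
```

Note that the value length is never stored on its own in this layout: a reader derives it by subtracting the key length from the record length.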

Record-Compressed SequenceFile Format

  • Header
  • Record
    • Record length
    • Key length
    • Key
    • Compressed Value
  • A sync-marker every few 100 bytes or so.

Block-Compressed SequenceFile Format

  • Header
  • Record Block
    • Uncompressed number of records in the block
    • Compressed key-lengths block-size
    • Compressed key-lengths block
    • Compressed keys block-size
    • Compressed keys block
    • Compressed value-lengths block-size
    • Compressed value-lengths block
    • Compressed values block-size
    • Compressed values block
  • A sync-marker every block.

SequenceFile classes

SequenceFile provides SequenceFile.Writer, SequenceFile.Reader and SequenceFile.Sorter classes for writing, reading and sorting respectively.

There are three SequenceFile Writers based on the SequenceFile.CompressionType used to compress key/value pairs:

  • Writer: Uncompressed records.
  • RecordCompressWriter: Record-compressed files, only values are compressed.
  • BlockCompressWriter: Block-compressed files, both keys and values are compressed.

The recommended way is to use the static createWriter methods provided by SequenceFile to choose the preferred format.

The SequenceFile.Reader can read any of the above SequenceFile formats.

That's all for the topic Sequence File Format in Hadoop. If something is missing or you have something to share about the topic please write a comment.

