June 26, 2022

Parquet File Format in Hadoop

Apache Parquet is a columnar storage format used in the Apache Hadoop ecosystem.

What is a column oriented format

Before going into the Parquet file format in Hadoop, let's first understand what a column-oriented file format is and what benefits it provides.

In a column-oriented storage format, values are stored column-wise, i.e. all the values of one column are stored together, rather than row by row as in a traditional row-oriented format.

As an example, consider a table with 3 columns: ID (int), NAME (varchar) and AGE (int)

ID NAME AGE
1 N1 35
2 N2 45
3 N3 55

In a row-wise storage format the data is stored as follows-

1 N1 35 2 N2 45 3 N3 55

In a columnar format the same data is stored column-wise as follows-

1 2 3 N1 N2 N3 35 45 55
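The two layouts can be reproduced with a few lines of Python. This is only a sketch of the ordering of values, using the sample table above; the function names are made up for illustration and no real file format is involved.

```python
# Sample table from the example above: (ID, NAME, AGE)
rows = [
    (1, "N1", 35),
    (2, "N2", 45),
    (3, "N3", 55),
]

def row_wise(table):
    """Lay the values out one row after another."""
    return [value for row in table for value in row]

def column_wise(table):
    """Lay the values out one column after another."""
    return [value for column in zip(*table) for value in column]

print(row_wise(rows))     # [1, 'N1', 35, 2, 'N2', 45, 3, 'N3', 55]
print(column_wise(rows))  # [1, 2, 3, 'N1', 'N2', 'N3', 35, 45, 55]
```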

Benefits of using Columnar Storage format

As the layout in the above example shows, in a row-oriented format the whole row is loaded into memory even if you query only the NAME column. With a column-oriented format, a query on NAME reads only the NAME column into memory. Query performance improves because less I/O is needed to answer the same query.
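The difference in the amount of data touched can be sketched in Python by counting how many values each layout has to read to answer a query on NAME. The two in-memory stores and the function names are hypothetical and stand in for data on disk.

```python
# The sample table held in two in-memory layouts (hypothetical stand-ins for files).
row_store = [(1, "N1", 35), (2, "N2", 45), (3, "N3", 55)]
column_store = {"ID": [1, 2, 3], "NAME": ["N1", "N2", "N3"], "AGE": [35, 45, 55]}

def names_from_rows(store):
    """Return the NAME values and the number of values that had to be read."""
    values_read = 0
    names = []
    for row in store:
        values_read += len(row)  # the whole row is loaded to get one field
        names.append(row[1])
    return names, values_read

def names_from_columns(store):
    """Return the NAME values and the number of values that had to be read."""
    names = list(store["NAME"])  # only the NAME column is touched
    return names, len(names)

print(names_from_rows(row_store))        # (['N1', 'N2', 'N3'], 9)
print(names_from_columns(column_store))  # (['N1', 'N2', 'N3'], 3)
```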

You can also see from the layout that values of the same data type sit next to each other. Such data compresses better, so less storage is required.
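Run-length encoding is one simple technique that benefits from similar values sitting together. The sketch below is illustrative only: the COUNTRY column is a made-up example that is not part of the table above, and the function is not Parquet's actual encoder.

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse each run of equal adjacent values into a (value, count) pair."""
    return [(value, len(list(group))) for value, group in groupby(values)]

# Hypothetical COUNTRY column stored contiguously, as in a columnar layout.
country_column = ["IN", "IN", "IN", "US", "US", "IN"]
print(run_length_encode(country_column))  # [('IN', 3), ('US', 2), ('IN', 1)]

# The same values interleaved with IDs, as in a row-wise layout, form no runs.
row_layout = [1, "IN", 2, "IN", 3, "IN", 4, "US", 5, "US", 6, "IN"]
print(len(run_length_encode(row_layout)))  # 12
```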

Parquet file format

The Parquet file format, being a column-oriented format, brings the same benefits in terms of-

  1. Less storage
  2. Increased query performance

Apart from that, the Parquet format can also store nested structures in a column-oriented way. Other columnar formats tend to handle nested structures by flattening them and storing only the top level in columnar format.

The Parquet file format can be used with any tool in the Hadoop ecosystem, such as Hive, Impala, Pig and Spark.

Parquet file format Structure

A Parquet file consists of a header, row groups and a footer. The format is as follows-

  • Header- The header contains a 4-byte magic number "PAR1", which identifies the file as a Parquet format file.
  • Row group- A logical horizontal partitioning of the data into rows. A row group consists of a column chunk for each column in the dataset.
  • Column chunk- A chunk of the data for a particular column.
  • Page- Column chunks are divided up into pages.
  • Footer- Contains the file metadata, which includes the version of the format, the schema, extra key/value pairs and the start locations of all the column metadata. Readers are expected to first read the file metadata to find all the column chunks they are interested in. The column chunks should then be read sequentially.
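The layout above can be sketched in Python using a synthetic byte string. Two details here come from the Parquet format specification rather than from the list above: the file metadata is followed by its length as a 4-byte little-endian integer, and the file ends with the same "PAR1" magic number. The data and metadata are placeholder bytes (a real file stores Thrift-encoded metadata), and the function names are made up for illustration.

```python
import struct

MAGIC = b"PAR1"

def build_fake_file(data: bytes, metadata: bytes) -> bytes:
    """Assemble magic + data + metadata + metadata length + magic."""
    return MAGIC + data + metadata + struct.pack("<I", len(metadata)) + MAGIC

def locate_metadata(blob: bytes):
    """Return the (start, end) byte offsets of the file metadata in blob."""
    if len(blob) < 12 or blob[:4] != MAGIC or blob[-4:] != MAGIC:
        raise ValueError("magic number missing: not a Parquet-style file")
    # The 4 bytes before the trailing magic hold the metadata length.
    (metadata_len,) = struct.unpack("<I", blob[-8:-4])
    end = len(blob) - 8
    start = end - metadata_len
    if start < 4:
        raise ValueError("metadata length is larger than the file")
    return start, end

blob = build_fake_file(b"column-chunks", b"file-metadata")
start, end = locate_metadata(blob)
print(blob[start:end])  # b'file-metadata'
```

Reading from the end of the file first is what lets a reader find the column chunks it needs without scanning the data.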

Refer to How to Read And Write Parquet File in Hadoop to see how to read and write a Parquet file in Hadoop using the Java API and using MapReduce.

Types in Parquet format

The types supported by the Parquet file format are intended to be as minimal as possible, with a focus on how the types affect on-disk storage. The types are:

  • BOOLEAN: 1 bit boolean
  • INT32: 32 bit signed ints
  • INT64: 64 bit signed ints
  • INT96: 96 bit signed ints (deprecated in the current format specification)
  • FLOAT: IEEE 32-bit floating point values
  • DOUBLE: IEEE 64-bit floating point values
  • BYTE_ARRAY: arbitrarily long byte arrays.
  • FIXED_LEN_BYTE_ARRAY: fixed length byte arrays.
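To give a feel for how such types map to bytes, here is a sketch of Parquet's PLAIN encoding for a few of them, as described in the Parquet encoding specification: fixed-width little-endian values for the numeric types and a 4-byte length prefix for BYTE_ARRAY. The function names are hypothetical, and this is not how a real writer is structured.

```python
import struct

def plain_int32(value: int) -> bytes:
    return struct.pack("<i", value)   # 4 bytes, little endian

def plain_int64(value: int) -> bytes:
    return struct.pack("<q", value)   # 8 bytes, little endian

def plain_float(value: float) -> bytes:
    return struct.pack("<f", value)   # 4 bytes, IEEE 754

def plain_double(value: float) -> bytes:
    return struct.pack("<d", value)   # 8 bytes, IEEE 754

def plain_byte_array(value: bytes) -> bytes:
    # 4-byte little-endian length followed by the bytes themselves
    return struct.pack("<I", len(value)) + value

print(plain_int32(35).hex())          # 23000000
print(plain_byte_array(b"N1").hex())  # 020000004e31
```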

Logical Types in Parquet format

Logical types extend the set of types that Parquet can store by specifying how the primitive types should be interpreted. This keeps the set of primitive types to a minimum and reuses Parquet's efficient encodings.

The full list of logical types can be accessed here- https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
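As a small illustration, two logical types from that list are DATE, which annotates an INT32 holding the number of days since the Unix epoch, and STRING, which annotates a BYTE_ARRAY holding UTF-8 text. The sketch below shows the interpretation step only; the function names are made up and no Parquet library is used.

```python
from datetime import date, timedelta

UNIX_EPOCH = date(1970, 1, 1)

def decode_date(days_since_epoch: int) -> date:
    """Interpret an INT32 value as a DATE logical type."""
    return UNIX_EPOCH + timedelta(days=days_since_epoch)

def decode_string(raw: bytes) -> str:
    """Interpret a BYTE_ARRAY value as a STRING logical type."""
    return raw.decode("utf-8")

print(decode_date(19169))    # 2022-06-26
print(decode_string(b"N1"))  # N1
```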

That's all for the topic Parquet File Format in Hadoop. If something is missing or you have something to share about the topic please write a comment.

