
Posts

How to Read And Write Parquet File in Hadoop

In this post we’ll see how to read and write Parquet files in Hadoop using the Java API, and also how MapReduce can be used to write Parquet files. Rather than using ParquetWriter and ParquetReader directly, AvroParquetWriter and AvroParquetReader are used to write and read the Parquet files; these classes take care of converting the Avro schema and types to their Parquet equivalents.

Table of contents: Required Jars, Java program to write Parquet file, Java program to read Parquet file, MapReduce to write a Parquet file, MapReduce to read a Parquet file.

Required Jars
To write Java programs that read and write Parquet files you will need the following jars in the classpath. You can add them as Maven dependencies or copy the jars directly.
avro-1.8.2.jar
parquet-hadoop-bundle-1.10.0.jar
parquet-avro-1.10.0.jar
jackson-mapper-asl-1.9.13.jar
jackson-core-asl-1.9.13.jar
slf4j-api-1.7.25.jar
…
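Since the excerpt cuts off before the actual code, here is a minimal sketch of the kind of write/read round trip the post describes, using AvroParquetWriter and AvroParquetReader with a GenericRecord. The schema, field names and file path below are assumptions made up for this illustration, not taken from the original post.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetWriter;

public class ParquetReadWriteSketch {
  // Hypothetical Avro schema with two fields; any valid record schema would do.
  private static final Schema SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"Employee\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"int\"},"
      + "{\"name\":\"name\",\"type\":\"string\"}]}");

  public static void main(String[] args) throws Exception {
    Path path = new Path("/tmp/employee.parquet"); // assumed location

    // Write: AvroParquetWriter converts the Avro schema/types to Parquet.
    try (ParquetWriter<GenericRecord> writer =
             AvroParquetWriter.<GenericRecord>builder(path)
                 .withSchema(SCHEMA)
                 .build()) {
      GenericRecord record = new GenericData.Record(SCHEMA);
      record.put("id", 1);
      record.put("name", "N1");
      writer.write(record);
    }

    // Read the records back.
    try (ParquetReader<GenericRecord> reader =
             AvroParquetReader.<GenericRecord>builder(path).build()) {
      GenericRecord record;
      while ((record = reader.read()) != null) {
        System.out.println(record);
      }
    }
  }
}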

Parquet File Format in Hadoop

Apache Parquet is a columnar storage format used in the Apache Hadoop ecosystem.

Table of contents: What is a column oriented format, Benefits of using a columnar storage format, Parquet file format, Parquet file format structure, Types in Parquet format, Logical types in Parquet format.

What is a column oriented format
Before going into the Parquet file format in Hadoop, let's first understand what a column oriented file format is and what benefit it provides. In a column oriented storage format, values are stored column-wise, i.e. the values of the same column across rows are stored together, rather than storing the data row by row as in a traditional row-oriented format. As an example, consider a table with three columns ID (int), NAME (varchar) and AGE (int):

ID  NAME  AGE
1   N1    35
2   N2    45
3   N3    55

In a row-wise storage format the data will be stored as follows-
1 N1 35 2 N2 45 3 N3 55

In columnar format the same data will be stored column by column-
1 2 3 N1 N2 N3 35 45 55

Avro MapReduce Example

This post shows an Avro MapReduce example program using the Avro MapReduce API. A word count MapReduce program is used as the example, where the output will be an Avro data file.

Required jars
avro-mapred-1.8.2.jar

Avro word count MapReduce example
Since the output is an Avro file, an Avro schema has to be defined; we'll have two fields in the schema, "word" and "count". In the code you can see the use of AvroKey and AvroValue for the key and value pairs. For the output the AvroKeyOutputFormat class is used, and the AvroJob class is used in the job configuration to define the map output and the output of the MapReduce job.

Avro MapReduce
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroValue;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyOutputFormat;
import org.apach…
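The excerpt ends in the middle of the import list, so here is a minimal sketch of how such an Avro word count job could be wired together with AvroKey, AvroJob and AvroKeyOutputFormat. The class names, the exact schema string and the use of NullWritable as the reducer output value are assumptions for illustration and may differ from the original post's code.

import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyOutputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AvroWordCountSketch {
  // Hypothetical schema with the two fields mentioned in the post, "word" and "count".
  public static final Schema OUTPUT_SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"WordCount\",\"fields\":["
      + "{\"name\":\"word\",\"type\":\"string\"},"
      + "{\"name\":\"count\",\"type\":\"int\"}]}");

  // Mapper emits each word with a count of 1 using plain Writable types.
  public static class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        word.set(token);
        context.write(word, ONE);
      }
    }
  }

  // Reducer sums the counts and wraps the result record in an AvroKey.
  public static class WordReducer
      extends Reducer<Text, IntWritable, AvroKey<GenericRecord>, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      GenericRecord record = new GenericData.Record(OUTPUT_SCHEMA);
      record.put("word", key.toString());
      record.put("count", sum);
      context.write(new AvroKey<>(record), NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "avro word count");
    job.setJarByClass(AvroWordCountSketch.class);
    job.setMapperClass(WordMapper.class);
    job.setReducerClass(WordReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    // AvroJob registers the Avro schema for the job output key.
    AvroJob.setOutputKeySchema(job, OUTPUT_SCHEMA);
    job.setOutputValueClass(NullWritable.class);
    job.setOutputFormatClass(AvroKeyOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}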

How to Read And Write Avro Files in Hadoop

In this post we’ll see how to read and write Avro files in Hadoop using the Java API.

Required Jars
To write Java programs that read and write Avro files you will need the following jars in the classpath. You can add them as Maven dependencies or copy the jars directly.
avro-1.8.2.jar
avro-tools-1.8.2.jar
jackson-mapper-asl-1.9.13.jar
jackson-core-asl-1.9.13.jar
slf4j-api-1.7.25.jar

Java program to write Avro file
Since Avro is used, you’ll need an Avro schema.

schema.avsc
{
  "type": "record",
  "name": "EmployeeRecord",
  "doc": "employee records",
  "fields": [
    { "name": "id", "type": "int" },
    { "name": "empName", "type": "string" },
    { "name": "age", "type": "int" }
  ]
}

Java code
import java.io.File;
import java.io.…
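The Java code is cut off in the excerpt, so here is a minimal sketch of writing and then reading an Avro data file with GenericRecord against the EmployeeRecord schema shown above. The file locations are assumptions, and the sketch writes to the local file system; the original post may target HDFS.

import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;

public class AvroFileReadWriteSketch {
  public static void main(String[] args) throws IOException {
    // Schema file and output location are assumed; adjust paths as needed.
    Schema schema = new Schema.Parser().parse(new File("schema.avsc"));
    File avroFile = new File("employee.avro");

    // Write a GenericRecord conforming to the EmployeeRecord schema.
    DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
    try (DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<>(datumWriter)) {
      fileWriter.create(schema, avroFile);
      GenericRecord emp = new GenericData.Record(schema);
      emp.put("id", 1);
      emp.put("empName", "N1");
      emp.put("age", 35);
      fileWriter.append(emp);
    }

    // Read the records back.
    DatumReader<GenericRecord> datumReader = new GenericDatumReader<>(schema);
    try (DataFileReader<GenericRecord> fileReader = new DataFileReader<>(avroFile, datumReader)) {
      GenericRecord record = null;
      while (fileReader.hasNext()) {
        record = fileReader.next(record);
        System.out.println(record);
      }
    }
  }
}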

Avro File Format in Hadoop

Apache Avro is a data serialization system native to Hadoop that is also language independent. The Apache Avro project was created by Doug Cutting, the creator of Hadoop, to increase data interoperability in Hadoop. Avro implementations are available for C, C++, C#, Java, PHP, Python, and Ruby, making it easier to interchange data among various platforms.

What is data serialization
Just to make it clear here, data serialization is a mechanism to convert data (class objects, data structures) into a stream of bytes (binary form) in order to send it across a network or store it persistently in a file or DB.

Table of contents: Avro in Hadoop, Avro file format, Schema Declaration in Avro, Primitive Types in Avro, Complex Types in Avro.

Avro in Hadoop
The main features of Avro in Hadoop are-
Avro is language independent.
It is schema based. To define the structure of Avro data, a language-independent schema is used. Avro schemas are defined using JSON, which helps in data intero…
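As a quick illustration of the schema-based approach, the sketch below builds a record schema from Avro primitive types using SchemaBuilder, mirroring the EmployeeRecord schema used elsewhere on this page. SchemaBuilder is simply a programmatic alternative to writing the JSON by hand and is not necessarily what the original post uses.

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class AvroSchemaSketch {
  public static void main(String[] args) {
    // Build a record schema out of Avro primitive types (int, string).
    Schema schema = SchemaBuilder.record("EmployeeRecord")
        .doc("employee records")
        .fields()
        .requiredInt("id")
        .requiredString("empName")
        .requiredInt("age")
        .endRecord();

    // toString(true) prints the equivalent JSON schema definition.
    System.out.println(schema.toString(true));
  }
}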

How to Read And Write SequenceFile in Hadoop

This post shows how to read and write a SequenceFile in Hadoop using the Java API and using Hadoop MapReduce, and how you can provide compression options for a SequenceFile.

Table of contents: Writing a sequence file Java program, Reading a sequence file Java program, Writing SequenceFile using MapReduce Job, Reading SequenceFile using MapReduce Job.

Writing a sequence file Java program
SequenceFile provides a static method createWriter() to create a writer which is used to write a SequenceFile in Hadoop. There are many overloaded variants of the createWriter method (many of them deprecated now), but the one used here is the following:

public static org.apache.hadoop.io.SequenceFile.Writer createWriter(Configuration conf, org.apache.hadoop.io.SequenceFile.Writer.Option... opts) throws IOException

Java Code
import java.io.File;
import java.io.IOException;
import org.apache.commons.io.FileUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Fi…
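The Java code is truncated in the excerpt, so here is a minimal sketch of writing and then reading a SequenceFile with the Option-based createWriter() variant quoted above. The key/value types (IntWritable/Text) and the file path are assumptions for illustration.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileReadWriteSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Path path = new Path("/tmp/test.seq"); // assumed location

    // Write key/value pairs using the Option-based createWriter variant.
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(path),
        SequenceFile.Writer.keyClass(IntWritable.class),
        SequenceFile.Writer.valueClass(Text.class))) {
      for (int i = 1; i <= 3; i++) {
        writer.append(new IntWritable(i), new Text("record-" + i));
      }
    }

    // Read the pairs back in insertion order.
    try (SequenceFile.Reader reader =
             new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))) {
      IntWritable key = new IntWritable();
      Text value = new Text();
      while (reader.next(key, value)) {
        System.out.println(key + "\t" + value);
      }
    }
  }
}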

Sequence File Format in Hadoop

Sequence files in Hadoop are flat files that store data in the form of serialized key/value pairs. The sequence file format is one of the binary file formats supported by Hadoop, and it integrates very well with MapReduce (also Hive and Pig). Some of the features of sequence files in Hadoop are as follows-
They store data in binary form, so they work well in scenarios where you want to store images in HDFS or model complex data structures as (key, value) pairs.
Sequence files in Hadoop support both compression and splitting. When you compress a sequence file, the whole file is not compressed as a single unit; instead the records, or blocks of records, are compressed within the sequence file. Because of that a sequence file can support splitting even if the compressor used is not splittable, like Snappy, LZ4 or Gzip. A compression example is sketched below.
A sequence file can also be used as a container for storing a large number of small files. Since Hadoop works best with large files, storing a large number of small files with in…
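To make the compression behaviour concrete, here is a minimal sketch, assuming Gzip and block-level compression, of how a compression option could be passed when creating a sequence file writer. The codec, compression type and path are illustrative choices, not prescriptions from the post.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class SequenceFileCompressionSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("/tmp/compressed.seq"); // assumed location
    // Instantiate the codec via ReflectionUtils so it picks up the configuration.
    CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

    // BLOCK compresses batches of records together; RECORD compresses each record
    // on its own. Either way the file remains splittable even with Gzip.
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(path),
        SequenceFile.Writer.keyClass(IntWritable.class),
        SequenceFile.Writer.valueClass(Text.class),
        SequenceFile.Writer.compression(CompressionType.BLOCK, codec))) {
      writer.append(new IntWritable(1), new Text("compressed record"));
    }
  }
}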