June 27, 2022

How to Read And Write Parquet File in Hadoop

In this post we’ll see how to read and write Parquet file in Hadoop using the Java API. We’ll also see how you can use MapReduce to write Parquet files in Hadoop.

Rather than using the ParquetWriter and ParquetReader directly AvroParquetWriter and AvroParquetReader are used to write and read parquet files.

AvroParquetWriter and AvroParquetReader classes will take care of conversion from Avro schema to Parquet schema and also the types.

Required Jars

To write Java programs to read and write Parquet files you will need to put following jars in classpath. You can add them as Maven dependency or copy the jars.

  • avro-1.8.2.jar
  • parquet-hadoop-bundle-1.10.0.jar
  • parquet-avro-1.10.0.jar
  • jackson-mapper-asl-1.9.13.jar
  • jackson-core-asl-1.9.13.jar
  • slf4j-api-1.7.25.jar

Java program to write parquet file

Since Avro is used so you’ll need avro schema.

schema.avsc
{
  "type":	"record",
  "name":	"testFile",
  "doc":	"test records",
  "fields": 
    [{
      "name":	"id",	
      "type":	"int"
      
    }, 
    {
      "name":	"empName",
      "type":	"string"
    }
  ]
}
Java code
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ExampleParquetWriter {	
  public static void main(String[] args) {    
    Schema schema = parseSchema();
    List<GenericData.Record> recordList = createRecords(schema);
    writeToParquetFile(recordList, schema);    
  }
	
  // Method to parse the schema
  private static Schema parseSchema() {
    Schema.Parser parser = new	Schema.Parser();
    Schema schema = null;
    try {
      // Path to schema file
      schema = parser.parse(ClassLoader.getSystemResourceAsStream("resources/schema.avsc"));      
    } catch (IOException e) {
      e.printStackTrace();			
    }
    return schema;		
  }
	
  private static List<GenericData.Record> createRecords(Schema schema){
    List<GenericData.Record> recordList = new ArrayList<>();
    for(int i = 1; i <= 10; i++) {
      GenericData.Record record = new GenericData.Record(schema);
      record.put("id", i);
      record.put("empName", i+"a");
      recordList.add(record);
    }
    return recordList;
  }
	
  private static void writeToParquetFile(List<GenericData.Record> recordList, Schema schema) {
    // Output path for Parquet file in HDFS
    Path path = new Path("/user/out/data.parquet");
    ParquetWriter<GenericData.Record> writer = null;
    // Creating ParquetWriter using builder
    try {
      writer = AvroParquetWriter.
        <GenericData.Record>builder(path)
        .withRowGroupSize(ParquetWriter.DEFAULT_BLOCK_SIZE)
        .withPageSize(ParquetWriter.DEFAULT_PAGE_SIZE)
        .withSchema(schema)
        .withConf(new Configuration())
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .withValidation(false)
        .withDictionaryEncoding(false)
        .build();
      // writing records
      for (GenericData.Record record : recordList) {
        writer.write(record);
      }      
    }catch(IOException e) {
      e.printStackTrace();
    }finally {
      if(writer != null) {
        try {
          writer.close();
        } catch (IOException e) {
          // TODO Auto-generated catch block
          e.printStackTrace();
        }
      }
    }
  }
}

Executing program in Hadoop environment

Before running this program in Hadoop environment you will need to put the above mentioned jars in HADOOP_INSTALLATION_DIR/share/hadoop/mapreduce/lib.

Also put the current version Avro-1.x.x jar in the location HADOOP_INSTALLATION_DIR/share/hadoop/common/lib if there is a version mismatch.

To execute above Java program in Hadoop environment, you will need to add the directory containing the .class file for the Java program in Hadoop’s classpath.

$ export HADOOP_CLASSPATH='/huser/eclipse-workspace/knpcode/bin'

I have my ExampleParquetWriter.class file in location /huser/eclipse-workspace/knpcode/bin so I have exported that path.

Then you can run the program using the following command-

$ hadoop org.knpcode.ExampleParquetWriter


18/06/06 12:15:35 INFO compress.CodecPool: Got brand-new compressor [.snappy]
18/06/06 12:15:35 INFO hadoop.InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 2048

Java program to read parquet file

To read the Parquet file created in HDFS using the above program you can use the following method.

  private static void readParquetFile() {
    ParquetReader reader = null;
    Path path =	new Path("/user/out/data.parquet");
    try {
      reader = AvroParquetReader
                .builder(path)
                .withConf(new Configuration())
                .build();
      GenericData.Record record;
      while ((record = reader.read()) != null) {
        System.out.println(record);
      }
    }catch(IOException e) {
      e.printStackTrace();
    }finally {
      if(reader != null) {
        try {
          reader.close();
        } catch (IOException e) {
          // TODO Auto-generated catch block
          e.printStackTrace();
        }
      }
    }
  }

$ hadoop org.knpcode.ExampleParquetWriter

18/06/06 13:33:47 INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 10 records.
18/06/06 13:33:47 INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
18/06/06 13:33:47 INFO compress.CodecPool: Got brand-new decompressor [.snappy]
18/06/06 13:33:47 INFO hadoop.InternalParquetRecordReader: block read in memory in 44 ms. row count = 10
{"id": 1, "empName": "1a"}
{"id": 2, "empName": "2a"}
{"id": 3, "empName": "3a"}
{"id": 4, "empName": "4a"}
{"id": 5, "empName": "5a"}
{"id": 6, "empName": "6a"}
{"id": 7, "empName": "7a"}
{"id": 8, "empName": "8a"}
{"id": 9, "empName": "9a"}
{"id": 10, "empName": "10a"}

Note that builder with org.apache.hadoop.fs.Path instance as argument is deprecated.

You can also use parquet-tools jar to see the content or schema of the parquet file.

Once you download the parquet-tools-1.10.0.jar to see the conent of the file you can use the following command.

$ hadoop jar /path/to/parquet-tools-1.10.0.jar cat /user/out/data.parquet

To see the schema of a parquet file.

$ hadoop jar /path/to/parquet-tools-1.10.0.jar schema /user/out/data.parquet

message testFile {
  required int32 id;
  required binary empName (UTF8);
}

MapReduce to write a Parquet file

In this example a text file is converted to a parquet file using MapReduce. Its a mapper only job so number of reducers is set to zero.

For this program a simple text file (stored in HDFS) with only two lines is used.

This is a test file.
This is a Hadoop MapReduce program file.
MapReduce Java code
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.parquet.avro.AvroParquetOutputFormat;
import org.apache.parquet.example.data.Group;

public class ParquetFile extends Configured implements Tool{
  public static void main(String[] args)  throws Exception{	
    int exitFlag = ToolRunner.run(new ParquetFile(), args);
    System.exit(exitFlag);
  }
  /// Schema
  private static final Schema AVRO_SCHEMA = new Schema.Parser().parse(
    "{\n" +
    "	\"type\":	\"record\",\n" +				
    "	\"name\":	\"testFile\",\n" +
    "	\"doc\":	\"test records\",\n" +
    "	\"fields\":\n" + 
    "	[\n" + 
    "			{\"name\": \"byteofffset\",	\"type\":	\"long\"},\n"+ 
    "			{\"name\":	\"line\", \"type\":	\"string\"}\n"+
    "	]\n"+
    "}\n");
	
  // Map function
  public static class ParquetMapper extends Mapper<LongWritable, Text, Void, GenericRecord> {
    
    private	GenericRecord record = new GenericData.Record(AVRO_SCHEMA);
    public void map(LongWritable key, Text value, Context context) 
        throws IOException, InterruptedException {
      record.put("byteofffset", key.get());
      record.put("line", value.toString());
      context.write(null, record); 
    }		
  }

  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "parquet");
    job.setJarByClass(ParquetFile.class);
    job.setMapperClass(ParquetMapper.class);    
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(Void.class);
    job.setOutputValueClass(Group.class);
    job.setOutputFormatClass(AvroParquetOutputFormat.class);
    // setting schema to be used
    AvroParquetOutputFormat.setSchema(job, AVRO_SCHEMA);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }
}
Running the MapReduce program
hadoop jar /path/to/jar org.knpcode.ParquetFile /user/input/count /user/out/parquetFile

Using parquet-tools you can see the content of the parquet file.

hadoop jar /path/to/parquet-tools-1.10.0.jar cat  /user/out/parquetFile/part-m-00000.parquet

18/06/06 17:15:04 INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2 records.
18/06/06 17:15:04 INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
18/06/06 17:15:04 INFO hadoop.InternalParquetRecordReader: block read in memory in 20 ms. row count = 2

byteofffset = 0
line = This is a test file.

byteofffset = 21
line = This is a Hadoop MapReduce program file.

MapReduce to read a Parquet file

This example shows how you can read a Parquet file using MapReduce. The example reads the parquet file written in the previous example and put it in a file.

The record in Parquet file looks as following.

byteofffset: 0
line: This is a test file.

byteofffset: 21
line: This is a Hadoop MapReduce program file.

Since only the line part is needed in the output file so you first need to split the record and then again split the value of the line column.

MapReduce Java code

import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.example.ExampleInputFormat;

public class ParquetFileRead extends Configured implements Tool{

  public static void main(String[] args)  throws Exception{
    int exitFlag = ToolRunner.run(new ParquetFileRead(), args);
    System.exit(exitFlag);
  }
  // Map function
  public static class ParquetMapper1 extends Mapper<LongWritable, Group, NullWritable, Text> {
    public static final Log log = LogFactory.getLog(ParquetMapper1.class);
    public void map(LongWritable key, Group value, Context context) 
        throws IOException, InterruptedException {
      NullWritable outKey = NullWritable.get();
      String line = value.toString();
      String[] fields = line.split("\n");
      String[] record = fields[1].split(": ");
      context.write(outKey, new Text(record[1]));           
    }		
  }
	
  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "parquet1");
    job.setJarByClass(getClass());
    job.setMapperClass(ParquetMapper1.class);    
    job.setNumReduceTasks(0);
    
    job.setMapOutputKeyClass(LongWritable.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
  
    job.setInputFormatClass(ExampleInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }
}
Running the MapReduce program
hadoop jar /path/to/jar org.knpcode.ParquetFileRead /user/out/parquetFile/part-m-00000.parquet /user/out/data
File content
$ hdfs dfs -cat /user/out/data/part-m-00000

This is a test file.
This is a Hadoop MapReduce program file.

That's all for the topic How to Read And Write Parquet File in Hadoop. If something is missing or you have something to share about the topic please write a comment.


You may also like

No comments:

Post a Comment