June 27, 2022

How to Read And Write SequenceFile in Hadoop

This post shows how to read and write a SequenceFile in Hadoop using the Java API and using a MapReduce job, and how to provide compression options for a SequenceFile.

Writing a SequenceFile - Java program

SequenceFile provides a static method createWriter() to create a writer that is used to write a SequenceFile in Hadoop. There are many overloaded variants of the createWriter() method (many of them now deprecated), but the one used here is the following.

public static org.apache.hadoop.io.SequenceFile.Writer createWriter(Configuration conf, org.apache.hadoop.io.SequenceFile.Writer.Option... opts)
throws IOException

Java Code

package org.knpcode;

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.FileUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.Writer;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;

public class SFWrite {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    int i = 0;
    try {
      // input file in local file system
      File file = new File("/home/knpcode/Documents/knpcode/Hadoop/Test/data.txt");
      // Path for output file
      Path outFile = new Path(args[0]);
      IntWritable key = new IntWritable();
      Text value = new Text();
      SequenceFile.Writer writer = null;
      try {
        // Create the writer with block compression using GzipCodec
        writer = SequenceFile.createWriter(conf, Writer.file(outFile), 
            Writer.keyClass(key.getClass()), Writer.valueClass(value.getClass()), 
            Writer.compression(SequenceFile.CompressionType.BLOCK, new GzipCodec()));
        // Write each line of the local file as a (line number, line) record
        for (String line : FileUtils.readLines(file, StandardCharsets.UTF_8)) {
          key.set(i++);
          value.set(line);
          writer.append(key, value);
        }
      } finally {
        if(writer != null) {
          writer.close();
        }
      }
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}

In the program a compression option is also given; the compression codec used is GzipCodec.
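
SequenceFile supports three compression types- NONE, RECORD (only the values are compressed) and BLOCK (keys and values are compressed in blocks). To switch the type, only the compression option passed to createWriter() changes; a minimal sketch-

// Record compression - values are compressed record by record
Writer.compression(SequenceFile.CompressionType.RECORD, new GzipCodec())
// No compression at all
Writer.compression(SequenceFile.CompressionType.NONE)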

Executing the program in the Hadoop environment

To execute the above Java program in the Hadoop environment, you will need to add the directory containing the .class file for the Java program to Hadoop's classpath.

export HADOOP_CLASSPATH='/huser/eclipse-workspace/knpcode/bin'

I have my SFWrite.class file at the location /huser/eclipse-workspace/knpcode/bin so I have exported that path.
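
In case the class is not compiled yet, the hadoop classpath command can supply the jars needed for compilation; a sketch, assuming SFWrite.java is in the current directory-

$ javac -cp "$(hadoop classpath)" -d /huser/eclipse-workspace/knpcode/bin SFWrite.java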

Then you can run the program using the following command-

$ hadoop org.knpcode.SFWrite /user/output/item.seq

18/03/22 12:10:21 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
18/03/22 12:10:21 INFO compress.CodecPool: Got brand-new compressor [.gz]

Here /user/output/item.seq is the output path in HDFS.

If you try to display the file content in HDFS it will not be readable, as SequenceFile is a binary file format.
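
As a quick check, though, the -text option of the HDFS shell can detect a SequenceFile and decompress it before displaying-

$ hdfs dfs -text /user/output/item.seq

That brings us to the second part- how to read a sequence file.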

Reading a SequenceFile - Java program

To read a SequenceFile in Hadoop you need to get an instance of SequenceFile.Reader, which can read any of the SequenceFile formats created by the writer (uncompressed, record-compressed or block-compressed).

Using this reader instance you can iterate over the records with the next() method. The variant of next() used here takes both key and value as arguments of type Writable and assigns the next (key, value) pair read from the sequence file to these variables.

package org.knpcode;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.Reader;
import org.apache.hadoop.io.Text;

public class SFRead {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    try {
      Path inFile = new Path(args[0]);
      SequenceFile.Reader reader = null;
      try {
        IntWritable key = new IntWritable();
        Text value = new Text();
        reader = new SequenceFile.Reader(conf, Reader.file(inFile), Reader.bufferSize(4096));
        // Iterate over all (key, value) records in the sequence file
        while(reader.next(key, value)) {
          System.out.println("Key: " + key + " Value: " + value);
        }
      } finally {
        if(reader != null) {
          reader.close();
        }
      }
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}
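
The reader can be run the same way as the writer, with the class directory on HADOOP_CLASSPATH, passing the sequence file written earlier as the argument-

$ hadoop org.knpcode.SFRead /user/output/item.seq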

Writing a SequenceFile using a MapReduce job

You can also write a sequence file in Hadoop using a MapReduce job. That is helpful when you have a big file and want to take advantage of parallel processing.

The MapReduce job in this case is a simple one where you don't even need a reduce phase; the map tasks just write out the (key, value) pairs.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SequenceFileWriter extends Configured implements Tool {
  // Identity map function - writes each input (key, value) pair to the output unchanged
  public static class SFMapper extends Mapper<LongWritable, Text, LongWritable, Text>{
    public void map(LongWritable key, Text value, Context context) 
          throws IOException, InterruptedException {
      context.write(key, value);
    }
  }
  public static void main(String[] args)  throws Exception{
    int exitFlag = ToolRunner.run(new SequenceFileWriter(), args);
    System.exit(exitFlag);      
  }
  @Override
  public int run(String[] args) throws Exception {
    // Use the Configuration prepared by ToolRunner rather than creating a new one
    Configuration conf = getConf();
    Job job = Job.getInstance(conf, "sfwrite");
    job.setJarByClass(SequenceFileWriter.class);
    job.setMapperClass(SFMapper.class);
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
		
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    
    // Compression related settings
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
    return job.waitForCompletion(true) ? 0 : 1;
  }
}

In the MapReduce job for writing a SequenceFile, the important parts are the job settings given for the output format and the compression.
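
To run the job, package the class into a jar and pass the input and output paths; a sketch, assuming the jar is named knpcode.jar and the class is in the default package as shown-

$ hadoop jar knpcode.jar SequenceFileWriter /user/input /user/output/seq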

Reading SequenceFile using MapReduce Job

If you want to read a sequence file using a MapReduce job, the code is very similar to the code for writing one.

The main change is the input and output formats.

job.setInputFormatClass(SequenceFileInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);

Java Code

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SequenceFileReader extends Configured implements Tool {
  // Identity map function - writes each input (key, value) pair to the output unchanged
  public static class SFMapper extends Mapper<LongWritable, Text, LongWritable, Text>{
    public void map(LongWritable key, Text value, Context context) 
       throws IOException, InterruptedException {
      context.write(key, value);
    }
  }
  public static void main(String[] args)  throws Exception{
    int exitFlag = ToolRunner.run(new SequenceFileReader(), args);
    System.exit(exitFlag);      
  }
  @Override
  public int run(String[] args) throws Exception {
    // Use the Configuration prepared by ToolRunner rather than creating a new one
    Configuration conf = getConf();
    Job job = Job.getInstance(conf, "sfread");
    job.setJarByClass(SequenceFileReader.class);
    job.setMapperClass(SFMapper.class);
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
		
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    
    return job.waitForCompletion(true) ? 0 : 1;
  }
}
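
Running it mirrors the writer job, with the sequence file directory as input; since TextOutputFormat is used, the output part files are plain text with key and value separated by a tab. A sketch, assuming the same jar as before-

$ hadoop jar knpcode.jar SequenceFileReader /user/output/seq /user/output/text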

That's all for the topic How to Read And Write SequenceFile in Hadoop. If something is missing or you have something to share about the topic, please write a comment.

