June 29, 2022

Predefined Mapper and Reducer Classes in Hadoop

Within the Hadoop framework there are some predefined Mapper and Reducer classes which can be used as is in the appropriate scenarios. That way you are not required to write a mapper or reducer for those scenarios; you can use the ready-made classes instead.

Let's see some of the predefined Mapper and Reducer classes in Hadoop.

Predefined Mapper classes in Hadoop

  1. InverseMapper- This predefined mapper swaps keys and values: the input (key, value) pair is reversed, so the key becomes the value and the value becomes the key in the output (key, value) pair.
  2. TokenCounterMapper- This mapper tokenizes the input values and emits each word with a count of 1, so the mapper you write for a word count MapReduce program can be replaced by this inbuilt mapper. See the example word count program using TokenCounterMapper and IntSumReducer later in this post.
  3. MultithreadedMapper- This is the multi-threaded implementation of Mapper, which runs several map threads within a single map task. Mapper implementations used with MultithreadedMapper must be thread-safe.
  4. ChainMapper- The ChainMapper class allows multiple Mapper classes to be used within a single map task. The Mapper classes are invoked in a chained fashion: the output of the first mapper becomes the input of the second, and so on until the last Mapper, whose output is written to the task's output.

    Refer to How to Chain MapReduce Job in Hadoop to see an example of a chained mapper and chained reducer along with InverseMapper.

  5. FieldSelectionMapper- This class implements a mapper that can be used to perform field selections in a manner similar to Unix cut. The input data is treated as fields separated by a user-specified separator. The user can specify a list of fields that form the map output keys, and a list of fields that form the map output values. See the example using FieldSelectionMapper later in this post.
  6. RegexMapper- This predefined Mapper class extracts text from the input that matches a regular expression. A minimal driver sketch using it follows this list.
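As a quick illustration of RegexMapper, here is a minimal, grep-like driver sketch. RegexMapper emits each regex match with a count of 1, so pairing it with the predefined LongSumReducer counts how often each matching string occurs. The class name GrepCount and the pattern "ERROR" are placeholders invented for this sketch; RegexMapper.PATTERN is the new-API configuration constant, so check the Javadoc of your Hadoop version.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.RegexMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer;

public class GrepCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Regular expression to match; "ERROR" is just a placeholder pattern
    conf.set(RegexMapper.PATTERN, "ERROR");
    Job job = Job.getInstance(conf, "grep count");
    job.setJarByClass(GrepCount.class);
    // RegexMapper emits (matchedText, 1) for every match in the input
    job.setMapperClass(RegexMapper.class);
    // LongSumReducer sums the 1s, giving a total count per matched string
    job.setReducerClass(LongSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}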

Predefined Reducer classes in Hadoop

  1. IntSumReducer- This predefined Reducer class sums the integer values associated with each key.
  2. LongSumReducer- This predefined Reducer class sums the long values associated with each key.
  3. FieldSelectionReducer- This class implements a reducer that can be used to perform field selections in a manner similar to Unix cut. The input data is treated as fields separated by a user-specified separator. The user can specify a list of fields that form the reduce output keys, and a list of fields that form the reduce output values. The fields are the union of those from the key and those from the value.
  4. ChainReducer- The ChainReducer class allows multiple Mapper classes to be chained after a Reducer within the reduce task. For each record output by the Reducer, the Mapper classes are invoked in a chained fashion: the output of the reducer becomes the input of the first mapper, the output of the first mapper becomes the input of the second, and so on until the last Mapper, whose output is written to the task's output. See the sketch after this list.
  5. WrappedReducer- A Reducer which wraps a given one to allow for custom Reducer.Context implementations. This Reducer is useful if you want to provide your own implementation of the Context interface.
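To show how the chaining classes fit together, here is a minimal, untested driver sketch (the class name ChainExample is invented for this sketch). It tokenizes input with TokenCounterMapper, sums the counts with IntSumReducer, and then runs InverseMapper after the reducer so the final output is (count, word) instead of (word, count).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.InverseMapper;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class ChainExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "chain example");
    job.setJarByClass(ChainExample.class);
    // Map task: tokenize each line into (word, 1) pairs
    ChainMapper.addMapper(job, TokenCounterMapper.class,
        Object.class, Text.class, Text.class, IntWritable.class,
        new Configuration(false));
    // Reduce task: IntSumReducer runs first, summing the counts per word...
    ChainReducer.setReducer(job, IntSumReducer.class,
        Text.class, IntWritable.class, Text.class, IntWritable.class,
        new Configuration(false));
    // ...then InverseMapper swaps each (word, totalCount) to (totalCount, word)
    // before the pair is written to the task's output
    ChainReducer.addMapper(job, InverseMapper.class,
        Text.class, IntWritable.class, IntWritable.class, Text.class,
        new Configuration(false));
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Note that each chained stage gets its own private Configuration (the new Configuration(false) arguments), which is how ChainMapper and ChainReducer keep per-stage settings isolated from each other.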

Examples using predefined Mapper and Reducer classes

Here are some examples using predefined Mapper and Reducer classes.

Using FieldSelectionMapper

In this example the input data is tab separated and you want to extract field 0 as the key and field 1 as the value. In this scenario you can use FieldSelectionMapper rather than writing your own mapper; the reducer then just finds the maximum value for each key.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.fieldsel.FieldSelectionHelper;
import org.apache.hadoop.mapreduce.lib.fieldsel.FieldSelectionMapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class StockPrice extends Configured implements Tool{
  // Reduce function - emits the maximum value seen for each key
  public static class MaxStockPriceReducer extends Reducer<Text, Text, Text, IntWritable>{

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      System.out.println("key -- " + key.toString());
      int maxValue = Integer.MIN_VALUE;
      for (Text val : values) {
        System.out.println("Value -- " + val);
        if(val != null && !val.toString().equals("")) {
          maxValue = Math.max(maxValue, Integer.parseInt(val.toString()));
        }
      }
      System.out.println("maxValue -- " + maxValue);
      context.write(key, new IntWritable(maxValue));
    }
  }
	
	
  public static void main(String[] args) throws Exception {
    int exitFlag = ToolRunner.run(new StockPrice(), args);
    System.exit(exitFlag);
  }
	
  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // setting the separator used in the input data (a tab here);
    // DATA_FIELD_SEPERATOR is the constant's actual (misspelled) name in Hadoop
    conf.set(FieldSelectionHelper.DATA_FIELD_SEPERATOR, "\t");
    // key-value spec "0:1": fields before the ':' form the key (field 0),
    // fields after it form the value (field 1)
    conf.set(FieldSelectionHelper.MAP_OUTPUT_KEY_VALUE_SPEC, "0:1");
    Job job = Job.getInstance(conf, "Stock price");
    job.setJarByClass(getClass());
    // setting the predefined mapper
    job.setMapperClass(FieldSelectionMapper.class);    

    job.setReducerClass(MaxStockPriceReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }
}
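To make the field selection concrete, consider a hypothetical tab-separated input like the following (the symbols and prices are made up for illustration):

AAPL	145
MSFT	210
AAPL	152
MSFT	205

With the "0:1" spec, FieldSelectionMapper emits (AAPL, 145), (MSFT, 210), (AAPL, 152) and (MSFT, 205), and MaxStockPriceReducer then writes the maximum value for each key:

AAPL	152
MSFT	210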

Using TokenCounterMapper and IntSumReducer to write a word count MapReduce program

In the post Word Count MapReduce Program in Hadoop we have seen a word count MapReduce program where the map and reduce functions are written within the program, but you can also write a word count MapReduce program using predefined Mapper and Reducer classes, where you just need to specify the classes TokenCounterMapper (predefined Mapper class) and IntSumReducer (predefined Reducer class).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SimpleWordCount extends Configured implements Tool{

  public static void main(String[] args) throws Exception{
    int exitFlag = ToolRunner.run(new SimpleWordCount(), args);
    System.exit(exitFlag);
  }

  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "WC");
    job.setJarByClass(getClass());
    // Setting the predefined mapper and reducer
    job.setMapperClass(TokenCounterMapper.class);    
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }
}
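Assuming the compiled classes are packaged into a jar named wordcount.jar (a placeholder name), the job can be launched the usual way:

hadoop jar wordcount.jar SimpleWordCount /input/path /output/path

Here /input/path and /output/path are HDFS paths, and the output directory must not already exist when the job is submitted.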

That's all for the topic Predefined Mapper and Reducer Classes in Hadoop. If something is missing or you have something to share about the topic, please write a comment.

