Input Split in Hadoop MapReduce

When a MapReduce job is started to process a file stored in HDFS, one of the things Hadoop does is divide the input into logical splits. These splits are known as input splits in Hadoop.

An InputSplit represents the data to be processed by an individual map task, which means the number of mappers started equals the number of input splits calculated for the job. For example, if the input data is logically divided into 8 input splits, then 8 mappers will be started to process those input splits in parallel.
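To make the relationship between splits and mappers concrete, here is a minimal sketch in plain Java (no Hadoop dependency) of how the split count for a single file could be worked out. It is loosely modeled on the default behavior of Hadoop's file-based input formats. The class name SplitCountSketch, the helper names, and the 1 GB file size are assumptions made for illustration, and details such as non-splittable compressed files are ignored.

```java
/** Rough model of how many input splits (and so map tasks) one file produces. */
final class SplitCountSketch {
    // Assumption: the last split may be up to 10% larger than the split size,
    // mirroring the slack that file-based input formats allow.
    private static final double SPLIT_SLOP = 1.1;

    /** Split size is the block size, clamped between the configured min and max. */
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Counts the splits produced for a file of the given length. */
    static int countSplits(long fileLength, long splitSize) {
        int splits = 0;
        long remaining = fileLength;
        // Carve off full-sized splits while clearly more than one split remains.
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        // Whatever is left (if anything) becomes the final split.
        if (remaining != 0) {
            splits++;
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = computeSplitSize(128 * mb, 1L, Long.MAX_VALUE);
        // A 1 GB file with 128 MB splits gives 8 splits, so 8 mappers.
        System.out.println(countSplits(1024 * mb, splitSize));
    }
}
```

With these assumed numbers the sketch yields 8 splits, matching the 8 mappers in the example above. A 130 MB file would yield a single split, because the extra 2 MB falls within the 10% slack.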

Input split is a logical division of data

An input split is just a logical division of the data; it doesn’t contain the physical data. What an input split refers to in this logical division is the records in the data. When a mapper processes an input split, it actually works on the records ((key, value) pairs) within that input split.

Within the Hadoop framework, it is the InputFormat class that splits up the input files into logical InputSplits.

It is the RecordReader class that breaks the data into key/value pairs, which are then passed as input to the Mapper.

InputFormat class in Hadoop Framework

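public abstract class InputFormat<K, V> {
  // Logically splits the input of the job; each InputSplit is handed to one mapper.
  public abstract List<InputSplit> getSplits(JobContext context) throws IOException, InterruptedException;

  // Creates the RecordReader that turns a given split into key/value pairs.
  public abstract RecordReader<K,V> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException;
}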

Input Split Vs HDFS blocks

Many people confuse HDFS blocks with input splits, since an HDFS block is also a division of the data into smaller chunks that are then stored across the cluster. Moreover, it is ultimately the data stored on the nodes that a MapReduce job processes, so what exactly is the role of an input split in Hadoop?

An HDFS block is the physical representation of the data; the actual data is stored within the Hadoop Distributed File System. An input split, on the other hand, is just a logical representation of the data. When data is split into blocks for storage in HDFS, it is simply divided into chunks of 128 MB (the default block size) with no consideration for record boundaries.

For example, if each record is 50 MB, then two records fit within a block but the third does not: only its first 28 MB fit in that block, and the remaining 22 MB are stored in the next block. If a mapper processed just a single block, it would not be able to process the third record, as it would not get the full record.
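The arithmetic behind this example can be checked with a few lines of plain Java. This is only an illustrative sketch: the class BlockLayoutSketch and its helpers are hypothetical names, and it assumes fixed-size 50 MB records written back to back from byte 0 with a 128 MB block size.

```java
/** Works out how fixed-size records fall across fixed-size blocks. */
final class BlockLayoutSketch {
    static final long MB = 1024L * 1024L;

    /** Index of the block that contains the given byte offset. */
    static long blockOf(long byteOffset, long blockSize) {
        return byteOffset / blockSize;
    }

    /** True if the record occupying [start, start + length) crosses a block boundary. */
    static boolean spansBlocks(long start, long length, long blockSize) {
        return blockOf(start, blockSize) != blockOf(start + length - 1, blockSize);
    }

    /** Number of the record's bytes stored in the block where the record starts. */
    static long bytesInStartBlock(long start, long length, long blockSize) {
        long blockEnd = (blockOf(start, blockSize) + 1) * blockSize;
        return Math.min(length, blockEnd - start);
    }

    public static void main(String[] args) {
        long blockSize = 128 * MB;
        long recordSize = 50 * MB;
        for (int i = 0; i < 3; i++) {
            long start = i * recordSize;
            long head = bytesInStartBlock(start, recordSize, blockSize);
            System.out.printf(
                "record %d: starts in block %d, %d MB there, %d MB in the next block%n",
                i + 1, blockOf(start, blockSize), head / MB, (recordSize - head) / MB);
        }
    }
}
```

The third record starts at byte offset 100 MB, so 28 MB of it land in the first block and the other 22 MB in the second, which is exactly the kind of record a block-by-block reader could not see in full.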

An input split, which is a logical representation of the data, honors logical record boundaries. Using the starting position of the split and the byte offset, it can get the complete record even if the record spans a block boundary. Thus the mapper working on the input split will be able to process all 3 records even if part of the third record is stored in another block.
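One common way to honor record boundaries, used for newline-delimited text, can be sketched in plain Java as follows. This is a simplified model, not Hadoop's actual reader: the class SplitReaderSketch, its method names, and the tiny 10-byte split size are assumptions for illustration, and the "file" is just an in-memory byte array. The idea is that a split not starting at byte 0 skips its first (possibly partial) line, while every split keeps reading past its own end to finish the last line it started.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Simplified model of reading newline-delimited records from a byte-range split. */
final class SplitReaderSketch {

    /** Returns the records that belong to the split [start, start + length). */
    static List<String> readSplit(byte[] data, int start, int length) {
        int end = Math.min(start + length, data.length);
        int pos = start;
        // A split that does not begin at byte 0 discards everything up to and
        // including the first newline: that line is read by the previous split.
        if (start != 0) {
            pos = skipLine(data, pos);
        }
        List<String> records = new ArrayList<>();
        // Read every line that starts at or before the split end, even if the
        // line itself runs past the end into the next split's byte range.
        while (pos <= end && pos < data.length) {
            int next = skipLine(data, pos);
            boolean endsWithNewline = next > pos && data[next - 1] == '\n';
            int lineEnd = endsWithNewline ? next - 1 : next;
            records.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return records;
    }

    /** Returns the position just after the next newline, or data.length if none. */
    static int skipLine(byte[] data, int pos) {
        while (pos < data.length && data[pos] != '\n') {
            pos++;
        }
        return Math.min(pos + 1, data.length);
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\ndelta\n".getBytes(StandardCharsets.UTF_8);
        int splitSize = 10; // deliberately tiny so records straddle split boundaries
        for (int start = 0; start < data.length; start += splitSize) {
            System.out.println(readSplit(data, start, splitSize));
        }
    }
}
```

Running the sketch prints [alpha, bravo], then [charlie, delta], then []. Every record is read exactly once even though "bravo" and "charlie" straddle split boundaries, and the last split ends up empty because its only line was already consumed by the split before it.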

That's all for the topic Input Split in Hadoop MapReduce. If something is missing or you have something to share about the topic please write a comment.

KnpCode

December 21, 2023