

Showing posts from April, 2019

Shuffle Phase in Hadoop MapReduce

In a MapReduce job, when map tasks start producing output, that output is sorted by key and transferred to the nodes where the reducers are running. This whole process is known as the shuffle phase in Hadoop MapReduce. Though the shuffle phase is internal to the Hadoop framework, there are several configuration parameters to control it, and tuning them helps your MapReduce job run efficiently. In this post we’ll see what happens during sorting and shuffling at both the mapper and the reducer end.

Shuffling and sorting at the map end

When a map task starts producing output, the output is first written to a memory buffer, which is 100 MB by default and is configured using a parameter in mapred-site.xml. Only when the memory buffer reaches a certain threshold is the map output spilled to disk; the configuration parameter for this threshold defaults to 80% of the allotted buffer size. Once this thres…
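The buffer-and-spill behavior described above can be sketched as a few lines of Python. This is an illustrative model only, not Hadoop code; it assumes the defaults mentioned in the post, a 100 MB in-memory buffer and an 80% spill threshold (in Hadoop 2.x and later these correspond to the mapreduce.task.io.sort.mb and mapreduce.map.sort.spill.percent properties).

```python
# Minimal sketch of the map-side spill decision, assuming the defaults
# described above: a 100 MB sort buffer and an 80% spill threshold.

BUFFER_MB = 100        # modeled after mapreduce.task.io.sort.mb (default 100)
SPILL_PERCENT = 0.80   # modeled after mapreduce.map.sort.spill.percent (default 0.80)

def should_spill(used_mb: float) -> bool:
    """Return True once buffered map output crosses the spill threshold."""
    return used_mb >= BUFFER_MB * SPILL_PERCENT

# With these defaults, spilling to disk begins at 80 MB of buffered output.
print(should_spill(79))  # False: still below the 80 MB threshold
print(should_spill(80))  # True: threshold reached, a spill starts
```

In the real framework the map task keeps writing into the buffer while a background thread performs the spill, which is why the threshold sits below 100% of the buffer.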

What is Hadoop

Apache Hadoop is an open source framework for storing and processing big data sets in parallel on a cluster of nodes (commodity hardware). The Hadoop framework is designed to scale up from a single server to thousands of machines, with each machine offering both storage and computation. It is also reliable and fault tolerant: the framework itself is designed to detect and handle failures at the application layer, so Hadoop provides a highly available service on top of a cluster of nodes.

Modules of Hadoop

The Hadoop framework is written in Java and includes these modules:

Hadoop Common – libraries and utilities used by the other Hadoop modules.

Hadoop Distributed File System (HDFS) – the storage part of the Hadoop framework. It is a distributed file system that works on the concept of breaking a huge file into blocks and storing those blocks on different nodes. That way HDFS provides high-throughput access to application data.

Hado…
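The block-splitting idea behind HDFS can be sketched with a little arithmetic. This is a simplified illustration, not the HDFS implementation; it assumes the 128 MB default block size used by Hadoop 2.x and later (the dfs.blocksize property).

```python
# Illustrative sketch of HDFS-style block splitting, assuming the
# 128 MB default block size of Hadoop 2.x+ (dfs.blocksize).

BLOCK_SIZE = 128 * 1024 * 1024  # bytes

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes occupies."""
    full, remainder = divmod(file_size, block_size)
    blocks = [block_size] * full
    if remainder:
        blocks.append(remainder)  # the last block holds only the leftover bytes
    return blocks

# A 300 MB file occupies two full 128 MB blocks plus one 44 MB block.
sizes = split_into_blocks(300 * 1024 * 1024)
print(len(sizes))  # 3
```

Note that the final block only occupies as much space as its actual data; each block is then replicated across nodes for fault tolerance.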