

Showing posts from April, 2019

Shuffle Phase in Hadoop MapReduce

In a MapReduce job, when map tasks start producing output, that output is sorted by key and transferred to the nodes where the reducers are running. This whole process is known as the shuffle phase in Hadoop MapReduce. Though the shuffle phase is internal to the Hadoop framework, there are several configuration parameters to control it, and tuning them helps your MapReduce job run efficiently. In this post we’ll see what happens during sorting and shuffling at both the mapper and the reducer end.

Shuffling and sorting at the map end

When a map task starts producing output, the output is first written to a memory buffer, which is 100 MB by default and is configured using a parameter in mapred-site.xml. Only when the memory buffer reaches a certain threshold is the map output spilled to disk; the configuration parameter for this threshold defaults to 80% of the allotted buffer size. Once this thres…
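The buffer-and-spill behavior described above can be sketched as a few lines of Python. This is an illustrative model only, not Hadoop code; it assumes the defaults mentioned in the post, a 100 MB in-memory buffer and an 80% spill threshold (in Hadoop 2.x and later these correspond to the mapreduce.task.io.sort.mb and mapreduce.map.sort.spill.percent properties).

```python
# Minimal sketch of the map-side spill decision, assuming the defaults
# described above: a 100 MB sort buffer and an 80% spill threshold.

BUFFER_MB = 100        # modeled after mapreduce.task.io.sort.mb (default 100)
SPILL_PERCENT = 0.80   # modeled after mapreduce.map.sort.spill.percent (default 0.80)

def should_spill(used_mb: float) -> bool:
    """Return True once buffered map output crosses the spill threshold."""
    return used_mb >= BUFFER_MB * SPILL_PERCENT

# With these defaults, spilling to disk begins at 80 MB of buffered output.
print(should_spill(79))  # False: still below the 80 MB threshold
print(should_spill(80))  # True: threshold reached, a spill starts
```

In the real framework the map task keeps writing into the buffer while a background thread performs the spill, which is why the threshold sits below 100% of the buffer.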

What is Hadoop

Apache Hadoop is an open source framework for storing and processing big data sets in parallel on a cluster of nodes (commodity hardware). The Hadoop framework is designed to scale up from a single server to thousands of machines, with each machine offering both storage and computation. It is also reliable and fault tolerant: the framework itself is designed to detect and handle failures at the application layer, so Hadoop provides a highly available service on top of a cluster of nodes.

Modules of Hadoop

The Hadoop framework is written in Java and includes these modules:

Hadoop Common – libraries and utilities used by the other Hadoop modules.

Hadoop Distributed File System (HDFS) – the storage part of the Hadoop framework. It is a distributed file system that works on the concept of breaking a huge file into blocks and storing those blocks on different nodes. That way HDFS provides high-throughput access to application data.

Hado…
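The block-splitting idea behind HDFS can be sketched with a little arithmetic. This is a simplified illustration, not the HDFS implementation; it assumes the 128 MB default block size used by Hadoop 2.x and later (the dfs.blocksize property).

```python
# Illustrative sketch of HDFS-style block splitting, assuming the
# 128 MB default block size of Hadoop 2.x+ (dfs.blocksize).

BLOCK_SIZE = 128 * 1024 * 1024  # bytes

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes occupies."""
    full, remainder = divmod(file_size, block_size)
    blocks = [block_size] * full
    if remainder:
        blocks.append(remainder)  # the last block holds only the leftover bytes
    return blocks

# A 300 MB file occupies two full 128 MB blocks plus one 44 MB block.
sizes = split_into_blocks(300 * 1024 * 1024)
print(len(sizes))  # 3
```

Note that the final block only occupies as much space as its actual data; each block is then replicated across nodes for fault tolerance.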