Skip to main content


Showing posts from June, 2019

How MapReduce Works in Hadoop

In the post WordCount MapReduce program we have seen how to write a MapReduce program in Java, create a jar and run it. There are a lot of things that you do to create a MapReduce job and Hadoop framework also do a lot of processing internally. In this post we’ll see in detail how MapReduce works in Hadoop internally  using the word count MapReduce program as example. What is MapReduce Hadoop MapReduce is a framework for writing applications that can process huge data in parallel, by working on small chunks of data in parallel on cluster of nodes. The framework ensures that this distributed processing happens in a reliable, fault-tolerant manner. Map and Reduce A MapReduce job in Hadoop consists of two phases- Map phase – It has a Mapper class which has a map function specified by the developer. The input and output for Map phase is a (key, value) pair. When you copy the file that has to be processed to HDFS it is split into independent chunks. Hadoop framework creates o

Hadoop MapReduce Word Count Program

Once you have installed Hadoop on your system and initial verification is done you would be looking to write your first MapReduce program. Before digging deeper into the intricacies of MapReduce programming first step is the word count MapReduce program in Hadoop which is also known as the "Hello World" of the Hadoop framework. So here is a simple Hadoop MapReduce word count program written in Java to get you started with MapReduce programming. What you need It will be good if you have any IDE like Eclipse to write the Java code. A text file which is your input file. It should be copied to HDFS. This is the file which Map task will process and produce output in (key, value) pairs. This Map task output becomes input for the Reduce task. Process These are the steps you need for executing your Word count MapReduce program in Hadoop. Start daemons by executing the start-dfs and start-yarn scripts. Create an input directory in HDFS where you will keep

What is Big Data

Big Data means a very large volume of data. Term big data is used to describe data so huge and ever growing that has gone beyond the storage and processing capabilities of traditional data management and processing tools. Some Examples Facebook which stores data about your posts, notification clicks, post likes, photos uploaded generates about 600 TB of data everyday, which means 18 Petabyte of data in a month. Reference : The NCCS (NASA Center for Climate Simulation) which focuses on climate and weather data houses around 32 petabytes of data. Size of the Climate Change data repositories alone are projected to grow to nearly 350 Petabytes by 2030. Reference : Wal-Mart handles more than a million customer transactions each hour and imports those into databases estimated to contain more than 2.5 petabytes

How to Improve Map-Reduce Performance

In this post we’ll see some of the ways to improve performance of the Map-Reduce job in Hadoop. The tips given here for improving the performance of MapReduce job are more from the MapReduce code and configuration perspective rather than cluster and hardware perspective. 1- Enabling uber mode – Like Hadoop 1 there is no JVM resuse feature in YARN Hadoop but you can enable the task to run in Uber mode, by default uber is not enabled. If uber mode is enabled ApplicationMaster can calculate that the overhead of negotiating resources with ResourceManager, communicating with NodeManagers on different nodes to launch the containers and running the tasks on those containers is much more that running MapReduce job sequentially in the same JVM, it can run a job as uber task . 2- For compression try to use native library - When using compression and decompression in Hadoop it is better to use native library as native library will outperform codec written in programming language lik

OutputCommitter in Hadoop MapReduce

In Hadoop framework distributed processing happens where map and reduce tasks are spawned on different nodes and process part of the data. In this type of distributed processing it is important to ensure that framework knows when a particular task finishes or there is a need to abort the task and when does the over all job finish. For that purpose, like many other distributed systems, Hadoop also uses commit protocol. Class that implements it in Hadoop is OutputCommitter. OutputCommitter class in Hadoop describes the commit of task output for a Map-Reduce job. In Hadoop 2 OutputCommitter implementation can be set using the getOutputCommitter() method of the OutputFormat class. FileOutputCommitter is the default OutputCommitter. Note that OutputCommitter is an abstract class in Hadoop framework which can be extended to provide OutputCommitter implementation. Tasks performed by OutputCommitter in Hadoop Map-Reduce 1-Setup the job during initialization. For example, c

GenericOptionsParser And ToolRunner in Hadoop

When you run MapReduce program from command line you provide the jar name, the class that has the code, input and output paths in HDFS. That’s the bare minimum you have to provide to run a MapReduce job. There may be other configurations that you can set with in your driver class using conf.set() method. But there is a drawback to setting configurations with in the code, any configuration change would require code change, repackaging the jar and then run it. To avoid that you can opt to provide configurations through the command line at the time of execution. For that purpose you can use GenericOptionsParser class in Hadoop . GenericOptionsParser class in Hadoop GenericOptionsParser class is a utility class with in the org.apache.hadoop.util package. This class parses the standard command line arguments and sets them on a configuration object which can then be used with in the application. The conventional way to use GenericOptionsParser class is to implement Tool interface

How to Chain MapReduce Job in Hadoop

In many scenarios you would like to create a sequence of MapReduce jobs to completely transform and process the data. This is better than putting every thing in a single MapReduce job job and making it very complex. In fact you can get your data through various sources and use a sequence of various applications too. That can be done by creating a work flow using Oozie but that is a topic for another post. In this post we’ll see how to chain MapReduce job in Hadoop using ChainMapper and ChainReducer. Table of contents ChainMapper in Hadoop ChainReducer in Hadoop Chaining MapReduce job MapReduce chaining example ChainMapper in Hadoop ChainMapper is one of the predefined MapReduce class in Hadoop. ChainMapper class allows you to use multiple Mapper classes within a single Map task . The Mapper classes are invoked in a chained fashion where the output of the first mapper becomes the input of the second, and so on until the last Mapper, the output of the las

Distributed Cache in Hadoop

In this post we’ll see what Distributed cache in Hadoop is. Table of contents What is a distributed cache Methods for adding the files in Distributed Cache How to use distributed cache Distributed cache example MapReduce code What is a distributed cache As the name suggests distributed cache in Hadoop is a cache where you can store a file (text, archives, jars etc.) which is distributed across the nodes where mappers and reducers for the MapReduce job are running. That way the cached files are localized for the running map and reduce tasks. Methods for adding the files in Distributed Cache There is a DistributedCache class with relevant methods but the whole class is deprecated in Hadoop2. You should be using the methods in Job class instead. public void addCacheFile(URI uri) - Add a file to be localized. public void addCacheArchive(URI uri) - Add archives to be localized. public void addFileToClassPath(Path file) - Adds file path to the curre

Combiner in Hadoop MapReduce

This post shows what is combiner in Hadoop MapReduce and how combiner function can be used to reduce the overall memory, I/O and network requirement of the overall MapReduce execution. Table of contents Why is combiner needed in MapReduce Combiner function in MapReduce How to specify a combiner in MapReduce job MapReduce Example using combiner Why is combiner needed in MapReduce When a MapReduce job is executed and the mappers start producing output a lot of processing happens with in the Hadoop framework knows as the shuffling and sorting phase . Map output is partitioned based on the number of reducers, those partitions are also sorted and then written to local disk. Then the data, from the nodes where maps are running, is transferred to the nodes where reducers are running. Since a single reducer will get its input from several mappers so all that data from several maps is transferred to the reducer and merged again to form the complete input for the

Mapper Only Job in Hadoop MapReduce

Generally when we think of MapReduce job in Hadoop we think of both mappers and reducers doing their share of processing. That is true for most of the cases but you can have scenarios where you want to have a mapper only job in Hadoop . When do you need map only job You may opt for a map only job in Hadoop when you do need to process the data and get the output as (key, value) pairs but don’t want to aggregate those (key, value) pairs . For example – If you are converting a text file to sequence file using MapReduce . In this case you just want to read a line from text file and write it to a sequence file so you can opt for a MapReduce with only map method. Same way if you are converting a text file to parquet file using MapReduce you can opt for a mapper only job in Hadoop. What you need to do for mapper only job For a mapper only job you need to write only map method in the code, which will do the processing. Number of reducers is set to zero. In order to set the numbe

How to See Logs And Sysouts in Hadoop MapReduce

While writing a program, in order to debug we do put some logs or system.out to display messages. In your MapReduce program also you can use logger or sysouts for debugging purposes. In this post we’ll see how you can access those logs or system.out.print messages in Hadoop MR2. How to see log messages in MapReduce2 First thing of course is to put logs in your code. Then at the time of running your MapReduce job you can note the application_id of the job from the console. Once you run your MapReduce job you will get a line as following displayed on the console showing the application id. 18/06/13 15:20:59 INFO impl.YarnClientImpl: Submitted application application_1528883210739_0001 With the same application_id a folder will be created in the location HADOOP_INSTALLATION_DIR/logs/userlogs/ there you will find folders having logs for your mappers and reducers. In those folders you can check stdout file for any system.out.print and syslog for log messages. Example MapReduce