December 29, 2023

OutputCommitter in Hadoop MapReduce

In the Hadoop framework, processing is distributed: map and reduce tasks are spawned on different nodes, each processing a part of the data. In this kind of distributed processing it is important that the framework knows when a particular task finishes, when a task needs to be aborted, and when the overall job is complete. For that purpose, like many other distributed systems, Hadoop uses a commit protocol. The class that implements it in Hadoop is OutputCommitter.

The OutputCommitter class in Hadoop describes the commit of task output for a MapReduce job. In Hadoop 2 the OutputCommitter implementation to use is obtained from the getOutputCommitter() method of the OutputFormat class; FileOutputCommitter is the default OutputCommitter. Note that OutputCommitter is an abstract class in the Hadoop framework which can be extended to provide a custom OutputCommitter implementation.
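To picture the surface of that abstract class, here is a simplified plain-Java analogue. SimpleOutputCommitter and LoggingCommitter are illustrative names, not Hadoop classes; the real Hadoop methods take JobContext / TaskAttemptContext parameters and may throw IOException, details dropped here for brevity.

```java
import java.util.ArrayList;
import java.util.List;

// A simplified plain-Java analogue of Hadoop's abstract OutputCommitter.
// Not the real Hadoop class: the actual methods take JobContext /
// TaskAttemptContext arguments and can throw IOException.
abstract class SimpleOutputCommitter {
    abstract void setupJob();                 // job initialization
    abstract void commitJob();                // job succeeded; called once
    abstract void abortJob();                 // job failed/killed; may repeat
    abstract void setupTask();                // per task-attempt setup
    abstract boolean needsTaskCommit();       // skip commit if nothing to do
    abstract void commitTask();               // promote temporary output
    abstract void abortTask();                // discard temporary output
}

// A committer that records which lifecycle methods ran, to show the
// call order for a successful task and job.
class LoggingCommitter extends SimpleOutputCommitter {
    final List<String> calls = new ArrayList<>();
    void setupJob()           { calls.add("setupJob"); }
    void commitJob()          { calls.add("commitJob"); }
    void abortJob()           { calls.add("abortJob"); }
    void setupTask()          { calls.add("setupTask"); }
    boolean needsTaskCommit() { calls.add("needsTaskCommit"); return true; }
    void commitTask()         { calls.add("commitTask"); }
    void abortTask()          { calls.add("abortTask"); }

    public static void main(String[] args) {
        LoggingCommitter c = new LoggingCommitter();
        c.setupJob();                              // application master, per job attempt
        c.setupTask();                             // task process, per task attempt
        if (c.needsTaskCommit()) c.commitTask();   // promote this attempt's output
        c.commitJob();                             // application master, exactly once
        System.out.println(c.calls);
        // prints [setupJob, setupTask, needsTaskCommit, commitTask, commitJob]
    }
}
```

A custom committer extending the real OutputCommitter would override this same set of methods.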

Tasks performed by OutputCommitter in Hadoop Map-Reduce

  1. Set up the job during initialization. For example, create the temporary output directory for the job. This is done in the setupJob() method, which is called from the application master process for the entire job. It may be called multiple times, once per job attempt.
  2. Clean up the job after job completion. For example, remove the temporary output directory once the job completes.

    Cleanup is done when either the commitJob() or the abortJob() method is called.

    • The commitJob() method is invoked for jobs with a final run state of SUCCESSFUL. It is called from the application master process for the entire job, and is guaranteed to be called only once, to maintain atomicity.
    • The abortJob() method is invoked for jobs with a final run state of JobStatus.State.FAILED or JobStatus.State.KILLED. It is called from the application master process for the entire job, and may be called multiple times.
  3. Set up the task's temporary output. This is done by invoking the setupTask() method, which is called from each individual task's process that will write output to HDFS, and only for that task. It may be called multiple times for the same task, but for different task attempts.
  4. Check whether a task needs a commit, so that the commit procedure can be skipped for a task that does not need it.
    The check is done using the needsTaskCommit() method; if it returns false, the commit phase is disabled for the task.
  5. Commit the task output. During this step the task's temporary output is promoted to the final output location.
    The method used is commitTask(). There may be multiple attempts of the same task; the Hadoop framework ensures that failed task attempts are aborted and only one task attempt is committed.
  6. Discard the task commit. If a task does not finish, the abortTask() method is called. This method may be called multiple times for the same task, but for different task attempts.
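The lifecycle above can be sketched as a plain-Java simulation of what a file-based committer does with temporary and final output directories. This is only an illustration of the idea behind FileOutputCommitter, not Hadoop code; the class name, directory names, and attempt IDs are all made up for the example, and local java.nio.file operations stand in for HDFS operations.

```java
import java.io.IOException;
import java.nio.file.*;

// Illustrative simulation of the OutputCommitter lifecycle on the local
// filesystem. Mimics the idea behind FileOutputCommitter; not Hadoop code.
public class CommitProtocolSketch {

    static final Path OUTPUT = Paths.get("job-output");
    static final Path TEMP = OUTPUT.resolve("_temporary");

    // setupJob(): create the temporary output directory for the job
    static void setupJob() throws IOException {
        Files.createDirectories(TEMP);
    }

    // setupTask(): each task attempt gets its own temporary directory
    static Path setupTask(String taskAttemptId) throws IOException {
        return Files.createDirectories(TEMP.resolve(taskAttemptId));
    }

    // needsTaskCommit(): commit only if the attempt actually produced output
    static boolean needsTaskCommit(Path attemptDir) throws IOException {
        try (DirectoryStream<Path> s = Files.newDirectoryStream(attemptDir)) {
            return s.iterator().hasNext();
        }
    }

    // commitTask(): promote the attempt's temporary output to the final location
    static void commitTask(Path attemptDir) throws IOException {
        try (DirectoryStream<Path> s = Files.newDirectoryStream(attemptDir)) {
            for (Path f : s) {
                Files.move(f, OUTPUT.resolve(f.getFileName()),
                        StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }

    // abortTask(): discard a failed or duplicate attempt's temporary output
    static void abortTask(Path attemptDir) throws IOException {
        try (DirectoryStream<Path> s = Files.newDirectoryStream(attemptDir)) {
            for (Path f : s) Files.delete(f);
        }
        Files.deleteIfExists(attemptDir);
    }

    // commitJob(): remove the temporary directory and mark the job successful
    static void commitJob() throws IOException {
        try (DirectoryStream<Path> s = Files.newDirectoryStream(TEMP)) {
            for (Path d : s) Files.deleteIfExists(d);   // empty attempt dirs
        }
        Files.deleteIfExists(TEMP);
        Path success = OUTPUT.resolve("_SUCCESS");
        Files.deleteIfExists(success);
        Files.createFile(success);
    }

    public static void main(String[] args) throws IOException {
        setupJob();
        Path good = setupTask("attempt_0");   // attempt that succeeds
        Path bad  = setupTask("attempt_1");   // speculative duplicate, aborted
        Files.write(good.resolve("part-r-00000"), "result".getBytes());
        Files.write(bad.resolve("part-r-00000"), "result".getBytes());

        if (needsTaskCommit(good)) commitTask(good);  // promote one attempt only
        abortTask(bad);                               // discard the duplicate
        commitJob();                                  // final cleanup

        System.out.println(Files.exists(OUTPUT.resolve("part-r-00000"))); // true
        System.out.println(Files.exists(TEMP));                           // false
    }
}
```

Writing each attempt into its own temporary directory is what lets the framework run speculative duplicate attempts safely: only the single committed attempt's files ever reach the final output location.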

That's all for the topic OutputCommitter in Hadoop MapReduce. If something is missing or you have something to share about the topic, please write a comment.

