November 4, 2022

GenericOptionsParser And ToolRunner in Hadoop

When you run a MapReduce program from the command line you provide the jar name, the class that has the code, and the input and output paths in HDFS. That's the bare minimum you have to provide to run a MapReduce job. There may be other configurations that you can set within your driver class using the conf.set() method. But there is a drawback to setting configurations within the code; any configuration change requires a code change, repackaging the jar and running it again. To avoid that you can opt to provide configurations through the command line at the time of execution. For that purpose you can use the GenericOptionsParser class in Hadoop.

GenericOptionsParser class in Hadoop

GenericOptionsParser is a utility class within the org.apache.hadoop.util package. It parses the standard Hadoop command line arguments and sets them on a Configuration object which can then be used within the application.
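As a quick illustration, here is a minimal sketch that uses GenericOptionsParser directly; the class name DirectParserExample is just a placeholder for this example. It applies the generic options to a Configuration and returns the remaining application-specific arguments-

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.GenericOptionsParser;

public class DirectParserExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Parses generic options such as -D, -conf, -files and applies them
    // to conf; getRemainingArgs() returns the leftover application
    // arguments (for example the input and output paths)
    String[] remainingArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    System.out.println("Remaining args: " + String.join(" ", remainingArgs));
  }
}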

The conventional way to use GenericOptionsParser is to implement the Tool interface and then use ToolRunner to run your application. ToolRunner internally uses GenericOptionsParser to parse the generic Hadoop command line arguments and then modifies the Configuration of the Tool by setting the values passed on the command line.
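A minimal driver sketch following that pattern is shown below. The class name MyClass matches the examples later in this post, but the job details (mapper, reducer, output types) are omitted and would be filled in as in any driver; only the Tool/ToolRunner wiring is the point here-

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyClass extends Configured implements Tool {

  public static void main(String[] args) throws Exception {
    // ToolRunner parses the generic options (-conf, -D, -files, -libjars,
    // -archives ...) using GenericOptionsParser before calling run()
    int exitCode = ToolRunner.run(new MyClass(), args);
    System.exit(exitCode);
  }

  @Override
  public int run(String[] args) throws Exception {
    // getConf() already contains any configuration passed on the command line
    Job job = Job.getInstance(getConf(), "my job");
    job.setJarByClass(MyClass.class);
    // set mapper, reducer and output key/value classes here as usual
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }
}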

Supported generic options

Options that are supported by ToolRunner through GenericOptionsParser are as follows-

  • -conf <configuration file>- Specify an application configuration file. You can prepare an XML file and pass it using the -conf option; that way you can set many properties at once.
  • -D <property>=<value>- Sets a value for the given property. Specifying a property with the -D option overrides any property with the same name in the configuration file or within the driver code.
  • -fs <file:///> or <hdfs://namenode:port>- This generic option is used to specify the default filesystem URL to use. Overrides the ‘fs.defaultFS’ property from the configuration.
  • -jt <local> or <resourcemanager:port>- Used to specify the YARN ResourceManager.
  • -files <comma separated list of files>- Specify comma separated files to be copied to the MapReduce cluster. Applies only to the job. If you want to add a file to the distributed cache then rather than hardcoding it within your driver code using the job.addCacheFile() method you can specify it using the -files generic option (see the mapper sketch after this list).
  • -libjars <comma separated list of jars>- Specify comma separated jar files to include in the classpath. Applies only to the job. If you want to add a jar to the distributed cache then rather than hardcoding it within your driver using the job.addFileToClassPath() method you can specify it using the -libjars generic option.
  • -archives <comma separated list of archives>- Specify comma separated archives to be unarchived on the compute machines. Applies only to the job. If you want to add an archived file (zip, tar and tgz/tar.gz files) then rather than hardcoding it within your driver using the job.addCacheArchive() method you can specify it using the -archives generic option.
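As an illustration of the -files option mentioned above, here is a sketch of a mapper that reads a file shipped with -files /input/test.txt. The class name MyMapper and the lookup logic are only assumptions for the example; the relevant point is that a file distributed this way is available under its base name in the task's working directory-

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private final Set<String> lookup = new HashSet<>();

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // A file passed with -files /input/test.txt is symlinked into the task's
    // working directory under its base name, so plain Java I/O can open it
    try (BufferedReader reader = new BufferedReader(new FileReader("test.txt"))) {
      String line;
      while ((line = reader.readLine()) != null) {
        lookup.add(line.trim());
      }
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Emit only the records found in the lookup set loaded from the cached file
    if (lookup.contains(value.toString().trim())) {
      context.write(value, new IntWritable(1));
    }
  }
}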

Examples using generic options

1- If you want to specify a configuration file.

hadoop jar test.jar com.knpcode.MyClass -conf hadoop/conf/my-hadoop-config.xml /inputfile /outputfile

2- If you want to set the value of a configuration property, for example setting the number of reducers to 10.

hadoop jar test.jar com.knpcode.MyClass -D mapreduce.job.reduces=10 /inputfile /outputfile
Note that the mapred.reduce.tasks property is deprecated; the mapreduce.job.reduces property should be used instead.

3- Adding files and jars to the distributed cache.

hadoop jar test.jar com.knpcode.MyClass -files /input/test.txt -libjars /lib/test.jar /inputfile /outputfile

That's all for the topic GenericOptionsParser And ToolRunner in Hadoop. If something is missing or you have something to share about the topic, please write a comment.
