November 3, 2022

What is Hadoop

Apache Hadoop is an open source framework for storing and processing big data sets in parallel on a cluster of nodes (commodity hardware).

The Hadoop framework is designed to scale up from a single server to thousands of machines, with each machine offering both storage and computation. It is also reliable and fault tolerant; the framework itself is designed to detect and handle failures at the application layer, so Hadoop can provide a highly available service on top of a cluster of nodes.

Modules of Hadoop

The Hadoop framework is written in Java and includes the following modules:

  1. Hadoop Common– This module contains the libraries and utilities used by the other modules.
  2. Hadoop Distributed File System (HDFS)– This is the storage part of the Hadoop framework. It is a distributed file system that works on the concept of breaking a huge file into blocks and storing those blocks on different nodes. That way HDFS provides high-throughput access to application data.
  3. Hadoop YARN (Yet Another Resource Negotiator)– This module is responsible for scheduling jobs and managing cluster resources. Refer to YARN in Hadoop to read more about YARN.
  4. Hadoop MapReduce– This is the implementation of the MapReduce programming model to process the data in parallel.
(Figure: Modules in Hadoop)
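
As a small illustration of how these modules show up in code, the sketch below uses the Configuration class from Hadoop Common to read two HDFS settings that govern how blocks are stored. It is only a sketch; it assumes the Hadoop client libraries are on the classpath, and the defaults shown are the usual out-of-the-box values.

import org.apache.hadoop.conf.Configuration;

public class ShowHdfsSettings {
  public static void main(String[] args) {
    // Hadoop Common: Configuration loads core-default.xml and core-site.xml from the classpath
    Configuration conf = new Configuration();

    // HDFS settings that control how a file is split into blocks and replicated
    long blockSize = conf.getLongBytes("dfs.blocksize", 134217728L); // 128 MB default
    int replication = conf.getInt("dfs.replication", 3);             // 3 replicas by default

    System.out.println("Block size (bytes): " + blockSize);
    System.out.println("Replication factor: " + replication);
  }
}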

Brief history of Hadoop

Hadoop was created by Doug Cutting, and it has its origins in Nutch, an open source web crawler. While Doug Cutting and Mike Cafarella were working on Nutch and trying to scale it, they came across two Google white papers, on GFS (the Google File System) and MapReduce. Using the architecture described in those papers, Nutch's developers came up with an open source implementation of a distributed file system, NDFS (Nutch Distributed File System), and of MapReduce.

It was soon realized that NDFS and MapReduce could be split out as a separate project, and that is how Hadoop started out as a sub-project. Yahoo also helped by providing resources and a team to develop the framework, improving its scalability, performance, and reliability and adding many new features. In 2008 Hadoop became a top-level Apache project rather than a sub-project, and it is now a widely used framework with its own ecosystem.

How Hadoop works

Here I'll try to explain how Hadoop works in very simple terms, without going into the complexities of what all the daemons like the NameNode or the ResourceManager do.

Once you copy a huge file into HDFS, the framework splits the file into blocks and distributes those blocks across the nodes in the cluster.
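
As a minimal sketch of that first step (assuming a running HDFS cluster, the Hadoop client libraries on the classpath, and made-up file paths), copying a local file into HDFS with the Java FileSystem API might look like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyToHdfs {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS (the NameNode address) from core-site.xml on the classpath
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // HDFS splits the file into blocks (128 MB by default) and replicates
    // each block across DataNodes in the cluster.
    fs.copyFromLocalFile(new Path("/tmp/big-input.log"), new Path("/data/big-input.log"));

    fs.close();
  }
}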

Then you write a MapReduce program containing the logic to process that data. You package your code as a jar, and that packaged code is transferred to the DataNodes where the data blocks are stored. That way your MapReduce code works on its part of the file (the HDFS block that resides on the node where the code is running) and processes the data in parallel.
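
To make that concrete, here is a minimal sketch along the lines of the classic word count example (class names and paths are illustrative). The mapper runs in parallel on the nodes holding the input blocks, and the reducer sums up the per-word counts.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: runs in parallel, one task per input split (roughly one HDFS block)
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE); // emit (word, 1) for every word in the line
      }
    }
  }

  // Reducer: sums the counts emitted for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class); // the jar that gets shipped to the DataNodes
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

You would package this class into a jar and submit it with something like hadoop jar wordcount.jar WordCount /data/input /data/output (paths are illustrative); YARN then schedules the map tasks on the nodes where the input blocks live.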

Another advantage is that rather than sending data to the code (as in traditional programming, where data is fetched from a DB server), you send the code to the data. Since the data is much larger than the code, Hadoop uses network bandwidth much more efficiently this way.

Here is a high-level diagram showing, in a simple way, how the Hadoop framework works.

(Figure: How Hadoop works)

That's all for the topic What is Hadoop. If something is missing or you have something to share about the topic, please write a comment.

