What is the image that comes to your mind when you think of big data? Are they series of codes, numbers, and characters streaming across documents? Are they millions or rows data points recorded in rows and columns? Well, we can visualize it add poetry to it the way we want just for the fun of it. What do we do when we need to manage these monsters for our scientific and business purposes?
We need to we set up systems to get, organize and process data to get the required output.
Apache Hadoop is a framework to process this vast amount of distributed data. It is a layered structure that with its own approaches to getting the task done. To get a short introduction to Hadoop, please visit our article ‘A Short and Sweet Introduction to Hadoop’.
The data we collect is stored in the form of files. A file system is a structure to manage these files on a disk.
Hadoop Distributed File System is the lowest layer of the Hadoop framework. It is designed to manage the massive amount of data and to apply specific tasks on it. MapReduce is the programming logic submitted by the user that decides how data must be manipulated within the lower level file system to complete the required task.
So, what does this HDFS actually, physically look like? It consists of a cluster of systems. Imagine chunks of hardware placed inside tall steel racks. These pieces of hardware are communicating with each through routers and the racks themselves communicate with each other through uplink switches.
The way in which Hadoop 1.0 and Hadoop 2.0 are deployed is different. For better understanding, we are illustrating Hadoop 1.0 first.
When you have so such huge amounts of data coming in, and complex computations to be done on them the whole process requires a management system to function smoothly. The components of HDFS are the NameNode, the Secondary Name Node, the Data Node, the Job Tracker and the Task Tracker.
These components are divided in to master nodes and slave nodes in the management system.
The name node, secondary name node, and job tracker are master nodes and data nodes and task tracker are slave nodes. All these nodes are software that performs their own particular tasks when they are installed on commodity hardware.
The Name Node – The Manager
The name node is the manager that manages the incoming data and places them in their respective data nodes, maintains the file system tree, gets periodic updates from the data nodes regarding its status and maintains the metadata and the file system snapshot. It has a higher RAM than the rest of the hardware because a name node’s functions are RAM intensive.
When new data needs to be written, the client contacts the name node. The name node decides the size of the blocks, their names and the data nodes to which they belong to are decided by the name node. But the actual splitting of data happens in the data node. Similarly, when the data needs to be read from the system, the client contacts the name node and it directs the client to the respective data node and from then on, the communication happens between the client and the data node.
It also maintains other important information about the files like the permissions, access time, disc space quota, and metadata like the access time and the replication factor (by default each block is replicated 3 times and placed onto 3 different data nodes).
The data node has to contact the name node every three seconds to announce its state. If the name node does not get signal from the data node for more than ten minutes, it is declared dead.
So, what if this name node fails? In Hadoop 1.0, the cluster goes down and the secondary name node needs to be started manually. However, with Hadoop 2.0, the secondary name nodes are on hot standby and they will start when the name node goes down.
The Secondary Name node – The Secretary
In Hadoop 1.0 they maintain the namespaces and metadata and will have to be manually started when the name node goes down.
The editing of file namespace is done by the secondary name node and gets sent back to the name node. This snapshot of the system when it was started is stored in the name node as ‘fsImage’. Every time we change some information about the file like the name, it is recorded by the name node in logs. Every hour, the secondary name node gets this log of edits from the name node and rolls this new set of edits to form a temporary fsImage. This is sent to the name node and it is set to fsImage.
Why complicate by sending the edits and loading it again through the secondary name node? Why can’t name node do this on its own?
The reason is that name node can merge edits of the file only during a system restart and this not done very often. So, the edits just pile up and will not be merged to the fsImage until the system is restarted again. Therefore, we need an intermediary who will help the name node manage its file edits. This is called checkpointing.
The Data Node – The Associate
The client is directed to the data node by the name node during a read or write process. The client asks the name node for block size and the client machine breaks the data into blocks. Once the data is broken, the name node provides the client with the list of data nodes that have enough space to accommodate the blocks. The client writes data on the data nodes. The data node also checks with the name node before the client write request is accepted.
This data that is written once can be read many times in batches. Data is read in batches because they are stored on the hard disk in a compressed form (compression ratio of 3). The HDFS is good for very large files (minimum size 64 MB) and not good for a very large number of small files (because of the I/O latency) since decompression of files take up a lot of CPU cycle
The data node stores the block, deletes the block, replicates it according to the replication factor decided by the name node and contacts the name node every three seconds to announce its state. If the name node does not get signal from the data node for more than ten minutes, it is declared dead. The data node also sends the name node block updates every six hours.
The Job Tracker – The Administrator
In HDFS, a job is a program working on a piece of data to get results. The job is submitted to the respective nodes as jar files. The job tracker is in charge of the job scheduling and takes care of the way in which the jobs are executed.
When it gets a job (program) from the client, it talks to the name node and finds the location of the data to operate on. It then submits the job to the task tracker node near the data. The task tracker node then performs the map reduce functions on the data to get the output. The job tracker is updated upon the completion of the task.
If the task tracker fails, the job tracker is informed where it can assign the task to some other task tracker node and can also blacklist the task tracker to avoid future assignment.
The Task Tracker – The Developer
The task tracker is a node that performs the map, reduce and shuffle tasks that it receives from the form of a job. The task tracker nodes are slots within the steel rack. When the job tracker gets a job, it looks for empty slots in the data node where the data resides. If it cannot find an empty slot in the same node, it looks for an empty slot in the same rack.
The task tracker notifies its status constantly to the job tracker through heartbeat signals.