Introduction to Big Data and Hadoop
“Big Data” is a concept that is crucial to driving growth for businesses, but it has also been a challenge for programmers to analyse. The solution to this challenge is a framework called “Hadoop”. The Hadoop framework overcomes the Big Data challenge with the help of two core components, the “Hadoop Distributed File System” (HDFS) and “MapReduce”, which I call the brain and the heart of Hadoop respectively.
The Hadoop ecosystem is evolving day by day. It supports both batch processing and real-time processing. Both programmers (who can code in Java, Python, Ruby, etc.) and non-programmers can leverage the Hadoop framework for Big Data analytics.
The Hadoop framework is composed of the following modules (a sketch of how they fit together in a single job follows the list):
- Hadoop Common: contains libraries and utilities needed by the other Hadoop modules.
- Hadoop Distributed File System (HDFS): a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
- Hadoop YARN: a resource-management platform responsible for managing compute resources in clusters and using them to schedule users’ applications.
- Hadoop MapReduce: a programming model for large-scale data processing.
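As a rough illustration of how these modules cooperate, the hedged sketch below (the class name PassThroughJob and the command-line paths are assumptions made for the example, not part of Hadoop) builds a job from Hadoop Common’s Configuration, reads from and writes to HDFS paths, and submits the MapReduce job to the cluster, where YARN schedules it. It uses the framework’s default identity Mapper and Reducer, so it simply passes its input records through.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative driver: wires HDFS input/output to a MapReduce job scheduled by YARN.
public class PassThroughJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();           // Hadoop Common: cluster configuration
        Job job = Job.getInstance(conf, "pass-through");    // MapReduce job definition

        job.setJarByClass(PassThroughJob.class);
        job.setMapperClass(Mapper.class);                   // default identity mapper
        job.setReducerClass(Reducer.class);                 // default identity reducer
        job.setOutputKeyClass(LongWritable.class);          // keys/values of the default TextInputFormat
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory on HDFS

        System.exit(job.waitForCompletion(true) ? 0 : 1);   // submit and wait; YARN allocates containers
    }
}
```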
All the modules in Hadoop are designed with the fundamental assumption that hardware failures are common and should therefore be handled automatically in software by the framework. Apache Hadoop’s MapReduce and HDFS components were originally derived from Google’s MapReduce and Google File System papers, respectively.
Beyond HDFS, YARN and MapReduce, the entire Apache Hadoop “platform” is now commonly considered to consist of a number of related projects as well: Apache Pig, Apache Hive, Apache HBase, and others.
For end users, though MapReduce Java code is common, any programming language can be used with “Hadoop Streaming” to implement the “map” and “reduce” parts of the user’s program. Apache Pig and Apache Hive, among other related projects, expose higher-level user interfaces, Pig Latin and HiveQL (a SQL variant) respectively. The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command-line utilities written as shell scripts.
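To make the common Java path concrete, here is a minimal, hedged word-count sketch (the class names WordCountMapper and WordCountReducer are illustrative, not part of Hadoop): the mapper emits a count of 1 for every token in a line, and the reducer sums the counts per word. The same logic could be written in another language via Hadoop Streaming, or expressed more compactly in Pig Latin or HiveQL.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every whitespace-separated token in an input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reducer: sums the counts emitted for each distinct word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```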
HDFS and MapReduce
There are two primary components at the core of Apache Hadoop 1.x: the Hadoop Distributed File System (HDFS) and the MapReduce parallel processing framework. These are both open source projects, inspired by technologies created inside Google.
Hadoop Distributed File System
The Hadoop Distributed File System is a distributed and portable file system written in Java for the Hadoop framework. A Hadoop cluster typically has a single NameNode plus a set of DataNodes, which together form the HDFS cluster. This is described as typical because not every node in the cluster needs to run a DataNode. Each DataNode serves up blocks of data over the network using a block protocol specific to HDFS. The file system uses the TCP/IP layer for communication, and clients use remote procedure calls (RPC) to communicate with the NameNode.
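To illustrate the client side of this interaction, the following hedged sketch (the class name HdfsCat and the hdfs://namenode:8020 URI are assumptions made for the example) uses the standard org.apache.hadoop.fs.FileSystem API to open a file on HDFS and copy it to standard output. The metadata lookup goes to the NameNode over RPC, while the file’s blocks are streamed from the DataNodes.

```java
import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Reads a file from HDFS and prints its contents to stdout.
public class HdfsCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];                  // e.g. hdfs://namenode:8020/user/demo/input.txt (illustrative)
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);  // client contacts the NameNode over RPC

        try (InputStream in = fs.open(new Path(uri))) {         // block data is streamed from DataNodes
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```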