Big Data and Hadoop Interview Questions
These Big Data and Hadoop Interview Questions and Answers have been created by MindsMapped to get you acquainted with the types of questions you may be asked during your Hadoop interview. This list of questions and answers is enough for you to face different levels of Hadoop interviews with confidence. The list of questions and answers will be updated from time to time.
Frequently Asked Big Data and Hadoop Interview Questions And Answers
Basic Hadoop Interview Questions
1) What is Hadoop?
Hadoop is a distributed computing platform written in Java. It provides features such as a Distributed File System (HDFS) and MapReduce processing.
2) What platform and Java version is required to run Hadoop?
Java 1.6.x or a higher version is required for Hadoop, preferably from Sun/Oracle. Linux and Windows are the supported operating systems for Hadoop, but BSD, Mac OS X and Solaris are also known to work.
3) What kind of Hardware is best for Hadoop?
Hadoop can run on dual-processor/dual-core machines with 4-8 GB of RAM using ECC memory. The exact hardware depends on the workflow needs.
4) What are the most common input formats defined in Hadoop?
These are the most common input formats defined in Hadoop:
- TextInputFormat
- KeyValueInputFormat
- SequenceFileInputFormat
TextInputFormat is the default input format.
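A minimal driver sketch (using the newer org.apache.hadoop.mapreduce API; the class name and paths below are only placeholders) showing how an input format other than the default can be set on a job:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InputFormatExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "input-format-example");
        job.setJarByClass(InputFormatExample.class);

        // TextInputFormat is used when nothing is set; KeyValueTextInputFormat
        // instead splits each line into a key and a value at the first tab character.
        job.setInputFormatClass(KeyValueTextInputFormat.class);

        // With no mapper/reducer set, the identity Mapper and Reducer are used,
        // so the job simply passes the (Text, Text) pairs through.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```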
5) What is Input Block in Hadoop? Explain.
When a Hadoop job runs, it splits input files into chunks and assigns each split to a mapper for processing. Such a chunk is called an input split (or input block).
6) How many input splits are made by the Hadoop framework?
The default block size is 64 MB, according to which Hadoop will make 5 splits as follows:
- One split for a 64 KB file
- Two splits for a 65 MB file, and
- Two splits for a 127 MB file
The block size is configurable.
7) What is the use of RecordReader in Hadoop?
An input split defines a unit of work but does not describe how to access it. The RecordReader class is responsible for loading the data from its source and converting it into key/value pairs suitable for reading by the Mapper. The RecordReader instance is defined by the input format.
8) What is JobTracker in Hadoop?
JobTracker is a service within Hadoop that monitors and assigns Map tasks and Reduce tasks to the corresponding TaskTrackers on the data nodes.
9) What are the functionalities of JobTracker?
These are the main tasks of JobTracker:
- To accept jobs from clients.
- To communicate with the NameNode to determine the location of the data.
- To locate TaskTracker nodes with available slots.
- To submit the work to the chosen TaskTracker nodes and monitor the progress of each task.
10) Define TaskTracker.
TaskTracker is a node in the cluster that accepts tasks such as Map, Reduce and Shuffle operations from a JobTracker.
11) What is Map/Reduce job in Hadoop?
MapReduce is a programming paradigm that allows massive scalability across thousands of servers.
MapReduce actually refers to two distinct tasks that Hadoop performs. In the first step, the Map job takes a set of data and converts it into another set of data in which individual elements are broken down into key/value pairs. In the second step, the Reduce job takes the output from the map as input and combines those data tuples into a smaller set of tuples.
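As an illustration of the two phases, here is a sketch of the classic word-count example, with a Mapper that emits (word, 1) pairs and a Reducer that sums them. It assumes the newer org.apache.hadoop.mapreduce API; the class names are our own:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: turn each input line into (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
```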
12) What is Hadoop Streaming?
Hadoop Streaming is a utility that allows you to create and run MapReduce jobs. It is a generic API that allows programs written in any language to be used as a Hadoop mapper or reducer.
13) What is a combiner in Hadoop?
A Combiner is a mini-reduce process that operates only on data generated by a Mapper. When the Mapper emits data, the combiner receives it as input and sends its output to the reducer.
14) Is it necessary to know java to learn Hadoop?
A background in any programming language like C, C++, PHP, Python or Java is really helpful, but if you know no Java at all, it is necessary to learn Java and also to gain basic knowledge of SQL.
15) How to debug Hadoop code?
There are many ways to debug Hadoop code, but the most popular methods are:
- By using counters.
- By using the web interface provided by the Hadoop framework.
16) Is it possible to provide multiple inputs to Hadoop? If yes, explain.
Yes, it is possible. The input format class provides methods to add multiple directories as input to a Hadoop job.
17) What is the relation between job and task in Hadoop?
In Hadoop, a job is divided into multiple small parts known as tasks.
18) What is distributed cache in Hadoop?
Distributed cache is a facility provided by MapReduce Framework. It is provided to cache files (text, archives etc.) at the time of execution of the job. The Framework copies the necessary files to the slave node before the execution of any task at that node.
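A hedged sketch of how a file might be placed in the distributed cache and read back inside a task, assuming the Hadoop 2 Job.addCacheFile()/getCacheFiles() API; the HDFS path, symlink name and file layout are made up for this example:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheExample {

    public static class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Map<String, String> lookup = new HashMap<>();

        @Override
        protected void setup(Context context) throws IOException {
            // The cached file is localized on the node; "#lookup.txt" (added below)
            // creates a symlink of that name in the task's working directory.
            URI[] cacheFiles = context.getCacheFiles();
            if (cacheFiles != null && cacheFiles.length > 0) {
                try (BufferedReader reader = new BufferedReader(new FileReader("lookup.txt"))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        // Assumed line format (hypothetical): code<TAB>name
                        String[] parts = line.split("\t", 2);
                        if (parts.length == 2) {
                            lookup.put(parts[0], parts[1]);
                        }
                    }
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Enrich each record using the side data loaded in setup().
            String code = value.toString().trim();
            context.write(new Text(code), new Text(lookup.getOrDefault(code, "UNKNOWN")));
        }
    }

    public static void configureSideData(Job job) throws Exception {
        // The HDFS path is a placeholder; the fragment names the local symlink.
        job.addCacheFile(new URI("/user/hadoop/lookup/lookup.txt#lookup.txt"));
    }
}
```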
19) What commands are used to see all jobs running in the Hadoop cluster and kill a job in LINUX?
- hadoop job -list
- hadoop job -kill <jobID>
20) What is the functionality of JobTracker in Hadoop? How many instances of a JobTracker run on Hadoop cluster?
JobTracker is a daemon service used to submit and track MapReduce jobs in Hadoop. Only one JobTracker process runs on any Hadoop cluster, and it runs within its own JVM process.
Functionalities of JobTracker in Hadoop:
- When client application submits jobs to the JobTracker, the JobTracker talks to the NameNode to find the location of the data.
- It locates TaskTracker nodes with available slots at or near the data.
- It assigns the work to the chosen TaskTracker nodes.
- The TaskTracker nodes are responsible for notifying the JobTracker when a task fails; the JobTracker then decides what to do. It may resubmit the task on another node, or it may mark that task as one to avoid.
21) How JobTracker assign tasks to the TaskTracker?
The TaskTracker periodically sends heartbeat messages to the JobTracker to confirm that it is alive. These messages also inform the JobTracker about the number of available slots, so the JobTracker knows where tasks can be scheduled.
22) Is it necessary to write jobs for Hadoop in Java language?
No, there are several ways to deal with non-Java code. Hadoop Streaming allows any executable or shell command to be used as a map or reduce function.
Hive Interview Questions and Answers
23) What is Apache Hive?
Apache Hive is data warehouse software that facilitates managing and querying large data sets stored in distributed storage. Hive also permits traditional MapReduce programmers to plug in custom mappers and reducers when it is inefficient to express the logic in HiveQL.
24) How Facebook Uses Hadoop, Hive and HBase?
Facebook data is stored on HDFS; numerous photos are uploaded to Facebook's servers daily, while Facebook Messages, Likes and status updates run on top of HBase. Hive generates reports for third-party developers and advertisers who need to measure the success of their campaigns or applications.
25) What is the difference between HBase and Hive?
Apache Hive is a data warehouse infrastructure built on top of Hadoop. Hive permits querying data stored on HDFS for analysis via HQL, an SQL-like language that gets converted into MapReduce jobs. Despite providing SQL functionality, Hive does not offer interactive querying – it only executes batch processes on Apache Hadoop.
Apache HBase is a NoSQL key/value store that runs on top of the Hadoop Distributed File System. HBase operations run in real-time on the database instead of as MapReduce jobs. HBase data is organized into tables, and tables are further divided into column families, which must be declared in the schema.
26) What is Hive Metastore?
Hive Metastore is a database that stores the metadata of your Hive tables, including the table name, column names, data types, table location, number of buckets, etc.
27) Which Hadoop versions does the new Hive version support?
The latest version of Hive is 2.0, which runs on Hadoop 2.x.
28) Which companies are mostly using Hive?
Facebook and Netflix
29) Whenever I run a Hive query from a different directory, it creates a new metastore_db. Please explain the reason for it.
Whenever you execute Apache Hive in embedded mode, it creates a local metastore in the current directory. Before creating the metastore, Hive checks whether one already exists. This behaviour is controlled by the property “javax.jdo.option.ConnectionURL” in the configuration file hive-site.xml, with the default value “jdbc:derby:;databaseName=metastore_db;create=true”.
30) Is it possible to use same metastore by multiple users, in case of embedded Hive?
No, the metastore cannot be used in sharing mode in embedded Hive. It is suggested to use a standalone “real” database such as MySQL or PostgreSQL.
31) What is the usage of Query Processor in Apache Hive?
The Query Processor implements the processing framework for translating SQL (HiveQL) into a graph of MapReduce jobs.
32) Is multi line comment supported in HIVE Script?
NO
33) What is a Hive Metastore?
It is a central repository that saves metadata in external database.
34) Explain about the SMB Join in Hive.
In a Sort Merge Bucket (SMB) join in Hive, each mapper reads a bucket from the first table and the corresponding bucket from the second table, and then a merge-sort join is performed. SMB join is mainly used because there is no limit on partition, file or table size for the join, and it works best when the tables are very large. In an SMB join the columns are sorted and bucketed on the join columns, and all tables must have the same number of buckets.
35) Explain about the different types of join in Hive.
HiveQL has 4 different types of joins –
- JOIN– Similar to an Inner Join in SQL; it returns the records that have matching values in both tables.
- FULL OUTER JOIN– It combines the records of both the right and left outer tables that fulfil the join condition.
- RIGHT OUTER JOIN– All of the rows from the right table are returned even if there is no match in the left table.
- LEFT OUTER JOIN– All of the rows from the left table are returned even if there is no match in the right table.
36) What is ObjectInspector usage?
ObjectInspector is utilized to analyze the internal structure of the row objects and the structure of individual columns. ObjectInspector in Hive allows access to complex objects that can be saved in various formats.
37) Is it possible to change the default location of Managed Tables in Hive, if so how?
Yes, you can alter the default location of managed tables by utilizing the LOCATION keyword while creating the managed table. The user has to specify the path of the managed table as the value to the LOCATION keyword.
38) How can you connect an application, if you run Hive as a server?
When you run Hive as a server, an application can connect in one of three ways (a JDBC example is sketched after the list):
- ODBC Driver-This supports the ODBC protocol
- JDBC Driver-This supports the JDBC protocol
- Thrift Client-Thrift client can be used to make calls to all Hive commands from programming languages such as PHP, Python, Java, C++ and Ruby.
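For example, a small JDBC client (a sketch only: org.apache.hive.jdbc.HiveDriver and the jdbc:hive2:// URL are the HiveServer2 conventions, while the host, port, database and credentials are placeholders, and the hive-jdbc jar must be on the classpath):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Load the HiveServer2 JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // Run a simple HiveQL statement and print the result set.
            try (ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
}
```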
39) Which classes are used by Hive to read and write HDFS files?
Hive uses following classes to perform read and write operations:
TextInputFormat/HiveIgnoreKeyTextOutputFormat: These classes read and write data in plain text file format.
SequenceFileInputFormat/SequenceFileOutputFormat: These classes read and write data in Hadoop's SequenceFile format.
40) What are the types of tables in Apache Hive?
There are two types of tables in Apache Hive
- Managed tables.
- External tables.
41) Is it possible to create multiple tables in Hive for the same data?
Yes
42) What kind of Data Warehouse application is suitable for Hive?
Apache Hive is not a full database. The design limitations of Hadoop and HDFS impose limits on what Hive can perform. Apache Hive is well built for data warehouse applications, where
- Relatively static data is analyzed,
- Fast response times are not required, and
- The data is not changing rapidly.
Hive does not provide the crucial properties required for OLTP (Online Transaction Processing). Hive is well suited for data warehouse applications, where a large data set is processed for insights, reports, etc.
43) What is the maximum size of string data type supported by Hive?
Maximum size is 2 GB.
MapReduce Interview Questions
44) What is MapReduce in Hadoop?
MapReduce is a framework for processing huge raw data sets using a large number of computers. It processes the raw data in two phases, the Map phase and the Reduce phase. The MapReduce programming model scales easily to large data sets and is integrated with HDFS so that processing is distributed across the data nodes of a cluster.
45) What is YARN?
Yet Another Resource Negotiator (YARN) is the next-generation MapReduce, also known as MapReduce 2 or MRv2. It was introduced in the Hadoop 0.23 release to overcome the scalability issues of the classic MapReduce framework by splitting the functionality of the JobTracker into a separate Resource Manager and per-application Application Master.
46) What is data serialization?
Serialization is the process of converting object data into a byte stream for transmission over the network across the different nodes in a cluster, or for persistent data storage.
47) What is deserialization of data?
Deserialization is the inverse process of serialization; it converts byte-stream data back into object data, for example when reading data from HDFS. Apache Hadoop provides the Writable interface for serialization and deserialization purposes.
48) What are the key/value pairs in the MapReduce framework?
The MapReduce framework implements a data model in which data is represented as key/value pairs. Both the input and the output data of the MapReduce framework must be in the form of key/value pairs.
49) What are the constraints to Value and Key classes in MapReduce?
Any data type used for a value field in a mapper or reducer must implement org.apache.hadoop.io.Writable so that the field can be serialized and deserialized. Key fields must additionally be comparable with each other, so they must implement Hadoop's org.apache.hadoop.io.WritableComparable interface, which in turn extends the Writable interface.
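A sketch of a custom key type implementing WritableComparable; the field names and the ordering are invented for illustration:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// A custom key must (de)serialize itself and define an ordering so the
// framework can sort keys during the shuffle.
public class YearTemperaturePair implements WritableComparable<YearTemperaturePair> {
    private int year;
    private int temperature;

    public YearTemperaturePair() {}              // required no-arg constructor

    public void set(int year, int temperature) {
        this.year = year;
        this.temperature = temperature;
    }

    @Override
    public void write(DataOutput out) throws IOException {    // serialization
        out.writeInt(year);
        out.writeInt(temperature);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // deserialization
        year = in.readInt();
        temperature = in.readInt();
    }

    @Override
    public int compareTo(YearTemperaturePair other) {
        int cmp = Integer.compare(year, other.year);
        return cmp != 0 ? cmp : Integer.compare(temperature, other.temperature);
    }

    @Override
    public int hashCode() {                      // used by the default HashPartitioner
        return 31 * year + temperature;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof YearTemperaturePair)) return false;
        YearTemperaturePair p = (YearTemperaturePair) o;
        return year == p.year && temperature == p.temperature;
    }
}
```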
50) What are the main components of MapReduce Job?
The key components of MapReduce are Main driver class, Mapper class and Reducer class.
51) What are the key configuration parameters that the user needs to specify to run a MapReduce job?
The user of the MapReduce framework needs to specify the following (a minimal driver illustrating these settings is sketched after this list):
- Job’s output location in the distributed file system.
- Job’s input location(s) in the distributed file system.
- Input format.
- Output format.
- Class containing the map function.
- Class containing the reduce function (optional).
- JAR file containing the mapper, reducer and driver classes.
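A minimal driver sketch covering these parameters, reusing the hypothetical WordCount mapper and reducer shown under question 11 above (the reducer also doubles as the combiner discussed in question 13):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");

        // JAR containing the driver, mapper and reducer classes.
        job.setJarByClass(WordCountDriver.class);

        // Classes containing the map and (optional) reduce functions;
        // the reducer also serves as the combiner here.
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);

        // Input and output formats.
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // Output key/value types.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output locations in the distributed file system.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```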
52) What are the key components of Job flow in YARN architecture?
MapReduce job flow on YARN involves the below components:
- A Client node, which submits the MapReduce job.
- YARN Node Managers, which launch and monitor the tasks of jobs.
- MapReduce Application Master, which coordinates the tasks running in the MapReduce job.
- YARN Resource Manager, which allocates the cluster resources to jobs.
- HDFS file system is used for sharing job files between the above entities.
53) What is the importance of Application Master in YARN architecture?
It helps in negotiating resources from the Resource Manager and works with the Node Manager(s) to run and monitor the tasks. The Application Master requests containers for all map and reduce tasks; as containers are assigned to tasks, it starts the containers by contacting their Node Managers. It collects progress information from all the tasks, and these values are propagated to the user or client node.
54) What is identity Mapper Apache Hadoop?
It is the default Mapper class provided by Apache Hadoop. The Identity Mapper does not process or manipulate the input data in any way; it simply writes the input data to the output. Its class name is org.apache.hadoop.mapred.lib.IdentityMapper.
55) What is identity Reducer in Apache Hadoop?
The Identity Reducer just passes the input key/value pairs on to the output. Its class name is org.apache.hadoop.mapred.lib.IdentityReducer. When no reducer class is specified in a MapReduce job, this class is picked up automatically by the job.
56) What is chain Mapper?
It is a special implementation of the Mapper class through which a number of mapper classes can be executed in a chained fashion within a single map task. The Chain Mapper class name is org.apache.hadoop.mapreduce.lib.chain.ChainMapper.
57) What is chain reducer?
It is similar to the Chain Mapper class; it allows a single reducer followed by a number of mappers to be executed within a single reduce task. The Chain Reducer class name is org.apache.hadoop.mapreduce.lib.chain.ChainReducer.
58) How to mention multiple mappers and reducer classes in Chain Reducer or Chain Mapper classes?
In Chain Mapper, the ChainMapper.addMapper() method is used to add mapper classes to the chain. In Chain Reducer (a driver sketch follows below):
- The ChainReducer.setReducer() method is used to set the single reducer class.
- The ChainReducer.addMapper() method is used to add the mapper classes that run after the reducer.
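A driver sketch of a chained job using the new-API classes org.apache.hadoop.mapreduce.lib.chain.ChainMapper and ChainReducer; all of the mapper and reducer stages here are hypothetical and defined minimally just so the example is self-contained:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainJobDriver {

    // Hypothetical stages, kept tiny so the sketch compiles on its own.
    public static class TokenizeMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                ctx.write(new Text(token), new Text("1"));
            }
        }
    }

    public static class UpperCaseMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text key, Text value, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(new Text(key.toString().toUpperCase()), value);
        }
    }

    public static class CountReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            int count = 0;
            for (Text ignored : values) {
                count++;
            }
            ctx.write(key, new Text(Integer.toString(count)));
        }
    }

    public static class PrefixMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text key, Text value, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(new Text("word=" + key.toString()), value);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "chain example");
        job.setJarByClass(ChainJobDriver.class);

        // Map task: TokenizeMapper followed by UpperCaseMapper, run in a chain.
        ChainMapper.addMapper(job, TokenizeMapper.class,
                LongWritable.class, Text.class, Text.class, Text.class, new Configuration(false));
        ChainMapper.addMapper(job, UpperCaseMapper.class,
                Text.class, Text.class, Text.class, Text.class, new Configuration(false));

        // Reduce task: a single reducer, then one more mapper applied to its output.
        ChainReducer.setReducer(job, CountReducer.class,
                Text.class, Text.class, Text.class, Text.class, new Configuration(false));
        ChainReducer.addMapper(job, PrefixMapper.class,
                Text.class, Text.class, Text.class, Text.class, new Configuration(false));

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```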
59) What is side data distribution in MapReduce framework?
Side data is the extra read-only data needed by a MapReduce job to perform task on the main data set. In Hadoop there are 2 ways to make side data available to all the reduce or map tasks:
- Distributed cache
- Job Configuration
60) How can side data be distributed using the job configuration?
Side data can be distributed by setting arbitrary key-value pairs in the job configuration using the various setter methods on the Configuration object. Within the task, the data can be retrieved from the configuration returned by the Context's getConfiguration() method.
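A sketch of this mechanism; the property name filter.country and the filtering logic are made up for the example:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ConfigSideDataExample {

    public static class FilterMapper extends Mapper<LongWritable, Text, Text, Text> {
        private String country;

        @Override
        protected void setup(Context context) {
            // Retrieve the side data from the job configuration inside the task.
            country = context.getConfiguration().get("filter.country", "US");
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (value.toString().contains(country)) {
                context.write(new Text(country), value);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // A small piece of metadata passed to every task via the job configuration.
        // The property name "filter.country" is invented for this example.
        conf.set("filter.country", "IN");

        Job job = Job.getInstance(conf, "config side data");
        job.setJarByClass(ConfigSideDataExample.class);
        job.setMapperClass(FilterMapper.class);
        job.setNumReduceTasks(0);                 // map-only job
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```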
61) When should side data distribution via the job configuration be used, and when should it be avoided?
Side data distribution through the job configuration is useful only when the programmer needs to pass a small piece of metadata to the map or reduce tasks. This mechanism should not be used for moving more than a few kilobytes of data, because it puts pressure on memory usage, especially in a system running hundreds of jobs.
62) What is Distributed Cache in MapReduce?
It is another way of distributing side data: files and archives are copied to the task nodes in time for the tasks to use them when they run. To save network bandwidth, files are usually copied to any given node only once per job.
63) How to provide files or archives to MapReduce job in distributed cache mechanism?
Files that need to be distributed can be specified as a comma-separated list of URIs as the argument to the -files option of the Apache Hadoop job command. The files can be on HDFS.
Archive files (tar files, ZIP files and gzipped tar files) can be copied to task nodes through the distributed cache by using the -archives option.
64) Explain how distributed cache works in MapReduce Framework?
When an Apache MapReduce job is submitted with distributed cache options, the node managers copy the files specified by the -archives, -files, and -libjars options from the distributed cache to a local disk. The local.cache.size property can be used to configure the cache size on the node managers' local disks. Data is localized under the ${hadoop.tmp.dir}/mapred/local directory.
65) What will Apache Hadoop do when a task fails in a list of, say, 50 spawned tasks?
Apache Hadoop will restart the map or reduce task on some other node manager, and only if the task fails more than four times will it kill it. The default limit on the maximum number of attempts for map and reduce tasks can be set using the below-mentioned properties in the mapred-site.xml file.
- mapreduce.map.maxattempts
- mapreduce.reduce.maxattempts
Assume: In a MapReduce system, the HDFS block size is 256 MB and we have 3 files of sizes 248 KB, 268 MB and 512 MB. How many input splits will be created by the Hadoop framework?
Hadoop will create 5 splits as follows
- 1 split for 248 KB file
- 2 splits for 268 MB file (1 of 256 MB and another of 12 MB)
- 2 splits for 512 MB file (2 Splits of 256 MB)
66) Why can’t we just have the file in HDFS and have the application read it instead of distributed cache?
The distributed cache copies the file to all node managers at the beginning of the job, so if a node manager runs 10 or 50 map or reduce tasks, they all use the same local copy of the file.
Besides this, if a file has to be read from HDFS within the job, every map or reduce task will access it from HDFS; if a node manager runs 50 map tasks, it will therefore read the file 50 times from HDFS. Accessing the same data from the node manager's local file system is a lot faster than from the HDFS data nodes.
67) After restarting the namenode, MapReduce jobs that were working fine before the restart started to fail. What may be the reason for such failures?
The Hadoop cluster may be in safe mode after the restart of the namenode. The administrator should wait for the namenode to exit safe mode before restarting the jobs. Restarting jobs too early is one of the most common mistakes made by Hadoop administrators.
68) What are the things that you need to mention for a MapReduce job?
A. Classes for mapper and reducer.
B. Classes for reducer, mapper, and combiner.
C. Classes for the reducer, partitioner, mapper, and combiner.
D. None
Answer: A) The classes for the mapper and reducer.
69) How many times combiner will execute?
A. At least once.
B. 0 or 1 time.
C. 0, 1, or many times.
D. Can’t be configured
Answer: C) Zero, one, or many times.
70) Suppose you have a mapper that produces an integer value for each key, and the below set of reduce operations:
Reducer A: Give the maximum of the set of values.
Reducer B: Give the sum of the set of integer values.
Reducer C: Give the mean of the set of values.
Reducer D: Give the difference b/w the largest and smallest values in the set.
71) Which of the above mentioned reduce operations can be safely used as a combiner?
A. All of them.
B. A and B.
C. A, B, and D.
D. C and D.
E. None of them.
Answer: B) A and B.
72) What is Uber task in YARN?
If a job is small, the Application Master may choose to run its tasks in the same JVM as itself, since it judges that the overhead of allocating new containers and running the tasks in them outweighs the gain of running them in parallel, compared to running them sequentially on the one node. Such a job is called an uber task.
73) How to configure Uber Tasks?
By default, a small job is one that has fewer than ten mappers, only one reducer, and an input size smaller than the size of one HDFS block. These thresholds may be altered for a job by setting mapreduce.job.ubertask.maxmaps, mapreduce.job.ubertask.maxreduces and mapreduce.job.ubertask.maxbytes. It is also possible to disable uber tasks entirely by setting mapreduce.job.ubertask.enable to false.
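A short sketch of setting these properties programmatically on a job's Configuration (the threshold values chosen here are only illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class UberConfigExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Allow the Application Master to run small jobs in its own JVM ("uber" mode)
        // and tune what counts as "small".
        conf.setBoolean("mapreduce.job.ubertask.enable", true);
        conf.setInt("mapreduce.job.ubertask.maxmaps", 9);
        conf.setInt("mapreduce.job.ubertask.maxreduces", 1);
        conf.setLong("mapreduce.job.ubertask.maxbytes", 128L * 1024 * 1024);

        Job job = Job.getInstance(conf, "uber example");
        // ... remaining job setup (mapper, reducer, input/output paths) as usual ...
    }
}
```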
74) What are the ways to debug a failed MapReduce job?
Commonly there are two ways:
- By using MapReduce job counters.
- Through the YARN Web UI, by checking the syslogs for the actual status or error messages.
75) What is the significance of heartbeats in HDFS/MapReduce Framework?
A heartbeat in a master/slave architecture is a signal indicating that a node is alive. DataNodes send heartbeats to the NameNode, and node managers send their heartbeats to the Resource Manager, to tell the master node that they are still active.
76) Can we rename the output file?
Yes
77) What are the default formats of input and output file in MapReduce jobs?
If no formats are set, the input and output files of a MapReduce job are treated as text files by default.