Introduction to HDFS Erasure Coding in Apache Hadoop
Hadoop is a popular open-source implementation of the MapReduce framework, designed to analyze large data sets. It has two main components: the Hadoop Distributed File System (HDFS) and MapReduce. HDFS is the file system Hadoop uses to store its data, and it has become popular due to its reliability, scalability, and low-cost storage.
HDFS by default replicates each block three times. Replication provides a simple and robust form of redundancy to shield against most failure scenarios. It also eases scheduling compute tasks on locally stored data blocks by providing multiple replicas of each block to choose from.
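To make the mechanism concrete, here is a minimal sketch (the path and values are illustrative only) of how a file's replication factor can be adjusted through the standard Hadoop FileSystem API; the cluster-wide default comes from the dfs.replication property in hdfs-site.xml:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        // Load the cluster configuration (core-site.xml / hdfs-site.xml on the classpath).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file path used only for this example.
        Path file = new Path("/data/example/part-00000");

        // Ask the NameNode to keep 3 replicas of every block of this file
        // (3 is also the HDFS default).
        boolean accepted = fs.setReplication(file, (short) 3);
        System.out.println("Replication change accepted: " + accepted);

        fs.close();
    }
}
```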
However, replication is expensive: the default 3x replication scheme incurs a 200% overhead in storage space and other resources such as network bandwidth when writing the data. For example, storing a 1 GB file consumes 3 GB of raw disk space. For datasets with relatively low I/O activity, the additional block replicas are rarely accessed during normal operations, yet they still consume the same amount of storage space.
Erasure Coding
A Maximum Distance Separable (MDS) erasure code encodes k data cells into k + m total cells by adding m parity cells, such that any k of the k + m cells are sufficient to reconstruct the original data; in other words, the loss of any m cells can be tolerated.
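To illustrate the idea with the simplest possible erasure code, the sketch below uses a single XOR parity cell over k data cells (k = 3, m = 1 here; the cell contents are made up for the example). HDFS's built-in policies use Reed-Solomon codes, which generalize this to multiple parity cells, but the recovery principle is the same: a lost cell is rebuilt from the surviving ones.

```java
import java.util.Arrays;

public class XorParityDemo {
    // Compute a single parity cell as the byte-wise XOR of all data cells.
    static byte[] computeParity(byte[][] dataCells) {
        byte[] parity = new byte[dataCells[0].length];
        for (byte[] cell : dataCells) {
            for (int i = 0; i < parity.length; i++) {
                parity[i] ^= cell[i];
            }
        }
        return parity;
    }

    // Rebuild one lost data cell by XOR-ing the parity with the surviving cells.
    static byte[] reconstruct(byte[][] survivingCells, byte[] parity) {
        byte[] recovered = parity.clone();
        for (byte[] cell : survivingCells) {
            for (int i = 0; i < recovered.length; i++) {
                recovered[i] ^= cell[i];
            }
        }
        return recovered;
    }

    public static void main(String[] args) {
        // Three toy data cells of 4 bytes each (k = 3), protected by one parity cell (m = 1).
        byte[][] data = {
            {1, 2, 3, 4},
            {5, 6, 7, 8},
            {9, 10, 11, 12}
        };
        byte[] parity = computeParity(data);

        // Pretend the second cell was lost; recover it from the other two plus the parity.
        byte[][] survivors = {data[0], data[2]};
        byte[] recovered = reconstruct(survivors, parity);

        System.out.println("Lost cell:      " + Arrays.toString(data[1]));
        System.out.println("Recovered cell: " + Arrays.toString(recovered));
    }
}
```

With a Reed-Solomon layout such as RS(6,3), six data cells are protected by three parity cells, so the raw-storage overhead is 3/6 = 50% while the loss of any three cells can still be tolerated, compared with the 200% overhead of 3x replication.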