Apache Spark: The Hot Kid on the Block
Spark was developed at the University of California, Berkeley around 2009 and was open-sourced in 2010. Like most services in the Hadoop ecosystem, it is open source, which keeps it cost-effective, and it keeps gaining features in response to users’ requirements.
Spark can run up to 100 times faster than Hadoop MapReduce when working in memory and up to 10 times faster when running on disk. It is faster than Hadoop MapReduce mainly because it keeps intermediate data in memory instead of writing it to temporary disk storage between processing steps.
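To make the difference concrete, here is a toy sketch in plain Python. It does not use the Spark or Hadoop APIs; the function names (`run_via_disk`, `run_in_memory`) are made up for illustration. It assumes a two-stage job (tokenize, then count) and contrasts writing the intermediate result to a temporary file with passing it along in memory.

```python
import json
import os
import tempfile


def tokenize(lines):
    """Stage 1: split each line into lowercase words."""
    return [word.lower() for line in lines for word in line.split()]


def count(words):
    """Stage 2: count occurrences of each word."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


def run_via_disk(lines):
    """MapReduce-style (illustrative): persist the intermediate result, then reload it."""
    words = tokenize(lines)
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump(words, f)  # extra serialization and disk write between stages
        path = f.name
    try:
        with open(path) as f:
            words_from_disk = json.load(f)  # extra disk read and parse
    finally:
        os.remove(path)
    return count(words_from_disk)


def run_in_memory(lines):
    """Spark-style (illustrative): hand the intermediate result straight to the next stage."""
    return count(tokenize(lines))


lines = ["spark keeps data in memory", "mapreduce writes data to disk"]
# Both approaches compute the same answer; only the path between stages differs.
assert run_via_disk(lines) == run_in_memory(lines)
print(run_in_memory(lines)["data"])  # prints 2
```

Both functions return the same counts. The disk variant simply pays for an extra write and read between stages, which is the kind of overhead that in-memory processing avoids on a much larger scale.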
Spark can run standalone or on a Hadoop cluster, where it can use the underlying HDFS storage. Spark works well when all the data fits in memory, but for huge interactive or batch datasets that do not, MapReduce is well suited.
Spark supports development in Java, Scala, or Python and provides several APIs:
Spark SQL: Provides SQL-style interaction with data.
Spark Streaming: Handles real-time and near-real-time data.
MLlib: Provides a library of machine learning functionality.
Pros:
- Up to 100 times faster than Hadoop MapReduce (with in-memory processing).
- Same platform for real-time and batch processing (as the future of analytics is in real-time analytics).
- Developed in Scala, a functional programming language that is well suited to distributed systems. A lot can be accomplished in a small piece of code, which is also readable and easy to understand.
- Spark Streaming is suitable for many real-time analytics use cases.
- It’s highly configurable, and once you know how to get the best out of it for your use case, you can make it work really well.
Cons:
- It consumes a lot of memory, and issues around memory consumption (and garbage collection) are not handled in a user-friendly manner.