With each year, there seem to be more and more distributed systems on the market to manage data volume, variety, and velocity. Among these systems, Hadoop and Spark are the two that continue to get the most mindshare. But how can you decide which is right for you?
If you want to process clickstream data, does it make sense to batch it and import it into HDFS, or work with Spark Streaming? If you’re looking to do machine learning and predictive modeling, would Mahout or MLLib suit your purposes better?
To add to the confusion, Spark and Hadoop often work together with Spark processing data that sits in HDFS, Hadoop’s file system. But, they are distinct and separate entities, each with their own pros and cons and specific business-use cases.
This article will take a look at two systems, from the following perspectives: architecture, performance, costs, security, and machine learning.
Hadoop got its start as a Yahoo project in 2006, becoming a top-level Apache open-source project later on. It’s a general-purpose form of distributed processing that has several components: the Hadoop Distributed File System (HDFS), which stores files in a Hadoop-native format and parallelizes them across a cluster; YARN, a schedule that coordinates application runtimes; and MapReduce, the algorithm that actually processes the data in parallel. Hadoop is built in Java, and accessible through many programming languages, for writing MapReduce code, including Python, through a Thrift client.
In addition to these basic components, Hadoop also includes Sqoop, which moves relational data into HDFS; Hive, a SQL-like interface allowing users to run queries on HDFS; and Mahout, for machine learning. In addition to using HDFS for file storage, Hadoop can also now be configured to use S3 buckets or Azure blobs as input.
Spark is a newer project, initially developed in 2012, at the AMPLab at UC Berkeley. It’s also a top-level Apache project focused on processing data in parallel across a cluster, but the biggest difference is that it works in-memory.
Whereas Hadoop reads and writes files to HDFS, Spark processes data in RAM using a concept known as an RDD, Resilient Distributed Dataset. Spark can run either in stand-alone mode, with a Hadoop cluster serving as the data source, or in conjunction with Mesos. In the latter scenario, the Mesos master replaces the Spark master or YARN for scheduling purposes.
Spark is structured around Spark Core, the engine that drives the scheduling, optimizations, and RDD abstraction, as well as connects Spark to the correct filesystem (HDFS, S3, RDBMs, or Elasticsearch). There are several libraries that operate on top of Spark Core, including Spark SQL, which allows you to run SQL-like commands on distributed data sets, MLLib for machine learning, GraphX for graph problems, and streaming which allows for the input of continually streaming log data.
Spark has several APIs. The original interface was written in Scala, and based on heavy usage by data scientists, Python and R endpoints were also added. Java is another option for writing Spark jobs.
Databricks, the company founded by Spark creator Matei Zaharia, now oversees Spark development and offers Spark distribution for clients.
To start with, all the files passed into HDFS are split into blocks. Each block is replicated a specified number of times across the cluster based on a configured block size and replication factor. That information is passed to the NameNode, which keeps track of everything across the cluster. The NameNode assigns the files to a number of data nodes on which they are then written. High availability was implemented in 2012, allowing the NameNode to failover onto a backup Node to keep track of all the files across a cluster.
The MapReduce algorithm sits on top of HDFS and consists of a JobTracker. Once an application is written in one of the languages Hadoop accepts the JobTracker, picks it up, and allocates the work (which could include anything from counting words and cleaning log files, to running a HiveQL query on top of data stored in the Hive warehouse) to TaskTrackers listening on other nodes.
YARN allocates resources that the JobTracker spins up and monitors them, moving the processes around for more efficiency. All the results from the MapReduce stage are then aggregated and written back to disk in HDFS.
Spark handles work in a similar way to Hadoop, except that computations are carried out in memory and stored there, until the user actively persists them. Initially, Spark reads from a file on HDFS, S3, or another filestore, into an established mechanism called the SparkContext. Out of that context, Spark creates a structure called an RDD, or Resilient Distributed Dataset, which represents an immutable collection of elements that can be operated on in parallel.
As the RDD and related actions are being created, Spark also creates a DAG, or Directed Acyclic Graph, to visualize the order of operations and the relationship between the operations in the DAG. Each DAG has stages and steps; in this way, it’s similar to an explain plan in SQL.
You can perform transformations, intermediate steps, actions, or final steps on RDDs. The result of a given transformation goes into the DAG but does not persist to disk, but the result of an action persists all the data in memory to disk.
A new abstraction in Spark is DataFrames, which were developed in Spark 2.0 as a companion interface to RDDs. The two are extremely similar, but DataFrames organize data into named columns, similar to Python’s pandas or R packages. This makes them more user-friendly than RDDs, which don’t have a similar set of column-level header references. SparkSQL also allows users to query DataFrames much like SQL tables in relational data stores.
Spark has been found to run 100 times faster in-memory, and 10 times faster on disk. It’s also been used to sort 100 TB of data 3 times faster than Hadoop MapReduce on one-tenth of the machines. Spark has particularly been found to be faster on machine learning applications, such as Naive Bayes and k-means.
Spark performance, as measured by processing speed, has been found to be optimal over Hadoop, for several reasons:
- Spark is not bound by input-output concerns every time it runs a selected part of a MapReduce task. It’s proven to be much faster for applications
- Spark’s DAGs enable optimizations between steps. Hadoop doesn’t have any cyclical connection between MapReduce steps, meaning no performance tuning can occur at that level.
However, if Spark is running on YARN with other shared services, performance might degrade and cause RAM overhead memory leaks. For this reason, if a user has a use-case of batch processing, Hadoop has been found to be the more efficient system.
Both Spark and Hadoop are available for free as open-source Apache projects, meaning you could potentially run it with zero installation costs. However, it is important to consider the total cost of ownership, which includes maintenance, hardware and software purchases, and hiring a team that understands cluster administration. The general rule of thumb for on-prem installations is that Hadoop requires more memory on disk and Spark requires more RAM, meaning that setting up Spark clusters can be more expensive. Additionally, since Spark is the newer system, experts in it are rarer and more costly. Another option is to install using a vendor such as Cloudera for Hadoop, or Spark for DataBricks, or run EMR/MapReduce processes in the cloud with AWS.
Extract pricing comparisons can be complicated to split out since Hadoop and Spark are run in tandem, even on EMR instances, which are configured to run with Spark installed. For a very high-level point of comparison, assuming that you choose a compute-optimized EMR cluster for Hadoop the cost for the smallest instance, c4.large, is $0.026 per hour. The smallest memory-optimized cluster for Spark would cost $0.067 per hour. Therefore, on a per-hour basis, Spark is more expensive, but optimizing for compute time, similar tasks should take less time on a Spark cluster.
Fault Tolerance and Security
Hadoop is highly fault-tolerant because it was designed to replicate data across many nodes. Each file is split into blocks and replicated numerous times across many machines, ensuring that if a single machine goes down, the file can be rebuilt from other blocks elsewhere.
Spark’s fault tolerance is achieved mainly through RDD operations. Initially, data-at-rest is stored in HDFS, which is fault-tolerant through Hadoop’s architecture. As an RDD is built, so is a lineage, which remembers how the dataset was constructed, and, since it’s immutable, can rebuild it from scratch if need be. Data across Spark partitions can also be rebuilt across data nodes based on the DAG. Data is replicated across executor nodes, and generally can be corrupted if the node or communication between executors and drivers fails.
Both Spark and Hadoop have access to support for Kerberos authentication, but Hadoop has more fine-grained security controls for HDFS. Apache Sentry, a system for enforcing fine-grained metadata access, is another project available specifically for HDFS-level security.
Spark’s security model is currently sparse, but allows authentication via shared secret.
Hadoop uses Mahout for processing data. Mahout includes clustering, classification, and batch-based collaborative filtering, all of which run on top of MapReduce. This is being phased out in favor of Samsara, a Scala-backed DSL language that allows for in-memory and algebraic operations, and allows users to write their own algorithms.
Spark has a machine learning library, MLLib, in use for iterative machine learning applications in-memory. It’s available in Java, Scala, Python, or R, and includes classification, and regression, as well as the ability to build machine-learning pipelines with hyperparameter tuning.
Summing it up
So is it Hadoop or Spark? These systems are two of the most prominent distributed systems for processing data on the market today. Hadoop is used mainly for disk-heavy operations with the MapReduce paradigm, and Spark is a more flexible, but more costly in-memory processing architecture. Both are Apache top-level projects, are often used together, and have similarities, but it’s important to understand the features of each when deciding to implement them.