Introduction to Apache Spark

What is Apache Spark?

Apache Spark’s goal is to make working with Big Data easier.

But how can we handle Big Data more efficiently?

The short answer: by using a technique called parallel programming. But what exactly is meant by that?

In parallel programming, several computers are used simultaneously - that is, they run in parallel - to carry out computational tasks on a dataset. Because several computers work on the same computational task at the same time, the computations are performed faster. This is especially useful when calculations need to be done on gigabytes of data, e.g. in a “Big Data” setting.

Apache Spark or pandas for relatively large datasets?

Before starting any kind of (Big Data) project, the following rule of thumb should be applied:

  • If your local computer has enough memory to load the dataset, choose the pandas module.
  • If you are working with Big Data - that is, with giga- or terabytes of data - Apache Spark is recommended.

Apache Spark Functionalities

Apache Spark offers a large ecosystem of functionalities for working with Big Data. These include:

  • Data cleaning,
  • Machine learning,
    • In Python, Spark’s MLlib module can be used.
  • Graph processing,
  • Storing data,
  • Exporting data to various data formats,
  • Data streaming,
  • SQL queries.
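
As a brief illustration of the SQL-query functionality, here is a minimal sketch (the two-row DataFrame is made up for illustration and stands in for a real Big Data table):

```python
from pyspark.sql import SparkSession

# Start a local Spark session (assumes the pyspark package and a Java runtime are installed)
spark = SparkSession.builder.master("local[*]").appName("sql-demo").getOrCreate()

# Toy DataFrame standing in for a real dataset
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])

# Register the DataFrame as a temporary view so it can be queried with plain SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 40").show()

spark.stop()
```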

Historical Context

Apache Spark was started in 2009 by a group of PhD students at the University of California, Berkeley. It is an open source project.

To promote the use of Apache Spark, the same team of PhD students later founded Databricks, a commercial platform that aims to simplify working with Apache Spark (with a “pay-as-you-go” pricing strategy).

However, Databricks is not necessarily required to work with Apache Spark!

Requirements to use Apache Spark

  • It is recommended to set up a virtual environment on your local computer to be able to use Apache Spark without restrictions (see the sketch below for a minimal local setup).
    • This is where the advantage of Databricks comes into play, as Databricks’ servers automatically take care of setting up such an environment.
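
For a purely local setup (i.e. without Databricks), a minimal sketch could look like this - assuming Python with the pyspark package installed inside a virtual environment, plus a Java runtime available on the machine (which PySpark requires):

```python
# Inside a virtual environment, install PySpark first, e.g.:
#   python -m venv spark-env && source spark-env/bin/activate
#   pip install pyspark

from pyspark.sql import SparkSession

# Start a local Spark session that uses all available CPU cores as parallel "workers"
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("local-spark-demo") \
    .getOrCreate()

print(spark.version)
spark.stop()
```

In this local mode, the “Driver” and the “Workers” all live on one machine, which is convenient for learning and testing.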

Apache Spark Definitions

As with any new computer science technology, Apache Spark comes with a set of technical terms that need to be clarified before diving into any code examples.

Here’s an overview of the most important terms:

  • Spark: “Spark” describes the process of distributing computational work across different computers, so that they can perform computations in parallel - on multiple machines simultaneously.
  • Driver: The “Driver” is the computer instance that acts as “the boss” of all the other computers running in parallel.
  • Worker: A “Worker” is one of the many computers that run simultaneously in parallel. Each “Worker” receives its task from the “Driver”, i.e. the computer that supervises all the others.
    • Typically there is a number \(N\) of “Workers”.
      • This number \(N\) varies, depending on the computational workload (assigned by the Driver).
  • Executor: Within each “Worker”-instance, there is an “Executor”, which is responsible for actually executing a computational task (assigned by the Driver).
    • Important note: It may be the case (but does not have to be!) that - within a “Worker” - several “Executors” execute computational tasks.
      • However, when using Databricks, only 1 “Executor” is used per “Worker”.
  • Spark Cluster: The “Driver” together with all of its “Workers” forms a so-called “Spark Cluster”.
  • Job: This is a specific computational task which a “Worker” must execute (assigned by the “Driver”).
  • Stages: A “Job” consists of one (or more) “Stages”; if the “Job” involves several computational steps, it will - by definition - have several “Stages”.
    • Example: “Stage 1” (= sub-calculation 1) + “Stage 2” (= sub-calculation 2) = “Job 1” –> In this case, the “Job” consists of “2 stages” in total.
      • Shuffle: In the context of the above example, “Stage 2” may initially depend on the calculation of “Stage 1”. Hence, an exchange of data between “Stage 1” & “Stage 2” occurs, in order for any calculations of “Stage 2” to be performed at all.
        • Example: “Stage 1” could be to count all observations within a dataset, while “Stage 2” would be the computation of the mean. By definition, calculating the mean requires the number of observations within the dataset (which we do in “Stage 1”). It is at this very specific point in time, that the “shuffle” occurs, i.e. the exchange of data between “Stage 1” & “Stage 2” (in order to perform the computation of the mean in “Stage 2”).
  • Task: A “Job” is broken down into a number of “Stages”, depending on how many computational steps need to be performed. Each “Stage” can be broken down even further into so-called “Tasks”, which represent the smallest unit of computation in Apache Spark (in practice, each “Task” processes one data partition).
    • Example: “Task 1” + “Task 2” = “Job 1” –> In this case, the “Job” consists of 2 “Stages” in total, where each of these two “Stages” consists of a single (sub-) “Task” (see the sketch after this list).
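
To make these terms concrete, here is a minimal sketch (the toy dataset created with spark.range is an assumption, standing in for a real gigabyte-sized table): a single action triggers one “Job”, and the aggregation forces a “Shuffle”, so the “Job” is split into several “Stages”, each made up of “Tasks” that run on the individual data partitions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("job-stages-demo").getOrCreate()

# Toy dataset standing in for a gigabyte-sized table
df = spark.range(1_000_000)

# One action -> one Job; the aggregation requires a shuffle,
# so the Job is split into multiple Stages (visible in the Spark UI)
result = df.agg(F.count("id").alias("n_observations"),
                F.avg("id").alias("mean")).collect()
print(result)

spark.stop()
```

While the session is running, the Spark UI (by default at http://localhost:4040) lists the resulting “Jobs”, “Stages” and “Tasks”.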

Best Practices for using Apache Spark

Apache Spark comes with 2 APIs from which we can choose, in order to manipulate data with Python & SQL:

  • The RDD API (2009),
  • As well as the DataFrame API (2015).

Both of these APIs can be used to process data, and the DataFrame API in particular uses syntax similar to the pandas Python module. While either API works when using Apache Spark, it is recommended to choose the DataFrame API over the older, lower-level RDD API. Additionally, note that the DataFrame API is built on top of the RDD API.
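
As a minimal sketch of the difference between the two APIs (the toy word list is made up for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("api-comparison").getOrCreate()
words = ["spark", "makes", "big", "data", "easier"]

# RDD API (2009): low-level, functional style
rdd = spark.sparkContext.parallelize(words)
print(rdd.map(len).collect())                        # word lengths via map()

# DataFrame API (2015): declarative, pandas-like style (recommended)
df = spark.createDataFrame([(w,) for w in words], ["word"])
df.select(F.length("word").alias("length")).show()

spark.stop()
```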

RDD API

RDD is an acronym which stands for Resilient Distributed Dataset.

Historical Context:
When Apache Spark was created, the syntax of the RDD API formed the core of Apache Spark. Even though the RDD API is the older of the two, the more modern DataFrame API still builds on the original RDD architecture.

Therefore, in order to efficiently use either of the Apache Spark APIs, it is important to first learn about the old RDD API, especially the 3 words that make up the “RDD” acronym (as a reminder, RDD stands for “resilient distributed dataset”).

More specifically, we are interested in 2 things:

  • What exactly does “resilient” mean?
  • And what exactly is a “distributed dataset”?

What does “distributed dataset” mean?

The word “dataset” in the term “distributed dataset” is pretty self-explanatory: it represents a collection of many individual data points combined into one full dataset.

In the context of Big Data, a “distributed dataset” is a dataset that is split across a collection of computers running in parallel, where each individual computer instance processes a fraction of the dataset.

More precisely, this means that our (gigabyte-sized) dataset is - first - distributed across many individual computer instances before the data processing takes place. In the end, this procedure of splitting the data into “chunks” (each chunk is also called a “data partition”) yields higher computational performance, which is reflected in faster outputs when handling Big Data.
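
A minimal sketch of this idea (run locally, so the parallel “computers” are simply local CPU cores; the partition count of 4 is an arbitrary assumption):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("partition-demo").getOrCreate()

# Distribute a toy dataset across 4 partitions ("chunks");
# on a real cluster, each partition would be processed by a different Worker
rdd = spark.sparkContext.parallelize(range(1_000_000), numSlices=4)

print(rdd.getNumPartitions())          # -> 4
print(rdd.glom().map(len).collect())   # number of records per partition

spark.stop()
```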

What does “resilient” mean?

As we now know, Apache Spark handles simultaneous computations across different “Workers”. One of the key features of these simultaneous computations is that they are “resilient”.

Let’s look at an example:
One issue that is frequently encountered in parallel programming is that - while processing gigabytes of data - one of our “Workers” (which perform the computational tasks) may “die”, i.e. stop working. However, since simultaneous computations in Apache Spark are “resilient”, the computational task will still be carried out, even if a “Worker” fails to do its job. Note, however, that if the “Driver” dies, the entire computation will fail!

The Main Idea behind Apache Spark

One of Apache Spark’s core concepts is based on the fact that - when processing data - there exist two different types of computations:

  • "Transformations": when performing such a type of “data processing” operation, nothing will happen (i.e. no “Spark Job” will be executed), because transformations are considered to be “lazy” by Spark. More specifically, a cell in a Jupyter-Notebook will be executed, but this will show no output.
    • Note: Transformations include “filter” & “sort” operations, for example.
  • "Actions": In contrast to transformations, when “Actions” are executed, a calculation is performed and an output will be visible, if we execute a cell in a Jupyter-Notebook.

An example of “Transformations”:

Suppose we have a gigabyte-sized COVID-19 dataset for the US at the state-level.

Now, we want to:

  • First - sort the data so that the most recent COVID-19 figures come first, and
  • Second - filter only the data specifically for “LA”.

When we enter these two operations as “Spark code”, the code executes successfully in only 0.15 ms, but NO output will be visible! Nothing seems to have happened. And indeed: nothing happened!

But why did nothing happen?

Because we are using computations of the type “Transformation”. In this case, Spark performs the calculations “lazily”.

If we want to make an output visible, we have 2 options (see the sketch below):

  • We need to perform a calculation of the type “Action” (on the dataset), and / or
  • We must explicitly trigger the calculation of the type “Transformation”.
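
A minimal, hedged sketch of this example (the DataFrame covid_df, its columns “date”, “state” & “cases”, and the tiny in-memory sample are hypothetical stand-ins for the real gigabyte-sized dataset):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("covid-demo").getOrCreate()

# Hypothetical stand-in for the real state-level COVID-19 dataset
covid_df = spark.createDataFrame(
    [("2021-03-01", "LA", 120), ("2021-03-02", "LA", 95), ("2021-03-02", "NY", 300)],
    ["date", "state", "cases"],
)

# Two Transformations: both are lazy, so nothing happens yet
latest_first = covid_df.orderBy(F.col("date").desc())
la_only = latest_first.filter(F.col("state") == "LA")

# Option 1: an Action (show) triggers the whole chain and makes the output visible
la_only.show()

spark.stop()
```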

“Lazy” Computation: what exactly does this mean?

The fact that nothing happens for any computation of the type “Transformation” has to do with the size of the datasets (oftentimes, they take up giga- or terabytes of memory). If we applied a data processing method to such a large dataset without “lazy” evaluation, the computations would be performed line by line over the entire dataset and - in the best-case scenario - only produce a result after a (very) long time.

If, however, our computer has too little memory at its disposal, such a calculation would - in the worst-case scenario - fail. Hence, a computation that is executed “lazily” prevents a computer (that does not have enough internal memory) from failing.

This ability to perform computations “lazily” - combined with the ability to parallelize computational operations - is the reason why Spark can handle intensive computational operations on any kind of dataset efficiently, making it a “good” framework for Big Data.
