Apache Spark is an
open-source
Open source is source code that is made freely available for possible modification and redistribution. Products include permission to use the source code, design documents, or content of the product. The open-source model is a decentralized sof ...
unified analytics engine for large-scale data processing. Spark provides an
interface
Interface or interfacing may refer to:
Academic journals
* ''Interface'' (journal), by the Electrochemical Society
* '' Interface, Journal of Applied Linguistics'', now merged with ''ITL International Journal of Applied Linguistics''
* '' Int ...
for programming clusters with implicit
data parallelism
Data parallelism is parallelization across multiple processors in parallel computing environments. It focuses on distributing the data across different nodes, which operate on the data in parallel. It can be applied on regular data structures lik ...
and
fault tolerance
Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of one or more faults within some of its components. If its operating quality decreases at all, the decrease is proportional to the ...
. Originally developed at the
University of California, Berkeley
The University of California, Berkeley (UC Berkeley, Berkeley, Cal, or California) is a public land-grant research university in Berkeley, California. Established in 1868 as the University of California, it is the state's first land-grant u ...
's
AMPLab
AMPLAB was a University of California, Berkeley lab focused on big data analytics located in Soda Hall. The name stands for the Algorithms, Machines and People Lab. It has been publishing papers since 2008 and was officially launched in 2011. The ...
, the Spark
codebase
In software development, a codebase (or code base) is a collection of source code used to build a particular software system, application, or software component. Typically, a codebase includes only human-written source code files; thus, a codeba ...
was later donated to the
Apache Software Foundation, which has maintained it since.
Overview
Apache Spark has its architectural foundation in the resilient distributed dataset (RDD), a read-only
multiset
In mathematics, a multiset (or bag, or mset) is a modification of the concept of a set that, unlike a set, allows for multiple instances for each of its elements. The number of instances given for each element is called the multiplicity of that ...
of data items distributed over a cluster of machines, that is maintained in a
fault-tolerant
Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of one or more faults within some of its components. If its operating quality decreases at all, the decrease is proportional to the ...
way.
The Dataframe API was released as an abstraction on top of the RDD, followed by the Dataset API. In Spark 1.x, the RDD was the primary
application programming interface (API), but as of Spark 2.x use of the Dataset API is encouraged even though the RDD API is not
deprecated
In several fields, especially computing, deprecation is the discouragement of use of some terminology, feature, design, or practice, typically because it has been superseded or is no longer considered efficient or safe, without completely removing ...
. The RDD technology still underlies the Dataset API.
Spark and its RDDs were developed in 2012 in response to limitations in the
MapReduce
MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.
A MapReduce program is composed of a ''map'' procedure, which performs filteri ...
cluster computing
paradigm, which forces a particular linear
dataflow
In computing, dataflow is a broad concept, which has various meanings depending on the application and context. In the context of software architecture, data flow relates to stream processing or reactive programming.
Software architecture
Da ...
structure on distributed programs: MapReduce programs read input data from disk,
map a function across the data,
reduce the results of the map, and store reduction results on disk. Spark's RDDs function as a
working set
Working set is a concept in computer science which defines the amount of memory that a process requires in a given time interval.
Definition
Peter Denning (1968) defines "the working set of information W(t, \tau) of a process at time t to be the ...
for distributed programs that offers a (deliberately) restricted form of distributed
shared memory
In computer science, shared memory is memory that may be simultaneously accessed by multiple programs with an intent to provide communication among them or avoid redundant copies. Shared memory is an efficient means of passing data between progr ...
.
Inside Apache Spark the workflow is managed as a
directed acyclic graph
In mathematics, particularly graph theory, and computer science, a directed acyclic graph (DAG) is a directed graph with no directed cycles. That is, it consists of vertices and edges (also called ''arcs''), with each edge directed from one v ...
(DAG). Nodes represent RDDs while edges represent the operations on the RDDs.
Spark facilitates the implementation of both
iterative algorithm
In computational mathematics, an iterative method is a mathematical procedure that uses an initial value to generate a sequence of improving approximate solutions for a class of problems, in which the ''n''-th approximation is derived from the pre ...
s, which visit their data set multiple times in a loop, and interactive/exploratory data analysis, i.e., the repeated
database
In computing, a database is an organized collection of data stored and accessed electronically. Small databases can be stored on a file system, while large databases are hosted on computer clusters or cloud storage. The design of databases spa ...
-style querying of data. The
latency of such applications may be reduced by several orders of magnitude compared to
Apache Hadoop MapReduce implementation.
Among the class of iterative algorithms are the training algorithms for
machine learning
Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence.
Machine ...
systems, which formed the initial impetus for developing Apache Spark.
Apache Spark requires a
cluster manager Within cluster and parallel computing
Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the sa ...
and a
distributed storage system. For cluster management, Spark supports standalone (native Spark cluster, where you can launch a cluster either manually or use the launch scripts provided by the install package. It is also possible to run these daemons on a single machine for testing),
Hadoop YARN,
Apache Mesos
Apache Mesos is an open-source project to manage computer clusters. It was developed at the University of California, Berkeley.
History
Mesos began as a research project in the UC Berkeley RAD Lab by then PhD students Benjamin Hindman, Andy Konw ...
or
Kubernetes
Kubernetes (, commonly stylized as K8s) is an open-source container orchestration system for automating software deployment, scaling, and management. Google originally designed Kubernetes, but the Cloud Native Computing Foundation now maintai ...
. For distributed storage, Spark can interface with a wide variety, including
Alluxio
Alluxio is an open-source virtual distributed file system (VDFS). Initially as research project "Tachyon", Alluxio was created at the University of California, Berkeley's AMPLab as Haoyuan Li's Ph.D. Thesis,
advised by Professor Scott Shenker ...
,
Hadoop Distributed File System (HDFS),
MapR File System (MapR-FS),
Cassandra
Cassandra or Kassandra (; Ancient Greek: Κασσάνδρα, , also , and sometimes referred to as Alexandra) in Greek mythology was a Trojan priestess dedicated to the god Apollo and fated by him to utter true prophecies but never to be believe ...
,
OpenStack Swift,
Amazon S3
Amazon S3 or Amazon Simple Storage Service is a service offered by Amazon Web Services (AWS) that provides object storage through a web service interface. Amazon S3 uses the same scalable storage infrastructure that Amazon.com uses to run its ...
,
Kudu
The kudus are two species of antelope of the genus ''Tragelaphus'':
* Lesser kudu, ''Tragelaphus imberbis'', of eastern Africa
* Greater kudu, ''Tragelaphus strepsiceros'', of eastern and southern Africa
The two species look similar, thoug ...
,
Lustre file system, or a custom solution can be implemented. Spark also supports a pseudo-distributed local mode, usually used only for development or testing purposes, where distributed storage is not required and the local file system can be used instead; in such a scenario, Spark is run on a single machine with one executor per
CPU core
A central processing unit (CPU), also called a central processor, main processor or just processor, is the electronic circuitry that executes instructions comprising a computer program. The CPU performs basic arithmetic, logic, controlling, an ...
.
Spark Core
Spark Core is the foundation of the overall project. It provides distributed task dispatching, scheduling, and basic
I/O functionalities, exposed through an application programming interface (for
Java
Java (; id, Jawa, ; jv, ꦗꦮ; su, ) is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea to the north. With a population of 151.6 million people, Java is the world's mo ...
,
Python,
Scala,
.NET and
R) centered on the RDD
abstraction
Abstraction in its main sense is a conceptual process wherein general rules and concepts are derived from the usage and classification of specific examples, literal ("real" or " concrete") signifiers, first principles, or other methods.
"An a ...
(the Java API is available for other JVM languages, but is also usable for some other non-JVM languages that can connect to the JVM, such as
Julia
Julia is usually a feminine given name. It is a Latinate feminine form of the name Julio and Julius. (For further details on etymology, see the Wiktionary entry "Julius".) The given name ''Julia'' had been in use throughout Late Antiquity (e ...
). This interface mirrors a
functional
Functional may refer to:
* Movements in architecture:
** Functionalism (architecture)
** Form follows function
* Functional group, combination of atoms within molecules
* Medical conditions without currently visible organic basis:
** Functional s ...
/
higher-order model of programming: a "driver" program invokes parallel operations such as map,
filter
Filter, filtering or filters may refer to:
Science and technology
Computing
* Filter (higher-order function), in functional programming
* Filter (software), a computer program to process a data stream
* Filter (video), a software component tha ...
or reduce on an RDD by passing a function to Spark, which then schedules the function's execution in parallel on the cluster. These operations, and additional ones such as
joins Join may refer to:
* Join (law), to include additional counts or additional defendants on an indictment
*In mathematics:
** Join (mathematics), a least upper bound of sets orders in lattice theory
** Join (topology), an operation combining two topo ...
, take RDDs as input and produce new RDDs. RDDs are
immutable
In object-oriented computer programming, object-oriented and Functional programming, functional programming, an immutable object