Apache Beam is an open source unified programming model to define and execute data processing pipelines, including ETL, batch and stream (continuous) processing.
Beam pipelines are defined using one of the provided SDKs and executed in one of Beam's supported "runners" (distributed processing back-ends), including Apache Flink, Apache Samza, Apache Spark, and Google Cloud Dataflow.
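The split between defining a pipeline and executing it on a runner can be illustrated with a small toy model. The sketch below uses only the Python standard library and is not the Beam SDK: the names `ToyPipeline`, `LocalToyRunner`, `split_words` and `count_per_element` are hypothetical and chosen for illustration only. It assumes a pipeline is a recorded list of transforms that does nothing until a runner is handed both the pipeline and some input.

```python
from collections import defaultdict
from typing import Callable, Iterable, List

# Hypothetical names throughout: this is a toy model, NOT the Apache Beam API.


class ToyPipeline:
    """Records transforms without executing them (the definition side)."""

    def __init__(self) -> None:
        self.steps: List[Callable[[Iterable], Iterable]] = []

    def apply(self, step: Callable[[Iterable], Iterable]) -> "ToyPipeline":
        self.steps.append(step)
        return self  # returning self allows chained apply() calls


def split_words(lines: Iterable[str]) -> Iterable[str]:
    """Element-wise transform: emit every whitespace-separated word."""
    for line in lines:
        yield from line.split()


def count_per_element(words: Iterable[str]) -> Iterable[tuple]:
    """Aggregating transform: count occurrences of each distinct word."""
    counts = defaultdict(int)
    for word in words:
        counts[word] += 1
    return sorted(counts.items())


class LocalToyRunner:
    """Executes a recorded pipeline in-process (the runner side)."""

    def run(self, pipeline: ToyPipeline, source: Iterable) -> list:
        data = source
        for step in pipeline.steps:
            data = step(data)
        return list(data)


# The pipeline is defined once; nothing runs until a runner executes it.
pipeline = ToyPipeline().apply(split_words).apply(count_per_element)
result = LocalToyRunner().run(pipeline, ["to be or", "not to be"])
print(result)
```

In this toy model, a different runner class could execute the same `ToyPipeline` object in another way without changing the pipeline definition, which is the property the paragraph above attributes to Beam's runners.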
History
Apache Beam is one implementation of the model described in the Dataflow model paper.
The Dataflow model is based on previous work on distributed processing abstractions at Google, in particular on FlumeJava and MillWheel.
Google released an open SDK implementation of the Dataflow model in 2014, along with an environment to execute Dataflows locally (non-distributed) as well as in the Google Cloud Platform service.
Timeline
Apache Beam makes minor releases every 6 weeks.
See also
* List of Apache Software Foundation projects