Apache Arrow is a
language-agnostic software framework
In computer programming, a software framework is an abstraction in which software, providing generic functionality, can be selectively changed by additional user-written code, thus providing application-specific software. It provides a standard ...
for developing data analytics applications that process
columnar data. It contains a standardized column-oriented memory format that is able to represent flat and hierarchical data for efficient analytic operations on modern
CPU
A central processing unit (CPU), also called a central processor, main processor or just processor, is the electronic circuitry that executes instructions comprising a computer program. The CPU performs basic arithmetic, logic, controlling, an ...
and
GPU
A graphics processing unit (GPU) is a specialized electronic circuit designed to manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device. GPUs are used in embedded systems, mob ...
hardware.
This reduces or eliminates factors that limit the feasibility of working with large sets of data, such as the cost, volatility, or physical constraints of
dynamic random-access memory
Dynamic random-access memory (dynamic RAM or DRAM) is a type of random-access semiconductor memory that stores each bit of data in a memory cell, usually consisting of a tiny capacitor and a transistor, both typically based on metal-oxi ...
.
Interoperability
Arrow can be used with
Apache Parquet
Apache Parquet is a free and open-source column-oriented data storage format in the Apache Hadoop ecosystem. It is similar to RCFile and ORC, the other columnar-storage file formats in Hadoop, and is compatible with most of the data processin ...
,
Apache Spark
Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. Originally developed at the University of Califor ...
,
NumPy,
PySpark,
pandas
Pediatric autoimmune neuropsychiatric disorders associated with streptococcal infections (PANDAS) is a controversial hypothetical diagnosis for a subset of children with rapid onset of obsessive-compulsive disorder (OCD) or tic disorders. Sy ...
and other data processing libraries.
The project includes native
software libraries
In computer science, a library is a collection of non-volatile resources used by computer programs, often for software development. These may include configuration data, documentation, help data, message templates, pre-written code and subro ...
written in C, C++, C#, Go, Java, JavaScript, Julia, MATLAB, Python, R, Ruby, and Rust. Arrow allows for zero-copy reads and fast data access and interchange without serialization overhead between these languages and systems.
Applications
Arrow has been used in diverse domains, including analytics, genomics,
and cloud computing.
Comparison to Apache Parquet and ORC
Apache Parquet
Apache Parquet is a free and open-source column-oriented data storage format in the Apache Hadoop ecosystem. It is similar to RCFile and ORC, the other columnar-storage file formats in Hadoop, and is compatible with most of the data processin ...
and
Apache ORC are popular examples of on-disk columnar data formats. Arrow is designed as a complement to these formats for processing data in-memory. The hardware resource engineering trade-offs for in-memory processing vary from those associated with on-disk storage. The Arrow and Parquet projects include libraries that allow for reading and writing data between the two formats.
Governance
Apache Arrow was announced by
The Apache Software Foundation
The Apache Software Foundation (ASF) is an American nonprofit corporation (classified as a 501(c)(3) organization in the United States) to support a number of open source software projects. The ASF was formed from a group of developers of the ...
on February 17, 2016,
with development led by a coalition of developers from other open source data analytics projects.
The initial codebase and Java library was seeded by code from
Apache Drill
Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Built chiefly by contributions from developers from MapR, Drill is inspired by Google's D ...
.
References
External links
Apache Arrowproject web site
Apache Arrow GitHubproject source code
{{Apache Software Foundation
Apache Software Foundation
Apache Software Foundation projects
Software frameworks