Avro is a row-oriented remote procedure call and data serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data, and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services.
Avro uses a schema to structure the data that is being encoded. It has two schema languages: one intended for human editing (Avro IDL) and another, based on JSON, that is more machine-readable.
It is similar to Thrift and Protocol Buffers, but does not require running a code-generation program when a schema changes (unless desired for statically-typed languages).
Apache Spark SQL can access Avro as a data source.
Avro Object Container File
An Avro Object Container File consists of:
* A file header, followed by
* one or more file data blocks.
A file header consists of:
* Four bytes, ASCII 'O', 'b', 'j', followed by the Avro version number, which is 1 (0x01) (binary values 0x4F 0x62 0x6A 0x01).
* File metadata, including the schema definition.
* The 16-byte, randomly-generated sync marker for this file.
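The header layout above can be read with a few lines of plain Python. The following is a minimal sketch, not the Avro library's implementation: the function names read_long, read_bytes and read_header are hypothetical, and it assumes the metadata is encoded as an Avro map from strings to bytes, with lengths and counts written as zigzag variable-length integers as in Avro's binary encoding. The sample header is built in memory for illustration only.

```python
import io

MAGIC = b"Obj\x01"  # ASCII 'O', 'b', 'j' followed by the version byte 1


def read_long(stream):
    """Decode one zigzag-encoded, variable-length integer."""
    shift = 0
    accum = 0
    while True:
        chunk = stream.read(1)
        if not chunk:
            raise EOFError("truncated integer")
        accum |= (chunk[0] & 0x7F) << shift
        if not chunk[0] & 0x80:  # high bit clear: last byte
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)  # undo the zigzag mapping


def read_bytes(stream):
    """Read a length-prefixed byte string."""
    return stream.read(read_long(stream))


def read_header(stream):
    """Return (metadata dict, sync marker) from the start of a container file."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero count ends the map
            break
        if count < 0:  # negative count: a byte size follows, unused here
            count = -count
            read_long(stream)
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    sync = stream.read(16)
    return meta, sync


# Hypothetical in-memory header: two metadata entries, then a sync marker.
sample = (
    MAGIC
    + b"\x04"                            # map block holding 2 entries
    + b"\x14avro.codec" + b"\x08null"
    + b"\x16avro.schema" + b"\x0a" + b'"int"'
    + b"\x00"                            # end of the metadata map
    + bytes(range(16))                   # stand-in for the random sync marker
)
meta, sync = read_header(io.BytesIO(sample))
print(meta["avro.codec"], len(sync))
```

The same function could be pointed at a real file opened in binary mode; the sync marker it returns is the value that separates the data blocks that follow.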
For data blocks Avro specifies two serialization encodings: binary and JSON. Most applications will use the binary encoding, as it is smaller and faster. For debugging and web-based applications, the JSON encoding may sometimes be appropriate.
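To give a rough feel for why the binary encoding is smaller, the sketch below hand-encodes one record and compares its size with a plain JSON rendering. It rests on several assumptions: encode_long and encode_string are hypothetical helpers, the record is an invented example with two non-union fields, and json.dumps only approximates Avro's own JSON encoding.

```python
import json


def encode_long(n: int) -> bytes:
    """Zigzag-map a signed 64-bit integer, then emit it as a base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Length prefix (as a long) followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


# Hypothetical record; in the binary form the field names are not repeated,
# because they live in the schema rather than in each datum.
record = {"name": "Ben", "favorite_number": 8}
binary = encode_string(record["name"]) + encode_long(record["favorite_number"])
as_json = json.dumps(record).encode("utf-8")

print(len(binary), len(as_json))
```

The binary form carries only a length prefix, the string bytes and one integer byte, while the JSON text repeats every field name.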
Schema definition
Avro schemas are defined using JSON. Schemas are composed of primitive types (null, boolean, int, long, float, double, bytes, and string) and complex types (record, enum, array, map, union, and fixed).
Simple schema example:
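A minimal sketch of a record schema, built here as a Python dict and dumped to JSON. The namespace, record name and field names are illustrative assumptions rather than a schema taken from this article; the two unions show how an optional field can be expressed.

```python
import json

# Hypothetical record schema mixing primitive types and unions.
user_schema = {
    "namespace": "example.avro",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": ["int", "null"]},
        {"name": "favorite_color", "type": ["string", "null"]},
    ],
}

# The JSON text is what would be stored in a schema file such as user.avsc.
schema_json = json.dumps(user_schema, indent=2)
print(schema_json)
```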
Serializing and deserializing
Avro data can be stored together with its schema, so a serialized item can be read without knowing the schema ahead of time.
Example serialization and deserialization code in Python
Serialization:
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter
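# The schema must be known in order to write (API as of Apache Avro 1.8.2).
schema = avro.schema.parse(open("user.avsc", "rb").read())
writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
# Record values match the hex dump below; the field names are assumed to be those defined in user.avsc.
writer.append({"name": "Alyssa", "favorite_number": 256})
writer.append({"name": "Ben", "favorite_number": 8, "favorite_color": "red"})
writer.close()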
File "users.avro" will contain the schema in JSON and a compact binary representation of the data:
$ od -v -t x1z users.avro
0000000 4f 62 6a 01 04 14 61 76 72 6f 2e 63 6f 64 65 63 >Obj...avro.codec<
0000020 08 6e 75 6c 6c 16 61 76 72 6f 2e 73 63 68 65 6d >.null.avro.schem<
0000040 61 ba 03 7b 22 74 79 70 65 22 3a 20 22 72 65 63 >a..{"type": "rec<
0000400 00 05 f9 a3 80 98 47 54 62 bf 68 95 a2 ab 42 ef >......GTb.h...B.<
0000420 24 04 2c 0c 41 6c 79 73 73 61 00 80 04 02 06 42 >$.,.Alyssa.....B<
0000440 65 6e 00 10 00 06 72 65 64 05 f9 a3 80 98 47 54 >en....red.....GT<
0000460 62 bf 68 95 a2 ab 42 ef 24 >b.h...B.$<
0000471
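The data block near the end of the dump (the bytes from 04 2c up to the repeated sync marker) can be decoded by hand. The sketch below is an illustration under stated assumptions: it supposes each record has a string field followed by two unions whose first branch is the value type and whose second branch is null, which is what the bytes suggest, and the names read_long, read_string, read_user and the dictionary keys are hypothetical.

```python
import io


def read_long(stream):
    """Decode one zigzag-encoded, variable-length integer."""
    shift = 0
    accum = 0
    while True:
        b = stream.read(1)[0]
        accum |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)


def read_string(stream):
    """Read a length-prefixed UTF-8 string."""
    return stream.read(read_long(stream)).decode("utf-8")


def read_user(stream):
    """Decode one record: string, [int, null] union, [string, null] union."""
    name = read_string(stream)
    branch = read_long(stream)  # union branch index
    number = read_long(stream) if branch == 0 else None
    branch = read_long(stream)
    color = read_string(stream) if branch == 0 else None
    return {"name": name, "favorite_number": number, "favorite_color": color}


# Bytes copied from the dump: object count, byte size, then the records.
block = bytes.fromhex(
    "04 2c "
    "0c 41 6c 79 73 73 61 00 80 04 02 "
    "06 42 65 6e 00 10 00 06 72 65 64"
)
stream = io.BytesIO(block)
count = read_long(stream)  # number of records in the block
size = read_long(stream)   # size of the serialized records in bytes
users = [read_user(stream) for _ in range(count)]
print(count, size, users)
```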
Deserialization:
# The schema is embedded in the data file, so none is passed to the reader.
reader = DataFileReader(open("users.avro", "rb"), DatumReader())
for user in reader:
    print(user)
reader.close()
This outputs:
Languages with APIs
Though theoretically any language could use Avro, the following languages have APIs written for them:
* C
* C++
* C#
* Elixir
* Go
* Haskell
* Java
* JavaScript
* Perl
* PHP
* Python
* Ruby
* Rust
* Scala
Avro IDL
In addition to supporting JSON for type and protocol definitions, Avro includes experimental support for an alternative interface description language (IDL) syntax known as Avro IDL. Previously known as GenAvro, this format is designed to ease adoption by users familiar with more traditional IDLs and programming languages, with a syntax similar to C/C++, Protocol Buffers and others.
Logo
The Apache Avro logo is from the defunct British aircraft manufacturer Avro (originally A.V. Roe and Company). The football team Avro F.C. uses the same logo.
See also
* Comparison of data serialization formats
* Apache Thrift
* Protocol Buffers
* Etch (protocol)
* Internet Communications Engine
* MessagePack
* CBOR