Sawzall (programming language)

Sawzall is a procedural domain-specific programming language, used by Google to process large numbers of individual log records. Sawzall was first described in 2003, and the szl runtime was open-sourced in August 2010. However, since the MapReduce table aggregators have not been released, the open-sourced runtime is not useful for large-scale data analysis of multiple log files off the shelf. Sawzall has been replaced by Lingo (logs in Go) for most purposes within Google.


Motivation

Google's server logs are stored as large collections of records (Protocol Buffers) that are partitioned over many disks within GFS. In order to perform calculations involving the logs, engineers can write MapReduce programs in C++ or Java. MapReduce programs need to be compiled and may be more verbose than necessary, so writing a program to analyze the logs can be time-consuming. To make it easier to write quick scripts, Rob Pike et al. developed the Sawzall language. A Sawzall script runs within the Map phase of a MapReduce and "emits" values to tables. Then the Reduce phase (which the script writer does not have to be concerned about) aggregates the tables from multiple runs into a single set of tables. Currently, only the language runtime (which runs a Sawzall script once over a single input) has been open-sourced; the supporting program built on MapReduce has not been released.
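The division of labor described above can be sketched in Python. This is a minimal illustration, not Google's implementation: the names `script`, `map_shard`, and `reduce_tables` are hypothetical, records are assumed to be numeric strings, and every table is assumed to be a sum table.

```python
from collections import defaultdict


def script(record, emit):
    """Per-record logic standing in for a Sawzall script (hypothetical)."""
    x = float(record)
    emit("count", 1)
    emit("total", x)


def map_shard(records):
    """Map phase: run the script once per record and sum its emits locally."""
    tables = defaultdict(float)

    def emit(table, value):
        tables[table] += value

    for record in records:
        script(record, emit)
    return dict(tables)


def reduce_tables(partials):
    """Reduce phase: merge the per-shard tables into a single set of tables."""
    merged = defaultdict(float)
    for partial in partials:
        for table, value in partial.items():
            merged[table] += value
    return dict(merged)


# Two shards of input, processed independently and then merged.
shards = [["1.5", "2.5"], ["4.0"]]
result = reduce_tables(map_shard(shard) for shard in shards)
print(result)  # {'count': 3.0, 'total': 8.0}
```

The script only ever sees one record and an `emit` callback, which is why the script writer does not need to think about how shards are merged.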


Features

Some interesting features include:
* A Sawzall script has a single input (a log record) and can output only by emitting to tables. The script can have no other side effects.
* A script can define any number of output tables. Table types include:
** collection saves every value emitted.
** sum saves the sum of every emitted value.
** maximum(n) saves only the highest n values on a given weight.
* In addition, there are several statistical table types that give inexact results. The higher the parameter n, the more accurate the estimates are.
** sample(n) gives a random sample of n values from all the emitted values.
** quantile(n) calculates a cumulative probability distribution of the given numbers.
** top(n) gives n values that are probably the most frequent of the emitted values.
** unique(n) estimates the number of unique values emitted.

Sawzall's design favors efficiency and engine simplicity over power:
* Sawzall is statically typed, and the engine compiles the script to x86 before running it.
* Sawzall supports the compound data types lists, maps, and structs. However, there are no references or pointers. All assignments and function arguments create copies. This means that recursive data structures and cycles are impossible.
* Like C, functions can modify global variables and local variables but are not closures.
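To make the exact table types listed above concrete, here is a rough Python sketch of how collection, sum, and maximum(n) might behave. The class names and the `emit`/`result` methods are assumptions made for illustration and are not part of Sawzall or the szl runtime; the statistical table types are omitted.

```python
import heapq


class Collection:
    """Keeps every emitted value."""

    def __init__(self):
        self.values = []

    def emit(self, value):
        self.values.append(value)

    def result(self):
        return list(self.values)


class Sum:
    """Keeps only the running sum of emitted values."""

    def __init__(self):
        self.total = 0

    def emit(self, value):
        self.total += value

    def result(self):
        return self.total


class Maximum:
    """Keeps the n values with the highest weights."""

    def __init__(self, n):
        self.n = n
        self.heap = []  # min-heap of (weight, sequence number, value)
        self.seq = 0    # tie-breaker so values themselves are never compared

    def emit(self, value, weight):
        item = (weight, self.seq, value)
        self.seq += 1
        if len(self.heap) < self.n:
            heapq.heappush(self.heap, item)
        else:
            # Push the new item, then drop the lowest-weight entry.
            heapq.heappushpop(self.heap, item)

    def result(self):
        return [value for _, _, value in sorted(self.heap, reverse=True)]


slowest = Maximum(2)
for url, latency_ms in [("/a", 120), ("/b", 45), ("/c", 300)]:
    slowest.emit(url, weight=latency_ms)
print(slowest.result())  # ['/c', '/a']
```

Each table needs only a small, bounded amount of state (apart from collection), which is consistent with the stated preference for engine simplicity.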


Sawzall code

This complete Sawzall program will read the input and produce three results: the number of records, the sum of the values, and the sum of the squares of the values.

    count: table sum of int;
    total: table sum of float;
    sum_of_squares: table sum of float;
    x: float = input;
    emit count <- 1;
    emit total <- x;
    emit sum_of_squares <- x * x;
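The three results are enough to derive the mean and the population variance of the input values after the run. The following Python sketch shows that arithmetic; the function name `summarize` is hypothetical, and the step itself is a follow-up calculation rather than part of the Sawzall program.

```python
def summarize(count, total, sum_of_squares):
    """Derive mean and population variance from the three aggregated results."""
    if count == 0:
        raise ValueError("no records were processed")
    mean = total / count
    # Var(x) = E[x^2] - (E[x])^2; clamp tiny negative values from rounding.
    variance = max(sum_of_squares / count - mean * mean, 0.0)
    return mean, variance


# Inputs 1, 2, 3, 4 give count=4, total=10, sum_of_squares=30.
mean, variance = summarize(4, 10.0, 30.0)
print(mean, variance)  # 2.5 1.25
```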


See also

* Pig, a similar tool and language for use with Apache Hadoop
* Sawmill (software)




Further reading

* S. Ghemawat, H. Gobioff, S.-T. Leung, The Google file system, in: 19th ACM Symposium on Operating Systems Principles, Proceedings, 17 ACM Press, 2003, pp. 29–43.


External links


Google Code Archive - Long-term storage for Google Code Project Hosting.
