OpenRefine is an open-source desktop application for data cleanup and transformation to other formats, an activity commonly known as data wrangling. It is similar to spreadsheet applications, and can handle spreadsheet file formats such as CSV, but it behaves more like a database.
It operates on ''rows'' of data which have cells under ''columns'', similar to the manner in which relational database tables operate. OpenRefine projects consist of one table, whose rows can be filtered using ''facets'' that define criteria (for example, showing rows where a given column is not empty).
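The idea of a facet can be illustrated with a minimal Python sketch. It is not OpenRefine's implementation: the sample rows and the helper names `facet_counts` and `filter_rows` are assumptions made for illustration. The sketch counts the distinct values in a column and then keeps only the rows where that column is not empty.

```python
from collections import Counter

# Hypothetical sample project: one table, each row is a dict keyed by column name.
rows = [
    {"name": "Alice", "city": "Paris"},
    {"name": "Bob", "city": ""},
    {"name": "Carol", "city": "Berlin"},
    {"name": "Dave", "city": "Paris"},
]


def facet_counts(rows, column):
    """Count how often each value occurs in a column (hypothetical helper)."""
    return Counter(row[column] for row in rows)


def filter_rows(rows, column, predicate):
    """Keep only the rows whose cell in `column` satisfies `predicate`."""
    return [row for row in rows if predicate(row[column])]


counts = facet_counts(rows, "city")
non_empty = filter_rows(rows, "city", lambda value: value != "")

print(counts)
print([row["name"] for row in non_empty])
```

Later operations would then act only on the rows in `non_empty`, which mirrors how operations apply to the rows left visible by the active facets.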
Unlike spreadsheets, most operations in OpenRefine are done on all visible rows, for example, the transformation of all cells in all rows under one column, or the creation of a new column based on existing data. Actions performed on a dataset are stored in the project and can be 'replayed' on other datasets. Formulas are not stored in cells but are used to transform the data, and each transformation is applied only once. Formula expressions can be written in General Refine Expression Language (GREL), in Jython (i.e., Python), and in Clojure.
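The replay model can be sketched in plain Python. This is an illustration under stated assumptions rather than OpenRefine's actual operation format: the `history` list of (column, function) pairs and the `apply_operations` helper are hypothetical. Each operation is applied once to every row, and the resulting values are stored instead of the formula.

```python
def apply_operations(rows, operations):
    """Apply each recorded operation, in order, to every row (hypothetical helper)."""
    result = [dict(row) for row in rows]  # copy so the input rows are left untouched
    for column, transform in operations:
        for row in result:
            # The cell keeps the transformed value, not the formula.
            row[column] = transform(row[column])
    return result


# Hypothetical recorded history: trim whitespace, then title-case the "name" column.
history = [
    ("name", str.strip),
    ("name", str.title),
]

first = [{"name": "  ada lovelace "}, {"name": "ALAN TURING"}]
second = [{"name": " grace hopper"}]

cleaned_first = apply_operations(first, history)
cleaned_second = apply_operations(second, history)  # same history replayed on other data

print(cleaned_first)
print(cleaned_second)
```

Keeping the history as data, separate from the table, is what makes it possible to reuse the same cleanup steps on a second dataset.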
The program operates as a local web app: it starts a web server and opens the default browser to 127.0.0.1:3333.
Uses
* ''Cleaning messy data'': for example, a text file containing semi-structured data can be edited using transformations, facets and clustering to make the data cleanly structured.
* ''Transformation of data'': converting values to other formats, normalizing and denormalizing.
* ''Parsing data from web sites'': OpenRefine has a URL fetch feature and includes the jsoup HTML parser and DOM engine.
* ''Adding data to a dataset by fetching it from web services'' (i.e. returning JSON). For example, this can be used for geocoding addresses to geographic coordinates.
* ''Aligning to Wikidata'' (formerly Freebase): this involves ''reconciliation'', that is, mapping string values in cells to entities in Wikidata.
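The clustering mentioned under cleaning messy data can be illustrated with a simplified key-collision approach: values that normalize to the same key are grouped as likely variants of one another. The sketch below is an assumption-laden illustration in Python, not OpenRefine's own code; the `fingerprint` and `cluster` helpers and the sample names are hypothetical.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Normalize a value: trim, lowercase, drop ASCII punctuation, sort unique tokens."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values whose keys collide; keep groups with more than one distinct spelling."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


# Hypothetical messy column values.
names = ["Acme Inc.", "acme inc", "Inc. Acme", "Widget Co", "Acme  Inc"]
clusters = cluster(names)

print(clusters)
```

Each resulting group is a candidate for merging into one canonical spelling, a decision that would normally be left to the person reviewing the data.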
Supported formats
Import is supported from the following formats:
* TSV, CSV
* Text file with custom separators or columns split by fixed width
* XML
* RDF triples (RDF/XML and Notation3 serialization formats)
* JSON
* Google Spreadsheets
If input data is in a non-standard text format, it can be imported as whole lines, without splitting into columns, and the columns can be extracted later with OpenRefine's tools. Archived and compressed files are supported (.zip, .tar.gz, .tgz, .tar.bz2, .gz, or .bz2), and OpenRefine can download input files from a URL. To use web pages as input, it is possible to import a list of URLs and then invoke a URL fetch function.
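The line-based workflow described above, importing whole lines first and extracting columns afterwards, can be sketched in Python with a regular expression. The input lines, the pattern and the `split_line` helper are all hypothetical and only illustrate the idea; OpenRefine's own column-splitting tools are not shown here.

```python
import re

# Hypothetical input in a non-standard text format, imported as whole lines.
lines = [
    "2021-03-01 ERROR disk full on /dev/sda1",
    "2021-03-02 INFO backup finished",
    "not a log line",
]

# Named groups play the role of the columns to extract.
PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<level>[A-Z]+)\s+(?P<message>.*)$"
)


def split_line(line):
    """Split one raw line into columns; unmatched lines keep their text in `message`."""
    match = PATTERN.match(line)
    if match is None:
        return {"date": "", "level": "", "message": line}
    return match.groupdict()


table = [split_line(line) for line in lines]

for row in table:
    print(row)
```

Keeping unmatched lines rather than dropping them means that irregular rows stay visible and can be handled in a later step.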
Export is supported in the following formats:
* TSV
* CSV
* Microsoft Excel
* HTML table
* Google Spreadsheets
* Templating exporter: it is possible to define a custom template for outputting data, for example as a MediaWiki table.
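What a template-based export to a MediaWiki table produces can be sketched in Python. This does not use OpenRefine's template syntax; the `to_mediawiki_table` helper and the sample rows are hypothetical, and only standard MediaWiki table markup is assumed.

```python
def to_mediawiki_table(rows, columns):
    """Render rows as MediaWiki table markup (hypothetical helper).

    Cell values containing "|" are not escaped in this sketch.
    """
    lines = ['{| class="wikitable"']
    lines.append("! " + " !! ".join(columns))  # header row
    for row in rows:
        lines.append("|-")  # row separator
        lines.append("| " + " || ".join(str(row[column]) for column in columns))
    lines.append("|}")
    return "\n".join(lines)


# Hypothetical project rows.
rows = [
    {"name": "Alice", "city": "Paris"},
    {"name": "Bob", "city": "Berlin"},
]

wikitext = to_mediawiki_table(rows, ["name", "city"])
print(wikitext)
```

The same pattern of a fixed prefix, a per-row fragment, a row separator and a fixed suffix applies to other text-based output formats.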
Whole OpenRefine projects in native format can be exported as a .tar.gz archive.
Development
OpenRefine started life as Freebase Gridworks, developed by Metaweb, and has been available as open source since January 2010. On 16 July 2010, Google acquired Metaweb, the creators of Freebase, and on 10 November 2010 renamed Freebase Gridworks to Google Refine, releasing version 2.0. On 2 October 2012, original author David Huynh announced that Google would soon stop its active support of Google Refine. Since then, the codebase has been in transition to an open source project named OpenRefine.
google-refine - Google Refine, a power tool for working with messy data (formerly Freebase Gridworks) - Google Project Hosting. Code.google.com. Retrieved on 2013-08-16.
References
External links
* OpenRefine Beginners Tutorial by Emma Carroll