
OpenRefine is an open-source desktop application for data cleanup and transformation to other formats, an activity commonly known as data wrangling. It is similar to spreadsheet applications and can handle spreadsheet file formats such as CSV, but it behaves more like a database. It operates on ''rows'' of data which have cells under ''columns'', similar to the way relational database tables operate. An OpenRefine project consists of one table, whose rows can be filtered using ''facets'' that define criteria (for example, showing only rows where a given column is not empty). Unlike in spreadsheets, most operations in OpenRefine are applied to all visible rows at once, for example transforming all cells in all rows under one column, or creating a new column based on existing data. Actions performed on a dataset are stored in the project and can be 'replayed' on other datasets. Formulas are not stored in cells; they are used to transform the data, and each transformation is done only once. Formula expressions can be written in General Refine Expression Language (GREL), in Jython (i.e., Python), and in Clojure. The program operates as a local web app: it starts a web server and opens the default browser to 127.0.0.1:3333.
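The row, facet and replay model described above can be illustrated with a minimal Python sketch. This is not OpenRefine code: the names `non_blank`, `transform_column`, `replay` and the `history` structure are hypothetical and chosen only for illustration. A facet is modelled as a predicate over rows, and a transformation is applied once to every matching row in one column.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]
Facet = Callable[[Row], bool]
Step = Tuple[str, Callable[[str], str], Facet]


def non_blank(column: str) -> Facet:
    """Facet-like predicate: select rows whose cell in `column` is not empty."""
    return lambda row: bool(row.get(column, "").strip())


def transform_column(rows: List[Row], column: str,
                     expression: Callable[[str], str],
                     facet: Facet) -> List[Row]:
    """Apply `expression` once to the `column` cell of every row the facet selects."""
    return [
        {**row, column: expression(row.get(column, ""))} if facet(row) else dict(row)
        for row in rows
    ]


def replay(rows: List[Row], steps: List[Step]) -> List[Row]:
    """Re-run a recorded list of steps, in order, on any dataset."""
    for column, expression, facet in steps:
        rows = transform_column(rows, column, expression, facet)
    return rows


# A recorded history (hypothetical format): trim whitespace, then title-case.
history: List[Step] = [
    ("city", str.strip, non_blank("city")),
    ("city", str.title, non_blank("city")),
]

first = [{"city": "  paris "}, {"city": ""}]
second = [{"city": "NEW YORK"}]

cleaned_first = replay(first, history)
cleaned_second = replay(second, history)
print(cleaned_first)
print(cleaned_second)
```

The same `history` is applied to two different datasets, which mirrors the idea that actions are stored with the project rather than as formulas inside cells.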


Uses

* ''Cleaning messy data'': for example, a text file containing semi-structured data can be edited using transformations, facets and clustering to make the data cleanly structured.
* ''Transformation of data'': converting values to other formats, normalizing and denormalizing.
* ''Parsing data from web sites'': OpenRefine has a URL fetch feature and includes the jsoup HTML parser and DOM engine.
* ''Adding data to a dataset by fetching it from web services'' (i.e., those returning JSON). For example, this can be used for geocoding addresses to geographic coordinates.
* ''Aligning to Wikidata'' (formerly Freebase): this involves ''reconciliation'', that is, mapping string values in cells to entities in Wikidata.
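To make the clustering step mentioned in the list above more concrete, here is a simplified Python sketch of a key-collision approach: each value is reduced to a normalized "fingerprint", and values that share a fingerprint are grouped as likely variants of the same thing. The functions `fingerprint` and `cluster` are illustrative names, and the normalization steps are an assumption about what such a method typically does; OpenRefine's own clustering methods may differ in detail.

```python
import string
import unicodedata
from collections import defaultdict
from typing import Dict, List


def fingerprint(value: str) -> str:
    """Reduce a string to a normalized key (simplified, illustrative version)."""
    # Trim, lowercase, and decompose accented characters.
    text = unicodedata.normalize("NFKD", value.strip().lower())
    # Drop anything that is not plain ASCII (e.g. combining accents).
    text = text.encode("ascii", "ignore").decode("ascii")
    # Remove ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split into tokens, de-duplicate, sort, and re-join.
    return " ".join(sorted(set(text.split())))


def cluster(values: List[str]) -> List[List[str]]:
    """Group values whose fingerprints collide; keep groups with 2+ members."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [members for members in groups.values() if len(members) > 1]


names = ["Acme Inc.", "acme inc", "Inc. Acme", "Zenith Ltd"]
clusters = cluster(names)
print(clusters)
```

Each resulting group is a candidate for merging into a single canonical value; whether to merge remains a judgment call for the person cleaning the data.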


Supported formats

Import is supported from the following formats:
* TSV, CSV
* Text file with custom separators or columns split by fixed width
* XML
* RDF triples (RDF/XML and Notation3 serialization formats)
* JSON
* Google Spreadsheets

If input data is in a non-standard text format, it can be imported as whole lines, without splitting into columns, and the columns can be extracted later with OpenRefine's tools. Archived and compressed files are supported (.zip, .tar.gz, .tgz, .tar.bz2, .gz, or .bz2), and OpenRefine can download input files from a URL. To use web pages as input, it is possible to import a list of URLs and then invoke a URL fetch function.

Export is supported in the following formats:
* TSV
* CSV
* Microsoft Excel
* HTML table
* Google Spreadsheets
* Templating exporter: a custom template can be defined for outputting data, for example as a MediaWiki table.

Whole OpenRefine projects in native format can be exported as a .tar.gz archive.
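As a rough illustration of what a template-based export to a MediaWiki table produces, the following Python sketch renders rows into wikitext table markup. It does not use OpenRefine's template syntax; the function `to_mediawiki_table` and the sample data are hypothetical, and only the standard MediaWiki table markup (`{|`, `!`, `|-`, `|`, `|}`) is assumed.

```python
from typing import Dict, List


def to_mediawiki_table(rows: List[Dict[str, str]], columns: List[str]) -> str:
    """Render rows as a MediaWiki table, one wikitext line per table row."""
    lines = ['{| class="wikitable"']
    # Header row: cells separated by "!!".
    lines.append("! " + " !! ".join(columns))
    for row in rows:
        lines.append("|-")  # row separator
        # Data row: cells separated by "||"; missing cells become empty strings.
        lines.append("| " + " || ".join(row.get(col, "") for col in columns))
    lines.append("|}")
    return "\n".join(lines)


rows = [
    {"name": "Paris", "country": "France"},
    {"name": "Lima", "country": "Peru"},
]
wikitext = to_mediawiki_table(rows, ["name", "country"])
print(wikitext)
```

The design mirrors a templating exporter in spirit: a fixed prefix, a per-row pattern, a separator between rows, and a fixed suffix.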


Development

OpenRefine started life as Freebase Gridworks, developed by Metaweb, and has been available as open source since January 2010. On 16 July 2010, Google acquired Metaweb, the creators of Freebase, and on 10 November 2010 renamed Freebase Gridworks to Google Refine, releasing version 2.0. On 2 October 2012, original author David Huynh announced that Google would soon stop its active support of Google Refine. Since then, the codebase has been in transition to an open source project named OpenRefine.


References

* "google-refine - Google Refine, a power tool for working with messy data (formerly Freebase Gridworks)". Google Project Hosting, Code.google.com. Retrieved on 2013-08-16.


External links

* OpenRefine Beginners Tutorial by Emma Carroll