Data journalism or data-driven journalism (DDJ) is a journalistic process based on analyzing and filtering large data sets for the purpose of creating or elevating a news story. Data journalism is a type of journalism reflecting the increased role of numerical data in the production and distribution of information in the digital era. It reflects the increased interaction between content producers (journalists) and several other fields such as design, computer science and statistics. From the point of view of journalists, it represents "an overlapping set of competencies drawn from disparate fields".

Data journalism has been widely used to unite several concepts and link them to journalism. Some see these as levels or stages leading from the simpler to the more complex uses of new technologies in the journalistic process.

Many data-driven stories begin with newly available resources such as open source software, open access publishing and open data, while others are products of public records requests or leaked materials. This approach to journalism builds on older practices, most notably computer-assisted reporting (CAR), a label used mainly in the US for decades. Another label for a partially similar approach is "precision journalism", based on a book by Philip Meyer, published in 1972, in which he advocated the use of techniques from the social sciences in researching stories. Data-driven journalism has a wider approach. At its core, the process builds on the growing availability of open data that is freely available online and analyzed with open source tools. Data-driven journalism strives to reach new levels of service for the public, helping the general public or specific groups or individuals to understand patterns and make decisions based on the findings. As such, data-driven journalism might help to put journalists into a role relevant for society in a new way.

Telling stories based on the data is the primary goal. The findings from data can be transformed into any form of journalistic writing. Visualizations can be used to create a clear understanding of a complex situation. Furthermore, elements of storytelling can be used to illustrate what the findings actually mean, from the perspective of someone who is affected by a development. This connection between data and story can be viewed as a "new arc" trying to span the gap between developments that are relevant but poorly understood and a story that is verifiable, trustworthy, relevant and easy to remember.

Definitions

Antonopoulos and Karyotakis define the practice of data journalism as "a way of enhancing reporting and news writing with the use and examination of statistics in order to provide a deeper insight into a news story and to highlight relevant data. One trend in the digital era of journalism has been to disseminate information to the public via interactive online content through data visualization tools such as tables, graphs, maps, infographics, microsites, and visual worlds. The in-depth examination of such data sets can lead to more concrete results and observations regarding timely topics of interest. In addition, data journalism may reveal hidden issues that seemingly were not a priority in the news coverage".

According to architect and multimedia journalist Mirko Lorenz, data-driven journalism is primarily a ''workflow'' that consists of the following elements: ''digging deep'' into data by scraping, cleansing and structuring it, ''filtering'' by mining for specific information, ''visualizing'' and ''making a story''. This process can be extended to provide results that cater to individual interests and the broader public.

Data journalism trainer and writer Paul Bradshaw describes the process of data-driven journalism in a similar manner: data must be ''found'', which may require specialized skills like MySQL or Python, then ''interrogated'', for which an understanding of jargon and statistics is necessary, and finally ''visualized'' and ''mashed'' with the aid of open-source tools.

A more results-driven definition comes from data reporter and web strategist Henk van Ess (2012): "Data-driven journalism enables reporters to tell untold stories, find new angles or complete stories via a workflow of finding, processing and presenting significant amounts of data (in any given form) with or without open tools." Van Ess claims that some of the data-driven workflow leads to products that "are not in orbit with the laws of good story telling" because the result emphasizes showing the problem, not explaining it. "A good data driven production has different layers. It allows you to find personalized that are only important for you, by drilling down to relevant but also enables you to zoom out to get the big picture."

In 2013, Van Ess offered a shorter definition that does not involve visualisation per se: "Data journalism can be based on any data that has to be processed first with tools before a relevant story is possible. It doesn't include visualization per se."

However, one of the problems in defining data journalism is that many definitions are not clear enough and focus on describing the computational methods of optimization, analysis, and visualization of information.

Emergence as a concept

The term "data journalism" was coined by political commentator Ben Wattenberg through his work starting in the mid-1960s layering narrative with statistics to support the theory that the United States had entered a golden age.

One of the earliest examples of using computers with journalism dates back to a 1952 endeavor by CBS to use a mainframe computer to predict the outcome of the presidential election, but it wasn't until 1967 that using computers for data analysis began to be more widely adopted. Working for the ''Detroit Free Press'' at the time, Philip Meyer used a mainframe to improve reporting on the riots spreading throughout the city. With a new precedent set for data analysis in journalism, Meyer collaborated with Donald Barlett and James Steele to look at patterns in conviction sentencing in Philadelphia during the 1970s. Meyer later wrote a book titled ''Precision Journalism'' that advocated the use of these techniques for combining data analysis with journalism.

Toward the end of the 1980s, significant events began to occur that helped to formally organize the field of computer-assisted reporting. Investigative reporter Bill Dedman of ''The Atlanta Journal-Constitution'' won a Pulitzer Prize in 1989 for ''The Color of Money'', his 1988 series of stories using CAR techniques to analyze racial discrimination by banks and other mortgage lenders in middle-income black neighborhoods. The National Institute for Computer Assisted Reporting (NICAR) was formed at the Missouri School of Journalism in collaboration with the Investigative Reporters and Editors (IRE). The first conference dedicated to CAR was organized by NICAR in conjunction with James Brown at Indiana University and held in 1990. The NICAR conferences have been held annually since and are now the single largest gathering of data journalists.

Although data journalism has been used informally by practitioners of computer-assisted reporting for decades, the first recorded use by a major news organization is by ''The Guardian'', which launched its Datablog in March 2009. Although the paternity of the term is disputed, it has been widely used since WikiLeaks' Afghan War documents leak in July 2010. ''The Guardian'' coverage of the war logs took advantage of free data visualization tools such as Google Fusion Tables, another common aspect of data journalism. ''Facts are Sacred'', by ''The Guardian'' Datablog editor Simon Rogers, offers a further description of data journalism.

Investigative data journalism combines the field of data journalism with investigative reporting. An example of investigative data journalism is the research of large amounts of textual or financial data. Investigative data journalism can also relate to the field of big data analytics for the processing of large data sets.

Since the introduction of the concept, a number of media companies have created "data teams" which develop visualizations for newsrooms. Most notable are the teams at Reuters, ProPublica, and ''La Nacion'' (Argentina). In Europe, ''The Guardian'' and ''Berliner Morgenpost'' have very productive teams, as do public broadcasters.

As projects like the MP expenses scandal (2009) and the 2013 release of the "offshore leaks" demonstrate, data-driven journalism can assume an investigative role, dealing on occasion with "not-so open", that is secret, data.

The annual Data Journalism Awards recognize outstanding reporting in the field of data journalism, and numerous Pulitzer Prizes in recent years have been awarded to data-driven storytelling, including the 2018 Pulitzer Prize in International Reporting and the 2017 Pulitzer Prize in Public Service.

Data quality

In many investigations the data that can be found might have omissions or be misleading. As one layer of data-driven journalism, a critical examination of the data quality is important. In other cases the data might not be public or might not be in the right format for further analysis, e.g. it is only available as a PDF. Here the process of data-driven journalism can turn into stories about data quality or about institutions' refusals to provide the data. As the practice as a whole is in an early stage of development, examinations of data sources, data sets, data quality and data format are an equally important part of this work.
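A critical examination of this kind often starts with a simple count of omissions. The following is a minimal sketch in Python using only the standard library; the sample data, the column names and the set of markers treated as "missing" are all assumptions made for illustration and would need adjusting for a real dataset.

```python
import csv
import io

# Hypothetical sample: a small CSV with gaps, standing in for a real dataset.
SAMPLE_CSV = """district,year,incidents
North,2010,14
South,2010,
East,,9
West,2010,n/a
"""

# Markers treated as "missing"; an assumption, adjust per dataset.
MISSING_MARKERS = {"", "n/a", "na", "null", "-"}


def missing_report(csv_text):
    """Count missing values per column in CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    total_rows = 0
    for row in reader:
        total_rows += 1
        for name in reader.fieldnames:
            # Short rows yield None for absent fields; treat as empty.
            value = (row[name] or "").strip().lower()
            if value in MISSING_MARKERS:
                counts[name] += 1
    return total_rows, counts


rows, report = missing_report(SAMPLE_CSV)
for column, missing in report.items():
    print(f"{column}: {missing} of {rows} values missing")
```

A report like this does not say why values are missing; it only indicates which columns deserve a question to the institution that published the data.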

Data-driven journalism and the value of trust

Based on the perspective of looking deeper into facts and drivers of events, there is a suggested change in media strategies: in this view the idea is to move "from attention to trust". The creation of attention, which has been a pillar of media business models, has lost its relevance because reports of new events are often distributed faster via new platforms such as Twitter than through traditional media channels. On the other hand, trust can be understood as a scarce resource. While distributing information is much easier and faster via the web, the abundance of offerings creates costs to verify and check the content of any story, which in turn creates an opportunity. The view to transform media companies into trusted data hubs has been described in an article cross-published in February 2011 on Owni.eu and Nieman Lab.

Process of data-driven journalism

The process to transform raw data into stories is akin to a refinement and transformation. The main goal is to extract information recipients can act upon. The task of a data journalist is to extract what is hidden. This approach can be applied to almost any context, such as finances, health, environment or other areas of public interest.

Inverted pyramid of data journalism

In 2011, Paul Bradshaw introduced a model he called "The Inverted Pyramid of Data Journalism".

Steps of the process

In order to achieve this, the process should be split up into several steps. While the steps leading to results can differ, a basic distinction can be made by looking at six phases:
# Find: Searching for data on the web
# Clean: Process to filter and transform data, preparation for visualization
# Visualize: Displaying the pattern, either as a static or animated visual
# Publish: Integrating the visuals, attaching data to stories
# Distribute: Enabling access on a variety of devices, such as the web, tablets and mobile
# Measure: Tracking usage of data stories over time and across the spectrum of uses.

Description of the steps

Finding data

Data can be obtained directly from governmental databases such as data.gov, data.gov.uk and the World Bank Data API, but also by placing Freedom of Information requests to government agencies; some requests are made and aggregated on websites like the UK's What Do They Know. While there is a worldwide trend towards opening data, there are national differences as to what extent that information is freely available in usable formats. If the data is in a webpage, scrapers are used to generate a spreadsheet. Examples of scrapers are WebScraper, Import.io, QuickCode, OutWit Hub and Needlebase (retired in 2012). In other cases OCR software can be used to get data from PDFs. Data can also be created by the public through crowdsourcing, as shown in March 2012 at the Datajournalism Conference in Hamburg by Henk van Ess.
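The idea of turning a webpage into a spreadsheet can be illustrated with a short Python sketch. It is not how any of the tools named above work internally; it assumes a simple, well-formed HTML table, uses a hypothetical page fragment in place of a downloaded page, and relies only on the standard library's `html.parser` and `csv` modules.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page fragment; a real scraper would fetch this over HTTP.
SAMPLE_HTML = """
<table>
  <tr><th>Agency</th><th>Requests</th></tr>
  <tr><td>Health</td><td>120</td></tr>
  <tr><td>Transport</td><td>85</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (such as indentation) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_csv(html_text):
    """Return the table cells found in html_text as CSV text."""
    scraper = TableScraper()
    scraper.feed(html_text)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(scraper.rows)
    return out.getvalue()


print(table_to_csv(SAMPLE_HTML))
```

Real pages are rarely this tidy: nested tables, merged cells and content loaded by scripts are common reasons why dedicated scraping tools exist.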

Cleaning data

Usually data is not in a format that is easy to visualize. For example, there may be too many data points, or the rows and columns may need to be sorted differently. Another issue is that, once investigated, many datasets need to be cleaned, structured and transformed. Various tools like OpenRefine, Data Wrangler and Google Spreadsheets allow uploading, extracting or formatting data.
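The kind of cleaning described here can be sketched in a few lines of Python. The example below is an illustration under stated assumptions, not a replacement for the tools above: the input records are hypothetical, and the rules (collapse whitespace, unify capitalization, strip a dollar sign and thousands separators, drop exact duplicates, sort by value) are choices that depend on the dataset at hand.

```python
def clean_rows(rows):
    """Normalize names, parse numbers, and drop duplicate records."""
    seen = set()
    cleaned = []
    for name, amount in rows:
        # Collapse stray whitespace and unify capitalization.
        name = " ".join(name.split()).title()
        # Strip currency symbols and thousands separators before parsing.
        digits = amount.replace("$", "").replace(",", "").strip()
        if not digits:
            continue  # skip rows with no usable amount
        record = (name, float(digits))
        if record not in seen:
            seen.add(record)
            cleaned.append(record)
    # Sort largest first, a common ordering for charts.
    return sorted(cleaned, key=lambda r: r[1], reverse=True)


# Hypothetical messy input, as it might come out of a scraper.
RAW = [
    ("  north   district ", "1,200"),
    ("North District", "1200"),
    ("south district", "$3,450.50"),
    ("East District", ""),
]

print(clean_rows(RAW))
```

Silently skipping rows without an amount is itself an editorial decision; in practice the skipped rows would usually be logged and reviewed rather than discarded.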

Visualizing data

To visualize data in the form of graphs and charts, applications such as Many Eyes or Tableau Public are available. Yahoo! Pipes and Open Heat Map are examples of tools that enable the creation of maps based on data spreadsheets. The number of options and platforms is expanding. Some new offerings provide options to search, display and embed data, an example being Timetric.

To create meaningful and relevant visualizations, journalists use a growing number of tools. There are by now several descriptions of what to look for and how to do it. The most notable published articles are:
* Joel Gunter: "#ijf11: Lessons in data journalism from the New York Times"
* Steve Myers: "Using Data Visualization as a Reporting Tool Can Reveal Story’s Shape", including a link to a tutorial by Sarah Cohen

As of 2011, the use of HTML 5 libraries using the canvas tag is gaining in popularity. There are numerous libraries that make it possible to graph data in a growing variety of forms. One example is RGraph. As of 2011 there is a growing list of JavaScript libraries for visualizing data.
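What such charting libraries do at their simplest is map values to shapes. The following Python sketch shows that mapping for a horizontal bar chart. It deliberately swaps in static SVG output for the canvas-based JavaScript approach described above, so that the example stays self-contained; the function name, its parameters and the figures are hypothetical, and the values are assumed to be positive.

```python
from xml.sax.saxutils import escape


def bar_chart_svg(data, bar_height=20, gap=5, max_width=300, label_width=120):
    """Render (label, value) pairs as a horizontal bar chart in SVG."""
    largest = max(value for _, value in data)
    height = len(data) * (bar_height + gap)
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{label_width + max_width}" height="{height}">'
    ]
    for i, (label, value) in enumerate(data):
        y = i * (bar_height + gap)
        # Scale each bar relative to the largest value.
        width = round(max_width * value / largest)
        parts.append(
            f'<text x="0" y="{y + bar_height - 5}">{escape(label)}</text>'
        )
        parts.append(
            f'<rect x="{label_width}" y="{y}" '
            f'width="{width}" height="{bar_height}" />'
        )
    parts.append("</svg>")
    return "\n".join(parts)


# Hypothetical figures for illustration only.
svg = bar_chart_svg([("Health", 120), ("Transport", 85), ("Defence", 60)])
print(svg)
```

The resulting markup can be saved to a file and opened in a browser. Axes, colors and interactivity are left out; those are the parts for which a dedicated library is normally used.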

Publishing data story

There are different options to publish data and visualizations. A basic approach is to attach the data to single stories, similar to embedding web videos. More advanced concepts allow the creation of single dossiers, e.g. to display a number of visualizations, articles and links to the data on one page. Often such specials have to be coded individually, as many content management systems are designed to display single posts based on the date of publication.

Distributing data

Providing access to existing data is another phase, which is gaining importance. Think of the sites as "marketplaces" (commercial or not), where datasets can be found easily by others. Especially if the insights for an article were gained from open data, journalists should provide a link to the data they used so that others can investigate (potentially starting another cycle of interrogation, leading to new insights). Providing access to data and enabling groups to discuss what information could be extracted is the main idea behind Buzzdata, a site using the concepts of social media such as sharing and following to create a community for data investigations. Other platforms (which can be used both to gather and to distribute data):
* Help Me Investigate (created by Paul Bradshaw)
* Timetric
* ScraperWiki

Measuring the impact of data stories

A final step of the process is to measure how often a dataset or visualization is viewed. In the context of data-driven journalism, the extent of such tracking, such as collecting user data or any other information that could be used for marketing reasons or other uses beyond the control of the user, should be viewed as problematic. One newer, non-intrusive option to measure usage is a lightweight tracker called PixelPing. The tracker is the result of a project by ProPublica and DocumentCloud. There is a corresponding service to collect the data. The software is open source and can be downloaded via GitHub.
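The principle of non-intrusive measurement, counting views without recording who viewed, can be sketched as follows. This Python example is not PixelPing's implementation or API; the class name, its methods and the item keys are hypothetical and only illustrate aggregate counting that keeps no per-user information.

```python
from collections import Counter


class HitCounter:
    """Aggregate view counts per item without storing who viewed it."""

    def __init__(self):
        self._pending = Counter()

    def record(self, key):
        # Only the item key is kept: no IP address, cookie, or user agent.
        self._pending[key] += 1

    def flush(self):
        """Return the accumulated counts and reset the counter."""
        counts = dict(self._pending)
        self._pending.clear()
        return counts


counter = HitCounter()
for key in ["riots-map", "riots-map", "budget-chart"]:
    counter.record(key)

totals = counter.flush()
print(totals)
```

In a deployed setting the `record` call would typically be triggered by a request for a tiny image embedded in the story, and `flush` would periodically hand the totals to a separate collecting service.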

Examples

There is a growing list of examples of how data-driven journalism can be applied. ''The Guardian'', one of the pioneering media companies in this space (see "Data journalism at the Guardian: what is it and how do we do it?"), has compiled an extensive list of data stories, see: "All of our data journalism in one spreadsheet".

Other prominent uses of data-driven journalism are related to the release by whistle-blower organization WikiLeaks of the Afghan War Diary, a compendium of 91,000 secret military reports covering the war in Afghanistan from 2004 to 2010. Three global broadsheets, namely ''The Guardian'', ''The New York Times'' and ''Der Spiegel'', dedicated extensive sections to the documents; the reporting by ''The Guardian'' included an interactive map pointing out the type, location and casualties caused by 16,000 IED attacks, ''The New York Times'' published a selection of reports that permits rolling over underlined text to reveal explanations of military terms, while ''Der Spiegel'' provided hybrid visualizations (containing both graphs and maps) on topics like the number of deaths related to insurgent bomb attacks. For the Iraq War logs release, ''The Guardian'' used Google Fusion Tables to create an interactive map of every incident where someone died, a technique it used again in the England riots of 2011 ("UK riots: every verified incident - interactive map", ''Guardian Datablog'', 11 August 2011).

External links

National Institute for Computer-Assisted Reporting website

DataJournalism.com, learn Data Journalism by reading, watching and discussions

List of data journalism university courses and programmes from around the world

''The Data Journalism Handbook: Towards A Critical Data Practice'', an open access handbook on data journalism around the world