Data Commons

Data Commons is an open-source platform created by Google that provides an open knowledge graph, combining economic, scientific and other public datasets into a unified view. Ramanathan V. Guha, a creator of web standards including RDF, RSS, and Schema.org, founded the project, which is now led by Prem Ramaswami.

The Data Commons website was launched in May 2018 with an initial dataset consisting of fact-checking data published in Schema.org "ClaimReview" format by several fact checkers from the International Fact-Checking Network. Google has worked with partners such as the United Nations (UN) to populate the repository, which also includes data from the United States Census, the World Bank, the US Bureau of Labor Statistics, Wikipedia, the National Oceanic and Atmospheric Administration and the Federal Bureau of Investigation.

The service expanded during 2019 to include an RDF-style knowledge graph populated from a number of largely statistical open datasets, and was announced to a wider audience that year. In 2020 the service improved its coverage of non-US datasets, while also increasing its coverage of bioinformatics and coronavirus data. In 2023, the service relaunched with a natural-language front end powered by a large language model. It also launched as the back end to the UN data portal with Sustainable Development Goals data.


Features

Data Commons places more emphasis on statistical data than is common for linked data and knowledge graph initiatives. It includes geographical, demographic, weather and real estate data alongside other categories, describing states, Congressional districts, and cities in the United States as well as biological specimens, power plants, and elements of the human genome via the Encyclopedia of DNA Elements (ENCODE) project. It represents data as semantic triples, each of which can have its own provenance. It centers on the entity-oriented integration of statistical observations from a variety of public datasets. Although it supports a subset of the W3C SPARQL query language, its APIs also include tools, such as a Pandas dataframe interface, oriented towards data science, statistics and data visualization. Data Commons is integrative, meaning that it does not provide a hosting platform for different datasets, but rather attempts to consolidate much of the information provided by the datasets into a single data graph.
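
The idea of triples that each carry their own provenance, consolidated around entities, can be illustrated with a minimal sketch. This is not the Data Commons API or its storage format: the `Triple` type, the `integrate` helper, the entity identifiers, the source names and the values are all hypothetical and made up for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One statement plus the source it came from (hypothetical structure)."""
    subject: str     # entity identifier (made-up IDs below)
    predicate: str   # property name
    obj: str         # value, or another entity's identifier
    provenance: str  # dataset or publisher the statement was taken from


def integrate(triples):
    """Group statements by entity, keeping each statement's provenance."""
    graph = defaultdict(list)
    for t in triples:
        graph[t.subject].append((t.predicate, t.obj, t.provenance))
    return dict(graph)


# Illustrative statements from two hypothetical sources (not real data).
triples = [
    Triple("place/ExampleCity", "name", "Example City", "source-a.example"),
    Triple("place/ExampleCity", "population", "1000", "source-a.example"),
    Triple("place/ExampleCity", "unemploymentRate", "4.2", "source-b.example"),
    Triple("place/OtherCity", "name", "Other City", "source-b.example"),
]

graph = integrate(triples)
for predicate, obj, provenance in graph["place/ExampleCity"]:
    print(f"{predicate} = {obj}  (from {provenance})")
```

Keying the graph by entity rather than by dataset mirrors the integrative approach described above: statements from different sources end up side by side on the same node, while the provenance field still records where each one originated.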


Technology

Data Commons is built on a graph data model. The graph can be accessed through a browser interface and several APIs, and is expanded by loading data (typically CSV and MCF-based templates). The graph can also be accessed through natural language queries in Google Search. The data vocabulary used to define the datacommons.org graph is based upon Schema.org. In particular, the Schema.org terms StatisticalPopulation and Observation were proposed to Schema.org to support Data Commons-like use cases. Software from the project is available on GitHub under the Apache 2 license.
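
As a rough illustration of how tabular data might be turned into observation-style records for a graph, the sketch below reads a small CSV and emits one record per measured cell. It does not use the MCF template format or the actual Schema.org property names: the column names, the field names (`observedEntity`, `variable`, `date`, `value`), the identifiers and the numbers are assumptions chosen only for this example.

```python
import csv
import io

# Hypothetical CSV: one row per place and year, one column per measured variable.
CSV_TEXT = """place,year,population,medianAge
place/ExampleCity,2020,1000,35.5
place/OtherCity,2020,2500,41.0
"""

# Columns that identify what a row is about rather than holding a measurement.
KEY_COLUMNS = ("place", "year")


def rows_to_observations(csv_text):
    """Turn each non-key cell into one observation record (a small dict)."""
    observations = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, value in row.items():
            if column in KEY_COLUMNS:
                continue
            observations.append({
                "observedEntity": row["place"],  # node the value is about
                "variable": column,              # what was measured
                "date": row["year"],             # when it was measured
                "value": float(value),           # the measurement itself
            })
    return observations


observations = rows_to_observations(CSV_TEXT)
for obs in observations:
    print(obs)
```

Each record links a measurement to an entity, a variable and a date, which is the general shape of a statistical observation attached to a graph node; a real import would also need to map the made-up column names onto the graph's vocabulary.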


External links

* GitHub repository

* Data Sets (Veri Setleri)