DBpedia (from "DB" for "database") is a project aiming to extract structured content from the information created in the Wikipedia project. This structured information is made available on the World Wide Web using OpenLink Virtuoso. DBpedia allows users to semantically query relationships and properties of Wikipedia resources, including links to other related datasets.
Tim Berners-Lee, one of the Web's pioneers, heralded the project as "one of the more famous pieces" of the decentralized Linked Data effort.
As of June 2021, DBpedia contained over 850 million triples.
Background
The project was started by people at the Free University of Berlin and Leipzig University (''DBpedia: A Nucleus for a Web of Open Data'') in collaboration with OpenLink Software, and is now maintained by people at the University of Mannheim and Leipzig University. The first publicly available dataset was published in 2007. The data is made available under free licenses (CC BY-SA), allowing others to reuse the dataset; it does not use an open data license to waive the sui generis database rights.
Wikipedia articles consist mostly of free text, but also include structured information embedded in the articles, such as "infobox" tables (the pull-out panels that appear in the top right of the default view of many Wikipedia articles, or at the start of the mobile versions), categorization information, images, geo-coordinates and links to external Web pages. This structured information is extracted and put in a uniform dataset which can be queried.
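The extraction step described above can be pictured with a minimal sketch. It assumes a flat, heavily simplified infobox (real infoboxes nest templates and links, which this does not handle); the template name, the `db:` subject prefix and the `dbprop:` property prefix are illustrative assumptions, and this is not the extraction framework DBpedia actually uses.

```python
import re

# Simplified sample wikitext; the template name is an illustrative assumption.
WIKITEXT = """{{Infobox animanga
| title = Tokyo Mew Mew
| author = Reiko Yoshida
| illustrator = Mia Ikumi
| magazine = Nakayoshi
}}"""

# Matches one "| key = value" line; only spaces and tabs are skipped so that
# an empty value cannot swallow the following line.
PARAM_LINE = re.compile(
    r"^\|[ \t]*([^=|\n]+?)[ \t]*=[ \t]*(.*?)[ \t]*$", re.MULTILINE
)


def infobox_to_pairs(wikitext: str) -> dict:
    """Return the non-empty attribute-value pairs of a flat infobox."""
    return {key: value for key, value in PARAM_LINE.findall(wikitext) if value}


def pairs_to_triples(subject: str, pairs: dict) -> list:
    """Turn attribute-value pairs into (subject, property, value) triples."""
    return [(subject, "dbprop:" + key, value) for key, value in pairs.items()]


pairs = infobox_to_pairs(WIKITEXT)
triples = pairs_to_triples("db:Tokyo_Mew_Mew", pairs)
```

Each infobox row becomes one statement about the article's subject, which is what makes the resulting dataset uniform enough to query.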
Dataset
The 2016-04 release of the DBpedia dataset describes 6.0 million entities, out of which 5.2 million are classified in a consistent ontology, including 1.5 million persons, 810,000 places, 135,000 music albums, 106,000 films, 20,000 video games, 275,000 organizations, 301,000 species and 5,000 diseases. DBpedia uses the Resource Description Framework (RDF) to represent extracted information and consists of 9.5 billion RDF triples, of which 1.3 billion were extracted from the English edition of Wikipedia and 5.0 billion from other language editions.
From this data set, information spread across multiple pages can be extracted. For example, book authorship can be put together from pages about the work, or the author.
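As a rough illustration of how facts from several pages can be combined, the sketch below stores a handful of triples as Python tuples and matches them against a pattern. The prefixes and the property names `dbprop:author` and `dbprop:notableWork` are assumptions chosen for the example, not a statement about which properties DBpedia actually records.

```python
from typing import Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]

# Illustrative triples: two come from pages about works, two from the
# page about the author. Property names are hypothetical.
TRIPLES: List[Triple] = [
    ("db:Tokyo_Mew_Mew", "dbprop:author", "db:Mia_Ikumi"),
    ("db:Koi_Cupid", "dbprop:author", "db:Mia_Ikumi"),
    ("db:Mia_Ikumi", "dbprop:notableWork", "db:Super_Doll_Licca-chan"),
    ("db:Mia_Ikumi", "dbprop:notableWork", "db:Tokyo_Mew_Mew"),
]


def match(
    triples: List[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield triples that fit the pattern; None acts as a wildcard."""
    for triple in triples:
        if all(want is None or want == got for want, got in zip((s, p, o), triple)):
            yield triple


def works_by(triples: List[Triple], person: str) -> List[str]:
    """Collect a person's works from work pages and from the author page."""
    from_work_pages = {s for s, _, _ in match(triples, p="dbprop:author", o=person)}
    from_author_page = {
        o for _, _, o in match(triples, s=person, p="dbprop:notableWork")
    }
    return sorted(from_work_pages | from_author_page)


works = works_by(TRIPLES, "db:Mia_Ikumi")
```

Because both directions end up in the same triple collection, a single lookup covers statements that originally lived on different pages.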
One of the challenges in extracting information from Wikipedia is that the same concepts can be expressed using different parameters in infobox and other templates (for example, two differently named parameters that both record a person's place of birth). Because of this, queries about where people were born would have to search for both of these properties in order to get more complete results. To address this, the DBpedia Mapping Language was developed to map these properties to an ontology while reducing the number of synonyms. Due to the large diversity of infoboxes and properties in use on Wikipedia, the process of developing and improving these mappings has been opened to public contributions.
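The idea behind such a mapping can be sketched as a lookup table that folds synonymous parameter names into one ontology property. This is only an illustration of the principle, not the syntax of the DBpedia Mapping Language; the raw parameter names in the table are assumptions made for the example.

```python
# Hypothetical raw infobox parameter names, all folded into one property.
PARAMETER_TO_PROPERTY = {
    "birthplace": "dbo:birthPlace",
    "placeofbirth": "dbo:birthPlace",
    "birth_place": "dbo:birthPlace",
}


def normalise(raw_pairs: dict) -> dict:
    """Map raw parameter names to ontology properties, dropping unmapped ones."""
    mapped = {}
    for key, value in raw_pairs.items():
        prop = PARAMETER_TO_PROPERTY.get(key.strip().lower())
        if prop is not None:
            # Keep the first value seen if two synonyms appear together.
            mapped.setdefault(prop, value)
    return mapped


first = normalise({"birthplace": "Berlin"})
second = normalise({"PlaceOfBirth": "Berlin", "shoe_size": "42"})
```

After normalisation, a query only needs to ask for one property, whichever parameter name the article's editors happened to use.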
Version 2014 was released in September 2014. A main change since previous versions was the way abstract texts were extracted. Specifically, running a local mirror of Wikipedia and retrieving rendered abstracts from it made extracted texts considerably cleaner. Also, a new dataset extracted from Wikimedia Commons was introduced.
As of June 2021, DBpedia contained over 850 million triples.
Examples
DBpedia extracts factual information from Wikipedia pages, allowing users to find answers to questions where the information is spread across multiple Wikipedia articles. Data is accessed using an SQL-like query language for RDF called SPARQL.
For example, suppose one were interested in the Japanese ''shōjo'' manga series ''Tokyo Mew Mew'' and wanted to find the genres of other works written by its illustrator, Mia Ikumi. DBpedia combines information from Wikipedia's entries on ''Tokyo Mew Mew'', Mia Ikumi and this author's works, such as ''Super Doll Licca-chan'' and ''Koi Cupid''. Since DBpedia normalises information into a single database, the following query can be asked without needing to know exactly which entry carries each fragment of information, and will list related genres:
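PREFIX dbprop: <http://dbpedia.org/property/>
PREFIX db: <http://dbpedia.org/resource/>
SELECT ?who ?WORK ?genre WHERE {
  db:Tokyo_Mew_Mew dbprop:author ?who .          # who is credited on the series
  ?WORK dbprop:author ?who .                     # every work by the same person
  OPTIONAL { ?WORK dbprop:genre ?genre } }       # its genre, when one is recorded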
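A query of this kind is typically sent to a SPARQL endpoint over HTTP. The sketch below uses only the Python standard library and assumes the public endpoint at `https://dbpedia.org/sparql` is reachable; the query text, including the `dbprop:author` and `dbprop:genre` property names, is an illustrative assumption, and the helper names `build_request` and `run_query` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Public DBpedia endpoint; availability is assumed, not guaranteed.
ENDPOINT = "https://dbpedia.org/sparql"

# Illustrative query; the property names are assumptions.
QUERY = """
PREFIX dbprop: <http://dbpedia.org/property/>
PREFIX db: <http://dbpedia.org/resource/>
SELECT ?who ?work ?genre WHERE {
  db:Tokyo_Mew_Mew dbprop:author ?who .
  ?work dbprop:author ?who .
  OPTIONAL { ?work dbprop:genre ?genre }
}
"""


def build_request(query: str, endpoint: str = ENDPOINT) -> urllib.request.Request:
    """Build a GET request that carries the query in the 'query' parameter."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    headers = {"Accept": "application/sparql-results+json"}
    return urllib.request.Request(url, headers=headers)


def run_query(query: str) -> list:
    """Send the query and return the result bindings (needs network access)."""
    with urllib.request.urlopen(build_request(query), timeout=30) as response:
        payload = json.load(response)
    return payload["results"]["bindings"]


request = build_request(QUERY)
```

Calling `run_query(QUERY)` would return one dictionary per result row, keyed by variable name; whether any rows come back depends on the data currently served by the endpoint.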
Use cases
DBpedia has a broad scope of entities covering different areas of human knowledge. This makes it a natural hub for connecting datasets, where external datasets could link to its concepts. The DBpedia dataset is interlinked on the RDF level with various other Open Data datasets on the Web. This enables applications to enrich DBpedia data with data from these datasets. There are more than 45 million interlinks between DBpedia and external datasets, including Freebase, OpenCyc, UMBEL, GeoNames, MusicBrainz, the CIA World Factbook, DBLP, Project Gutenberg, DBtune Jamendo, Eurostat, UniProt, Bio2RDF, and US Census data. The Thomson Reuters initiative OpenCalais, the Linked Open Data project of ''The New York Times'', the Zemanta API and DBpedia Spotlight also include links to DBpedia. The BBC uses DBpedia to help organize its content. Faviki uses DBpedia for semantic tagging. Samsung also includes DBpedia in its "Knowledge Sharing Platform".
Such a rich source of structured cross-domain knowledge is fertile ground for artificial intelligence systems. DBpedia was used as one of the knowledge sources in IBM Watson's ''Jeopardy!''-winning system.
Amazon provides a DBpedia ''Public Data Set'' that can be integrated into Amazon Web Services applications.
Data about creators from DBpedia can be used for enriching artworks' sales observations.
The crowdsourcing software company Ushahidi built a prototype of its software that leveraged DBpedia to perform semantic annotations on citizen-generated reports. The prototype incorporated the "YODIE" (Yet another Open Data Information Extraction system) service developed by the University of Sheffield, which uses DBpedia to perform the annotations. The goal for Ushahidi was to improve the speed and ease with which incoming reports could be validated and managed.
DBpedia Spotlight
DBpedia Spotlight is a tool for annotating mentions of DBpedia resources in text. This allows linking unstructured information sources to the Linked Open Data cloud through DBpedia. DBpedia Spotlight performs named entity extraction, including entity detection and name resolution (in other words, disambiguation). It can also be used for named entity recognition and other information extraction tasks. DBpedia Spotlight aims to be customizable for many use cases. Instead of focusing on a few entity types, the project strives to support the annotation of all 3.5 million entities and concepts from more than 320 classes in DBpedia. The project started in June 2010 at the Web Based Systems Group at the Free University of Berlin.
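The detection half of this process can be illustrated with a toy dictionary-based annotator. This is a deliberately naive sketch and not Spotlight's method: it assumes a tiny hand-written table of surface forms, each tied to exactly one resource URI, so it performs no real disambiguation, which in practice relies on the surrounding context. The table entries and the function name `annotate` are assumptions made for the example.

```python
import re

# Hypothetical surface-form table; a real system derives a far larger one.
SURFACE_FORMS = {
    "berlin": "http://dbpedia.org/resource/Berlin",
    "free university of berlin": "http://dbpedia.org/resource/Free_University_of_Berlin",
    "leipzig university": "http://dbpedia.org/resource/Leipzig_University",
}


def annotate(text: str) -> list:
    """Return (offset, surface form, URI) for each non-overlapping mention."""
    # Longer forms are listed first so that they win over their substrings.
    forms = sorted(SURFACE_FORMS, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(form) for form in forms) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.group(0), SURFACE_FORMS[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


SAMPLE = "The Free University of Berlin is in Berlin."
annotations = annotate(SAMPLE)
```

Preferring the longest match keeps "Free University of Berlin" from being annotated as the city; choosing between several candidate resources for the same surface form is the harder step this sketch leaves out.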
DBpedia Spotlight is publicly available as a web service for testing and a Java/Scala API licensed via the Apache License. The DBpedia Spotlight distribution includes a jQuery plugin that allows developers to annotate pages anywhere on the Web by adding one line to their page. Clients are also available in Java or PHP. The tool handles various languages through its demo page and web services. Internationalization is supported for any language that has a Wikipedia edition.
Archivo ontology database
Since 2020, the DBpedia project has provided a regularly updated database of web-accessible ontologies written in the OWL ontology language. Archivo also provides a four-star rating scheme for the ontologies it scrapes, based on accessibility, quality, and related fitness-for-use criteria. For instance, SHACL compliance for graph-based data is evaluated when appropriate. Ontologies should also contain metadata about their characteristics and specify a public license describing their terms of use. The Archivo database contains 1368 entries.
History
DBpedia was initiated in 2007 by Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak and Zachary Ives.
See also
* BabelNet
* Semantic MediaWiki
* Wikidata
* YAGO (database)