Table Extraction

Table extraction is the process of recognizing and separating a table from a large document, possibly also recognizing individual rows, columns or elements. It may be regarded as a special form of information extraction. Table extraction from webpages can take advantage of the special HTML elements that exist for tables, e.g., the "table" tag, and programming libraries may implement table extraction from webpages. The Python pandas software library can extract tables from HTML webpages via its read_html() function. More challenging is table extraction from PDFs or scanned images, where there usually is no table-specific machine-readable markup. Systems that extract data from tables in scientific PDFs have been described.
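
The HTML case can be illustrated with a minimal sketch that uses only the Python standard library. It is not how pandas or any other library implements the task; the names TableParser and extract_tables and the sample page are assumptions made for this example, and nested tables, row spans and malformed markup are deliberately ignored.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by row and table.

    Hypothetical helper for illustration; it assumes flat, well-formed tables.
    """

    def __init__(self):
        super().__init__()
        self.tables = []   # each table is a list of rows, each row a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. surrounding paragraphs) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def extract_tables(html):
    """Return all tables found in an HTML string as nested lists of strings."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.tables


# Made-up sample page, used only to exercise the sketch.
page = """
<html><body>
<p>Some surrounding text.</p>
<table>
  <tr><th>Tool</th><th>Type</th></tr>
  <tr><td>pandas</td><td>library</td></tr>
</table>
</body></html>
"""

print(extract_tables(page))
# [[['Tool', 'Type'], ['pandas', 'library']]]
```

With pandas, the corresponding call is read_html(), which returns a list of DataFrame objects, one per table found; it relies on an additional HTML parsing library being installed.
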
Wikipedia presents some of its information in tables, and, e.g., 3.5 million tables can be extracted from the English Wikipedia. Some of the tables have a specific format, e.g., the so-called infoboxes. Large-scale table extraction of Wikipedia infoboxes forms one of the sources for DBpedia. Commercial web services for table extraction exist, e.g., Amazon Textract, Google's Document AI, IBM Watson Discovery, and Microsoft Form Recognizer. Open source tools also exist, e.g., PDFFigures 2.0, which has been used in Semantic Scholar. In a comparison published in 2017, the researchers found the proprietary program ABBYY FineReader to yield the best PDF table extraction performance among six different tools evaluated. In a 2023 benchmark evaluation, Adobe Extract, a cloud-based API that employs Adobe’s Sensei AI platform, performed best among five tools evaluated for table extraction.
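
As a rough illustration of the infobox case mentioned above, the sketch below reads the "| key = value" parameter lines of a single infobox template in wikitext into a dictionary. It is a simplification under stated assumptions, not DBpedia's extraction method: the function name parse_infobox and the sample data are made up, and nested templates, multi-line values and links are not handled.

```python
import re

# Matches one "| key = value" parameter line of a template call.
FIELD = re.compile(r"^\s*\|\s*([^=|]+?)\s*=\s*(.*?)\s*$")


def parse_infobox(wikitext):
    """Return the attribute-value pairs of a flat, single infobox as a dict.

    Hypothetical helper for illustration; assumes one parameter per line.
    """
    fields = {}
    for line in wikitext.splitlines():
        match = FIELD.match(line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


# Made-up infobox, used only to exercise the sketch.
sample = """{{Infobox example
| name = Example Town
| population = 1234
| area_km2 = 56.7
}}"""

print(parse_infobox(sample))
# {'name': 'Example Town', 'population': '1234', 'area_km2': '56.7'}
```
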

