Apache Tika

Apache Tika is a content detection and analysis framework, written in Java and stewarded at the Apache Software Foundation. It detects and extracts metadata and text from over a thousand different file types. In addition to a Java library, it has server and command-line editions suitable for use from other programming languages.
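
To make the idea of content detection concrete, the following is a minimal sketch of detection by leading "magic" bytes, one common way to identify a file type without trusting its name. It is an illustration only and not Tika's implementation: the class name `MagicSniffer`, its tiny signature table, and the fallback to `application/octet-stream` are assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of magic-number detection; not part of Tika.
final class MagicSniffer {

    // One known byte prefix and the MIME type it suggests.
    private static final class Signature {
        final byte[] prefix;
        final String mimeType;

        Signature(byte[] prefix, String mimeType) {
            this.prefix = prefix;
            this.mimeType = mimeType;
        }
    }

    // A deliberately small table; a real detector knows far more formats.
    private static final Signature[] SIGNATURES = {
        new Signature("%PDF-".getBytes(StandardCharsets.US_ASCII), "application/pdf"),
        new Signature(new byte[] {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A}, "image/png"),
        new Signature(new byte[] {'P', 'K', 0x03, 0x04}, "application/zip"),
        new Signature("GIF8".getBytes(StandardCharsets.US_ASCII), "image/gif"),
    };

    // Returns the MIME type of the first matching signature,
    // or a generic binary type when nothing matches.
    static String detect(byte[] head) {
        for (Signature signature : SIGNATURES) {
            if (startsWith(head, signature.prefix)) {
                return signature.mimeType;
            }
        }
        return "application/octet-stream";
    }

    // True when data begins with every byte of prefix, in order.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Passing the first few bytes of a file to `MagicSniffer.detect` returns a MIME type string. Prefix matching alone cannot tell apart formats that share a container, such as the many ZIP-based document types, which is one reason a full detector has to combine several kinds of evidence.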


History

The project originated as part of the Apache Nutch codebase, to provide content identification and extraction when crawling. In 2007 it was separated out to make it more extensible and usable by content management systems, other web crawlers, and information retrieval systems. The standalone Tika was founded by Jérôme Charron, Chris Mattmann and Jukka Zitting. In 2011 Chris Mattmann and Jukka Zitting released the Manning book "Tika in Action", and the project released version 1.0.


Features

Tika can identify more than 1400 file types from the Internet Assigned Numbers Authority taxonomy of MIME types. For most of the more common and popular formats, Tika also provides content extraction, metadata extraction and language identification. It can also extract text from images by using the OCR software Tesseract. While Tika is written in Java, it is widely used from other languages: the RESTful server and the CLI tool give non-Java programs access to Tika functionality.
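
As a rough sketch of how a client might talk to the RESTful server over plain HTTP, the example below builds a request that uploads a document and asks for plain text back. It uses only the JDK's `java.net.http` client, so the same request could be issued from any language with an HTTP library. The class name `TikaServerClient` is hypothetical, and the example assumes a Tika server is reachable at the base URL passed in and exposes its `/tika` resource for text extraction; check the server documentation for the version in use.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper; assumes a running Tika server at serverBase.
final class TikaServerClient {

    // Builds a PUT request that uploads the document bytes to the
    // server's /tika resource and asks for a plain-text response.
    static HttpRequest textRequest(URI serverBase, byte[] document) {
        return HttpRequest.newBuilder(serverBase.resolve("/tika"))
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    // Sends the request and returns the response body as text.
    // Throws IOException on any non-200 status.
    static String extractText(URI serverBase, byte[] document)
            throws IOException, InterruptedException {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(textRequest(serverBase, document),
                      HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Tika server returned HTTP " + response.statusCode());
        }
        return response.body();
    }
}
```

With a server listening locally, a call such as `TikaServerClient.extractText(URI.create("http://localhost:9998"), Files.readAllBytes(path))` would return the extracted text; the port shown is an assumption about the local setup. Keeping request construction separate from sending makes the request easy to inspect without a running server.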


Notable uses

Tika is used by financial institutions including the Fair Isaac Corporation (FICO) and Goldman Sachs, by NASA and academic researchers, and by major content management systems including Drupal and Alfresco, to analyze large amounts of content and to make it available in common formats using information retrieval techniques. On April 4, 2016, Forbes published an article identifying Tika as one of the key technologies used by more than 400 journalists to analyze 11.5 million leaked documents that expose an international scandal involving world leaders storing money in offshore shell corporations. The leaked documents and the project to analyze them are referred to as the Panama Papers.


See also

* Magic number

