Apache Tika is a content detection and
analysis framework, written in
Java, stewarded at the
Apache Software Foundation
. It detects and extracts metadata and text from over a thousand different
file types. In addition to a
Java library, it has server and command-line editions suitable for use from other programming languages.
History
The project originated as part of the
Apache Nutch codebase, to provide content identification and extraction when
crawling. In 2007, it was separated out, to make it more extensible and usable by
content management systems, other
Web crawlers, and information retrieval systems. The standalone Tika was founded by Jérôme Charron,
Chris Mattmann
and Jukka Zitting. In 2011 Chris Mattmann and Jukka Zitting released the Manning book "Tika in Action", and the project released version 1.0.
Features
Tika can identify more than 1,400 file types from the
Internet Assigned Numbers Authority taxonomy of
MIME types. For most of the more common and popular formats, Tika also provides content extraction, metadata extraction, and language identification.
It can also extract text from images using the
OCR software
Tesseract.
Although Tika is written in
Java, it is widely used from other languages. The
RESTful server and
command-line tool give non-Java programs access to Tika's functionality.
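As an illustration of access from a non-Java program, the following is a minimal Python sketch that talks to a Tika server over HTTP using only the standard library. It assumes a server is already running locally on port 9998 and that it exposes the /detect/stream, /tika, and /meta resources; these values, and the helper names build_request, call_tika, detect_type, extract_text, and extract_metadata, are illustrative and should be checked against the Tika version in use.

```python
import json
import urllib.request

# Assumption: a Tika server is already running at this address.
TIKA_URL = "http://localhost:9998"


def build_request(path, data, accept="text/plain", base_url=TIKA_URL):
    """Build (but do not send) a PUT request carrying raw document bytes."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=data,
        method="PUT",
        headers={
            "Accept": accept,
            # Generic type, so the server inspects the bytes itself.
            "Content-Type": "application/octet-stream",
        },
    )


def call_tika(path, data, accept="text/plain", base_url=TIKA_URL):
    """Send the document to the given server resource and return the body."""
    request = build_request(path, data, accept, base_url)
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8", errors="replace")


def detect_type(data):
    """Ask the server for the MIME type of the document."""
    return call_tika("/detect/stream", data).strip()


def extract_text(data):
    """Ask the server for the plain-text content of the document."""
    return call_tika("/tika", data)


def extract_metadata(data):
    """Ask the server for the document metadata as a dictionary."""
    return json.loads(call_tika("/meta", data, accept="application/json"))
```

With a server running, passing the bytes of a file to detect_type, extract_text, or extract_metadata returns the detected type, the extracted text, or the metadata respectively. Building the request separately from sending it keeps the HTTP details inspectable without a live server.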
Notable uses
Tika is used by financial institutions including the