Tika
Developer: Apache Software Foundation
Stable release: 3.2.3[1] / 9 September 2025
Written in: Java
Operating system: Cross-platform
Type: Search and index API
License: Apache License 2.0
Website: tika.apache.org
Repository: Tika Repository

Apache Tika is a content detection and analysis framework written in Java and stewarded by the Apache Software Foundation.[2] It detects and extracts metadata and text from over a thousand different file types. In addition to a Java library, it provides server and command-line editions suitable for use from other programming languages.

History

The project originated as part of the Apache Nutch codebase, where it provided content identification and extraction during crawling. In 2007, it was separated out to make it more extensible and usable by content management systems, other Web crawlers, and information retrieval systems. The standalone Tika was founded by Jérôme Charron, Chris Mattmann and Jukka Zitting.[3] In 2011, Chris Mattmann and Jukka Zitting published the Manning book "Tika in Action", and the project released version 1.0.

Features

Tika can identify more than 1,400 file types from the Internet Assigned Numbers Authority taxonomy of MIME types. For most of the more common and popular formats,[4] Tika also provides content extraction, metadata extraction and language identification.
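One common technique for this kind of identification is to compare the first bytes of a file against known signatures ("magic bytes") and map a match to a MIME type. The following is a minimal sketch of that idea using only the Java standard library. It is a simplified illustration, not Tika's own detector: the class name MagicSniffer is hypothetical, and the four-entry signature table stands in for the far larger set of types a real detector covers.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative magic-byte sniffer. MagicSniffer is a hypothetical name, not a Tika class. */
final class MagicSniffer {
    // A few well-known file signatures mapped to their MIME types.
    private static final Map<String, byte[]> SIGNATURES = new LinkedHashMap<>();
    static {
        SIGNATURES.put("application/pdf", "%PDF-".getBytes(StandardCharsets.US_ASCII));
        SIGNATURES.put("image/png",
                new byte[] {(byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n'});
        SIGNATURES.put("image/gif", "GIF8".getBytes(StandardCharsets.US_ASCII));
        SIGNATURES.put("application/zip", new byte[] {'P', 'K', 0x03, 0x04});
    }

    /** Returns the first MIME type whose signature prefixes the data, else a generic fallback. */
    static String detect(byte[] head) {
        for (Map.Entry<String, byte[]> entry : SIGNATURES.entrySet()) {
            if (startsWith(head, entry.getValue())) {
                return entry.getKey();
            }
        }
        return "application/octet-stream";
    }

    // True when data begins with every byte of prefix, in order.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(sample)); // prints "application/pdf"
    }
}
```

Signature matching alone cannot tell apart formats that share a container (a ZIP signature also begins many office document formats), which is why practical detectors combine it with other hints such as file names and container inspection.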

It can also extract text from images by using the OCR software Tesseract.[5]

While Tika is written in Java, it is widely used from other languages.[6] The RESTful server and the command-line tool give non-Java programs access to Tika's functionality.
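Because the server speaks plain HTTP, any language with an HTTP client can call it. As a rough sketch, the request for text extraction could be built as follows with the JDK's own java.net.http client (Java 11 or later). The details here are assumptions rather than anything stated above: a Tika server is taken to be listening on localhost port 9998, and text extraction is taken to be a PUT to the /tika path with an Accept header of text/plain. The class name TikaServerClient is hypothetical.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper for talking to a Tika server over HTTP. */
final class TikaServerClient {

    /** Builds a PUT request asking the server to return plain text for the document bytes. */
    static HttpRequest extractTextRequest(String baseUrl, byte[] document) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/tika")) // assumed endpoint
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: TikaServerClient <file>");
            return;
        }
        byte[] document = Files.readAllBytes(Path.of(args[0]));
        // Assumes a server is already running at this address.
        HttpRequest request = extractTextRequest("http://localhost:9998", document);
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```

Keeping request construction separate from sending makes the HTTP contract easy to inspect without a running server. The same request shape can be reproduced from any other language's HTTP library.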

Notable uses

Tika is used by financial institutions including the Fair Isaac Corporation (FICO)[7] and Goldman Sachs,[8] by NASA and academic researchers,[9] and by major content management systems including Drupal[10] and Alfresco[11] to analyze large amounts of content and to make it available in common formats using information retrieval techniques.

On April 4, 2016,[12] Forbes published an article identifying Tika as one of the key technologies used by more than 400 journalists to analyze 11.5 million leaked documents that exposed an international scandal involving world leaders storing money in offshore shell corporations. The leaked documents and the project to analyze them are referred to as the Panama Papers.

References

  1. ^ https://dist.apache.org/repos/dist/release/tika/3.2.3/CHANGES-3.2.3.txt. Retrieved 22 February 2026.
  2. ^ "Apache Tika". Retrieved 2016-04-15.
  3. ^ "Tika Proposal". Retrieved 2016-04-15.
  4. ^ "The Apache Software Foundation". Apache Tika formats page. Retrieved 16 April 2016.
  5. ^ "TikaOCR". Apache Tika. 2019-03-26. Retrieved 2019-12-02.
  6. ^ "API Bindings for Tika". Apache Tika. Retrieved 2016-04-17.
  7. ^ "FICO to Engage Kaggle's Community of 180,000 Data Scientists to Drive Innovation in the FICO Analytic Cloud | FICO". FICO | Decisions. Archived from the original on 2016-06-03. Retrieved 2016-04-15.
  8. ^ "Goldman Sachs Puts Elasticsearch To Work - InformationWeek". InformationWeek. Retrieved 2017-06-21.
  9. ^ "Studying polar data with the help of Apache Tika". Opensource.com. Retrieved 2016-04-15.
  10. ^ "Text Extract for Drupal using Tika | Drupal.org". www.drupal.org. 30 July 2012. Retrieved 2016-04-15.
  11. ^ "Content Transformation and Metadata Extraction with Apache Tika - alfrescowiki". wiki.alfresco.com. 5 June 2015. Retrieved 2016-04-15.
  12. ^ Fox-Brewster, Thomas. "From Encrypted Drives To Amazon's Cloud -- The Amazing Flight Of The Panama Papers". Forbes. Retrieved 2016-04-15.
