The information trapped in text files, PDFs, and other digital content is a valuable asset that can be very difficult to discover and use. Apache Tika is an open source toolkit that makes it easy for search engines, content management systems, and other applications to detect and extract content from digital documents in all major file formats.
"Tika in Action" is a hands-on guide for developers working with search engines, content management systems, and similar applications who want to exploit the information locked in digital documents. It introduces the world of mining text and binary documents as well as other information sources. The book shows where Tika fits within this landscape and how readers can use Tika to build and extend applications. The book's many case studies give real-world experience from domains ranging from search engines to digital asset management and scientific data processing.