December 2011 – Celinio's technical blog : learning and adapting

I recently discovered a useful library called Apache Tika that makes it easy to extract metadata information from many types of files.
The ECM Alfresco makes use of Apache Tika for both metadata extraction and content transformation.
With Apache Tika, you do not have to worry about which parser to use with a type of file. Apache Tika will look for a parser implementation that matches the type of the document, once it is known, using Mime Type detection.
Here is a basic usage of the library to extract metadata information from files such as documents (PDF/DOC/XLS), images (JPG), songs (MP3).

You can start from a maven archetype such as quickstart. Then all you need is to add the following two dependencies :

<dependencies>
...
 <dependency>
	            <groupId>org.apache.tika</groupId>
	            <artifactId>tika-core</artifactId>
	            <version>1.0</version>
 </dependency>
 <dependency>
	            <groupId>org.apache.tika</groupId>
	            <artifactId>tika-parsers</artifactId>
	            <version>1.0</version>
 </dependency>
</dependencies>

The org.apache.tika.parser.AutoDetectParser class is in charge of dispatching the incoming document to the appropriate parser. It is especially useful when the type of the document is not known in advance.

package net.celinio.tika.firstProject;

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class MetaDataExtraction {

	public static void main(String[] args) {
		 
		try {
			//String resourceLocation = "d:\\tempTika\\TikainAction.pdf";
			//String resourceLocation = "d:\\tempTika\\06-takefive.mp3";			
			String resourceLocation = "d:\\tempTika\\mariniere14juillet2011.jpg";
			//String resourceLocation = "d:\\tempTika\\02b-blank-timetable.doc";
			//String resourceLocation = "d:\\tempTika\\examstudytable.doc";
			//String resourceLocation = "d:\\tempTika\\timetable.xls";
			
			File file = new File(resourceLocation);
			 
			InputStream input = new FileInputStream(file);			 
			System.out.println( file.getPath());				
			
			Metadata metadata = new Metadata();
			 
			BodyContentHandler handler = new BodyContentHandler(10*1024*1024);
			AutoDetectParser parser = new AutoDetectParser();		
	
			parser.parse(input, handler, metadata);
			 /*
			String content = new Tika().parseToString(f);
			//System.out.println("Content: " + content);
			//System.out.println("Content: " + handler.toString());
			System.out.println("Title: " + metadata.get(Metadata.TITLE));
			System.out.println("Last author: " + metadata.get(Metadata.LAST_AUTHOR));
			System.out.println("Last modified: " + metadata.get(Metadata.LAST_MODIFIED));
			System.out.println("Content type: " + metadata.get(Metadata.CONTENT_TYPE));
			System.out.println("Application name: " + metadata.get(Metadata.APPLICATION_NAME));
			System.out.println("Author: " + metadata.get(Metadata.AUTHOR));
			System.out.println("Line count: " + metadata.get(Metadata.LINE_COUNT));
			System.out.println("Word count: " + metadata.get(Metadata.WORD_COUNT));
			System.out.println("Page count: " + metadata.get(Metadata.PAGE_COUNT));
			System.out.println("MIME_TYPE_MAGIC: " + metadata.get(Metadata.MIME_TYPE_MAGIC));
			System.out.println("SUBJECT: " + metadata.get(Metadata.SUBJECT));
			
			*/

			String[] metadataNames = metadata.names();
			
			// Display all metadata
			for(String name : metadataNames){
				System.out.println(name + ": " + metadata.get(name));
			}
			
			}
			catch (Exception e) {
				e.printStackTrace();
			}
			 
	}
}

Line 30, I am using the BodyContentHandler constructor that takes an argument because i need to increase the size limit. Otherwise the WriteLimitReachedException exception is raised when parsing the file TikainAction.pdf (16,4 MB) :

org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: 
Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).

Here is the output for the image file mariniere14juillet2011.jpg :

d:\tempTika\mariniere14juillet2011.jpg
Number of Components: 3
Windows XP Title: Popo
Date/Time Original: 2011:07:14 14:16:10
Image Height: 600 pixels
Image Description: Popo
Data Precision: 8 bits
Sub-Sec Time Digitized: 31
tiff:BitsPerSample: 8
Windows XP Subject: Moules
date: 2011-07-14T14:16:10
exif:DateTimeOriginal: 2011-07-14T14:16:10
Component 1: Y component: Quantization table 0, Sampling factors 2 horiz/2 vert
tiff:ImageLength: 600
Component 2: Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert
Component 3: Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert
Date/Time Digitized: 2011:07:14 14:16:10
description: Popo
tiff:ImageWidth: 800
Unknown tag (0xea1c): 28 -22
Image Width: 800 pixels
Sub-Sec Time Original: 31
Content-Type: image/jpeg
Artist: Popo;Cel
Windows XP Author: Popo;Cel

And the output for the song file 06-takefive.mp3 :

d:\tempTika\06-takefive.mp3
xmpDM:releaseDate: null
xmpDM:audioChannelType: Stereo
xmpDM:album: Take Five
Author: Dave Brubeck
xmpDM:artist: Dave Brubeck
channels: 2
xmpDM:audioSampleRate: 44100
xmpDM:logComment: null
xmpDM:trackNumber: 6/8
version: MPEG 3 Layer III Version 1
xmpDM:composer: null
xmpDM:audioCompressor: MP3
title: Take Five
samplerate: 44100
xmpDM:genre: null
Content-Type: audio/mpeg

And the output for the ebook TikainAction.pdf :

d:\tempTika\TikainAction.pdf
xmpTPg:NPages: 257
Creation-Date: 2011-11-09T12:20:20Z
title: Tika in Action
created: Wed Nov 09 13:20:20 CET 2011
Licensed to: Celinio Fernandes  <xxx@yyy.com>
Last-Modified: 2011-11-16T12:25:00Z
producer: Acrobat Distiller 9.4.6 (Windows)
Author: Chris A. Mattmann, Jukka L. Zitting
Content-Type: application/pdf
creator: FrameMaker 8.0

And the output for the Word document 02b-blank-timetable.doc :

d:\tempTika\02b-blank-timetable.doc
Revision-Number: 4
Comments: 
Last-Author: CeLTS
Template: Normal.dot
Page-Count: 1
subject: 
Application-Name: Microsoft Office Word
Author: CeLTS
Word-Count: 1921
xmpTPg:NPages: 1
Edit-Time: 3600000000
Creation-Date: 2006-02-09T00:31:00Z
title: Study Timetable
Character Count: 10951
Company: Monash University
Content-Type: application/msword
Keywords: 
Last-Save-Date: 2006-10-30T05:52:00Z

As you can see, the list of metadata information (title, author, image height, etc) is varying, depending on which parser is used and of course which type of document it is.
You can also search the content of the files as Apache Tika provides access to the textual content of files.
By the way, there is a Tika GUI which is a handy tool that makes it possible to extract metadata information by simply drag and dropping a file into it.
To launch it, just download the jar tika-app-1.0.jar and run it :

java -jar tika-app-1.0.jar --gui

Drag and drop a file into it and read the extracted metadata :

Links :

http://tika.apache.org/
The book Tika in Action (Manning)