I recently discovered a useful library called Apache Tika that makes it easy to extract metadata information from many types of files.
The ECM Alfresco makes use of Apache Tika for both metadata extraction and content transformation.
With Apache Tika, you do not have to worry about which parser to use for a given type of file. Tika first detects the MIME type of the document, then looks for a parser implementation that matches that type.
Here is a basic usage of the library to extract metadata from files such as documents (PDF/DOC/XLS), images (JPG) and songs (MP3).
You can start from a Maven archetype such as quickstart. Then all you need to do is add the following two dependencies:
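To give an idea of what detection from file content can look like, here is a minimal sketch of my own that guesses a MIME type from the first bytes of a file (the "magic" bytes). It is not Tika's implementation: the class name MagicSniffer and the method sniff are hypothetical, only three signatures are checked, and Tika combines this kind of check with other hints such as the file name.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only (not part of Apache Tika).
public class MagicSniffer {

    // Guesses a MIME type from the leading bytes, or returns a generic fallback.
    public static String sniff(byte[] head) {
        if (startsWith(head, new byte[] {'%', 'P', 'D', 'F'})) {
            return "application/pdf";
        }
        if (startsWith(head, new byte[] {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF})) {
            return "image/jpeg";
        }
        if (startsWith(head, new byte[] {'I', 'D', '3'})) {
            return "audio/mpeg"; // MP3 file that begins with an ID3v2 tag
        }
        return "application/octet-stream";
    }

    // True when data begins with exactly the bytes in prefix.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdfHead = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(pdfHead)); // prints application/pdf
    }
}
```

Once the type is known, picking a parser is essentially a lookup from MIME type to parser implementation.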
<dependencies>
  ...
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>1.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.0</version>
  </dependency>
</dependencies>
The org.apache.tika.parser.AutoDetectParser class is in charge of dispatching the incoming document to the appropriate parser. It is especially useful when the type of the document is not known in advance.
package net.celinio.tika.firstProject;

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class MetaDataExtraction {

    public static void main(String[] args) {
        try {
            //String resourceLocation = "d:\\tempTika\\TikainAction.pdf";
            //String resourceLocation = "d:\\tempTika\\06-takefive.mp3";
            String resourceLocation = "d:\\tempTika\\mariniere14juillet2011.jpg";
            //String resourceLocation = "d:\\tempTika\\02b-blank-timetable.doc";
            //String resourceLocation = "d:\\tempTika\\examstudytable.doc";
            //String resourceLocation = "d:\\tempTika\\timetable.xls";

            File file = new File(resourceLocation);

            InputStream input = new FileInputStream(file);

            System.out.println( file.getPath());

            Metadata metadata = new Metadata();

            BodyContentHandler handler = new BodyContentHandler(10*1024*1024);
            AutoDetectParser parser = new AutoDetectParser();

            parser.parse(input, handler, metadata);
            /*
            String content = new Tika().parseToString(f);
            //System.out.println("Content: " + content);
            //System.out.println("Content: " + handler.toString());
            System.out.println("Title: " + metadata.get(Metadata.TITLE));
            System.out.println("Last author: " + metadata.get(Metadata.LAST_AUTHOR));
            System.out.println("Last modified: " + metadata.get(Metadata.LAST_MODIFIED));
            System.out.println("Content type: " + metadata.get(Metadata.CONTENT_TYPE));
            System.out.println("Application name: " + metadata.get(Metadata.APPLICATION_NAME));
            System.out.println("Author: " + metadata.get(Metadata.AUTHOR));
            System.out.println("Line count: " + metadata.get(Metadata.LINE_COUNT));
            System.out.println("Word count: " + metadata.get(Metadata.WORD_COUNT));
            System.out.println("Page count: " + metadata.get(Metadata.PAGE_COUNT));
            System.out.println("MIME_TYPE_MAGIC: " + metadata.get(Metadata.MIME_TYPE_MAGIC));
            System.out.println("SUBJECT: " + metadata.get(Metadata.SUBJECT));
            */

            String[] metadataNames = metadata.names();

            // Display all metadata
            for(String name : metadataNames){
                System.out.println(name + ": " + metadata.get(name));
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
When creating the handler, I am using the BodyContentHandler constructor that takes an argument (here 10*1024*1024) because I need to increase the default write limit of 100,000 characters. Otherwise a WriteLimitReachedException is raised when parsing the file TikainAction.pdf (16.4 MB):
org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
Here is the output for the image file mariniere14juillet2011.jpg:
d:\tempTika\mariniere14juillet2011.jpg
Number of Components: 3
Windows XP Title: Popo
Date/Time Original: 2011:07:14 14:16:10
Image Height: 600 pixels
Image Description: Popo
Data Precision: 8 bits
Sub-Sec Time Digitized: 31
tiff:BitsPerSample: 8
Windows XP Subject: Moules
date: 2011-07-14T14:16:10
exif:DateTimeOriginal: 2011-07-14T14:16:10
Component 1: Y component: Quantization table 0, Sampling factors 2 horiz/2 vert
tiff:ImageLength: 600
Component 2: Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert
Component 3: Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert
Date/Time Digitized: 2011:07:14 14:16:10
description: Popo
tiff:ImageWidth: 800
Unknown tag (0xea1c): 28 -22
Image Width: 800 pixels
Sub-Sec Time Original: 31
Content-Type: image/jpeg
Artist: Popo;Cel
Windows XP Author: Popo;Cel
And the output for the song file 06-takefive.mp3:
d:\tempTika\06-takefive.mp3
xmpDM:releaseDate: null
xmpDM:audioChannelType: Stereo
xmpDM:album: Take Five
Author: Dave Brubeck
xmpDM:artist: Dave Brubeck
channels: 2
xmpDM:audioSampleRate: 44100
xmpDM:logComment: null
xmpDM:trackNumber: 6/8
version: MPEG 3 Layer III Version 1
xmpDM:composer: null
xmpDM:audioCompressor: MP3
title: Take Five
samplerate: 44100
xmpDM:genre: null
Content-Type: audio/mpeg
And the output for the ebook TikainAction.pdf:
d:\tempTika\TikainAction.pdf
xmpTPg:NPages: 257
Creation-Date: 2011-11-09T12:20:20Z
title: Tika in Action
created: Wed Nov 09 13:20:20 CET 2011
Licensed to: Celinio Fernandes <xxx@yyy.com>
Last-Modified: 2011-11-16T12:25:00Z
producer: Acrobat Distiller 9.4.6 (Windows)
Author: Chris A. Mattmann, Jukka L. Zitting
Content-Type: application/pdf
creator: FrameMaker 8.0
And the output for the Word document 02b-blank-timetable.doc:
d:\tempTika\02b-blank-timetable.doc
Revision-Number: 4
Comments:
Last-Author: CeLTS
Template: Normal.dot
Page-Count: 1
subject:
Application-Name: Microsoft Office Word
Author: CeLTS
Word-Count: 1921
xmpTPg:NPages: 1
Edit-Time: 3600000000
Creation-Date: 2006-02-09T00:31:00Z
title: Study Timetable
Character Count: 10951
Company: Monash University
Content-Type: application/msword
Keywords:
Last-Save-Date: 2006-10-30T05:52:00Z
As you can see, the set of metadata fields (title, author, image height, etc.) varies depending on the type of document and therefore on which parser is used.
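Because the key names differ from one file type to another (the JPEG reports "Windows XP Title" while the PDF and the MP3 report "title"), code that wants a single value has to try several keys. Here is a small sketch of that idea. The class MetadataLookup and the method firstAvailable are names I made up, and I assume the metadata has first been copied into a plain Map, for instance inside the loop over metadata.names() in the program above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only (not part of Apache Tika).
public class MetadataLookup {

    // Returns the first non-blank value found among the candidate keys, or null.
    public static String firstAvailable(Map<String, String> metadata, String... keys) {
        for (String key : keys) {
            String value = metadata.get(key);
            if (value != null && !value.trim().isEmpty()) {
                return value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Two entries copied from the JPEG output shown earlier.
        Map<String, String> jpeg = new LinkedHashMap<String, String>();
        jpeg.put("Windows XP Title", "Popo");
        jpeg.put("Content-Type", "image/jpeg");

        // "title" is absent for this file, so the second candidate is used.
        System.out.println(firstAvailable(jpeg, "title", "Windows XP Title")); // prints Popo
    }
}
```

The order of the candidate keys sets the priority, so the most trusted key should come first.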
You can also search the content of the files as Apache Tika provides access to the textual content of files.
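As a rough illustration, once the text has been extracted (for example with handler.toString() after the call to parser.parse in the program above), searching it is ordinary string processing. The class TextSearch and the method countOccurrences below are hypothetical names of mine, and the sample string simply stands in for extracted text.

```java
import java.util.Locale;

// Hypothetical helper, for illustration only (not part of Apache Tika).
public class TextSearch {

    // Counts non-overlapping, case-insensitive occurrences of term in text.
    public static int countOccurrences(String text, String term) {
        if (term.isEmpty()) {
            return 0;
        }
        String haystack = text.toLowerCase(Locale.ROOT);
        String needle = term.toLowerCase(Locale.ROOT);
        int count = 0;
        int from = 0;
        while ((from = haystack.indexOf(needle, from)) != -1) {
            count++;
            from += needle.length();
        }
        return count;
    }

    public static void main(String[] args) {
        // Stand-in for the text returned by the content handler.
        String extracted = "Tika in Action covers Tika parsers.";
        System.out.println(countOccurrences(extracted, "tika")); // prints 2
    }
}
```

For anything beyond a few files, the extracted text would normally be fed to a real search index instead of being scanned like this.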
By the way, there is a Tika GUI, a handy tool that lets you extract metadata by simply dragging and dropping a file into it.
To launch it, just download the jar tika-app-1.0.jar and run it:
java -jar tika-app-1.0.jar --gui
Drag and drop a file into the window to read the extracted metadata.