I'm adding some functionality to a program so that it can accurately determine a file's type by detecting its MIME type. I've already tried a few methods:
Method 1:
So far, the most accurate tool I've found for determining a file's MIME type is Apache Tika. This is a slight modification of what I currently use (with Tika version 1.0):
    import java.io.File;
    import java.io.IOException;

    import org.apache.tika.detect.DefaultDetector;
    import org.apache.tika.detect.Detector;
    import org.apache.tika.io.TikaInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.mime.MimeTypes;

    // The class name is arbitrary; put the field and method wherever they fit.
    public final class MimeTypeDetector {

        // A single detector instance, reused for every call.
        private static final Detector DETECTOR =
                new DefaultDetector(MimeTypes.getDefaultMimeTypes());

        public static String detectMimeType(final File file) throws IOException {
            TikaInputStream tikaIS = null;
            try {
                tikaIS = TikaInputStream.get(file);

                /*
                 * You might not want to provide the file's name. If you provide an
                 * Excel document with a .xls extension, it will get it correct right
                 * away; but if you provide an Excel document with a .doc extension,
                 * it will guess it to be a Word document.
                 */
                final Metadata metadata = new Metadata();
                // metadata.set(Metadata.RESOURCE_NAME_KEY, file.getName());

                return DETECTOR.detect(tikaIS, metadata).toString();
            } finally {
                // Always release the stream, even if detection throws.
                if (tikaIS != null) {
                    tikaIS.close();
                }
            }
        }
    }
Since Tika uses magic numbers but also inspects the contents of a file when it is unsure, detection can be somewhat slow (it took my PC 3.268 seconds to examine 15 files).
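To show why magic numbers alone are not always enough, here is a minimal sketch of a bare magic-number check in plain Java. It is not Tika's implementation; the class `MagicSniffer`, the method `sniff`, and the `application/x-ole-storage` label are names I picked for illustration. The point is that legacy Word and Excel files start with the same OLE2 header, so a check like this cannot tell them apart without looking further into the file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of Tika: looks only at the leading bytes.
final class MagicSniffer {

    private static final byte[] PDF = {'%', 'P', 'D', 'F', '-'};
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
    private static final byte[] ZIP = {'P', 'K', 0x03, 0x04};
    // OLE2 compound file header, shared by legacy .doc, .xls and .ppt files.
    private static final byte[] OLE2 = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
                                        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};

    private MagicSniffer() {
    }

    /** Guesses a type from the first bytes of a file. */
    static String sniff(final byte[] head) {
        if (startsWith(head, PDF)) {
            return "application/pdf";
        }
        if (startsWith(head, PNG)) {
            return "image/png";
        }
        if (startsWith(head, ZIP)) {
            return "application/zip";
        }
        if (startsWith(head, OLE2)) {
            // Placeholder label: the header alone cannot separate Word from Excel.
            return "application/x-ole-storage";
        }
        return "application/octet-stream";
    }

    /** Reads up to 8 leading bytes of the file and sniffs them. */
    static String sniff(final Path file) throws IOException {
        final byte[] head = new byte[8];
        int n = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int r;
            while (n < head.length && (r = in.read(head, n, head.length - n)) > 0) {
                n += r;
            }
        }
        return sniff(Arrays.copyOf(head, n));
    }

    private static boolean startsWith(final byte[] data, final byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(final String[] args) throws IOException {
        // Prints one "path: guessed type" line per file given on the command line.
        for (final String arg : args) {
            System.out.println(arg + ": " + sniff(Paths.get(arg)));
        }
    }
}
```

Running this on a `.doc` and a `.xls` saved in the old binary formats should give the same answer for both, which is the gap that inspecting the file contents is meant to close.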
Also, don't make the same mistake I did at first: if you get the tika-core JAR, you should also get the tika-parsers JAR. Without tika-parsers you won't get any exceptions; the MIME type will simply be detected less accurately, so it is really important to include it.
An alternative is to get the tika-app JAR, which contains tika-core, tika-parsers, and all of their dependencies (there are a lot: poi, poi-ooxml, xmlbeans, and commons-compress, just to name a few).