I'm adding some functionality to a program so that I can accurately determine a file's type by reading its MIME data. I've already tried a few methods:
Method 1:
I'm not entirely sure how accurate it is, but this worked for me in simple cases.
FileNameMap fileNameMap = URLConnection.getFileNameMap();
String type = fileNameMap.getContentTypeFor(filePath);
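To show why this only holds up in simple cases, here is a minimal sketch of the same call wrapped in a helper. The class name `ExtensionLookup` and the method `guessFromName` are made up for illustration. The lookup works from the file name alone, using the JDK's built-in extension table, so the file's contents are never read.

```java
import java.net.FileNameMap;
import java.net.URLConnection;

final class ExtensionLookup {
    // Hypothetical helper: returns the MIME type the JDK associates with the
    // name's extension, or null when the extension is not in its table.
    static String guessFromName(String fileName) {
        FileNameMap fileNameMap = URLConnection.getFileNameMap();
        return fileNameMap.getContentTypeFor(fileName);
    }

    public static void main(String[] args) {
        // Neither file needs to exist: only the names are inspected.
        System.out.println(guessFromName("page.html"));
        // A spreadsheet renamed to .doc is judged by its extension alone.
        System.out.println(guessFromName("renamed-spreadsheet.doc"));
    }
}
```

Because only the name is consulted, a wrongly named file gets whatever type its extension suggests, which is the main limit of this approach.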
As mentioned in the comments, there are so many possible file types that this can be hit and miss across ALL possible files, but you probably know the types of files you will typically be dealing with. This excellent list of magic numbers has helped me with detection recently around the specific Office formats you mentioned (search for Microsoft Office): you'll see that the MS Office file types have a sub-type specified further into the file, which lets you work out specifically which type of file you have. Many newer formats such as ODT and the OOXML formats (DOCX etc.) use a ZIP file to hold their data, so you might need to detect ZIP first, then look for specifics.
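The "detect ZIP first, then look for specifics" idea could be sketched roughly like this. It is only an illustration: the class `MagicSniffer`, its method names and the returned labels are my own, and the entry names checked inside the archive are the conventional locations for these formats, not a guarantee for every file.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class MagicSniffer {
    // "PK\003\004": local file header signature at the start of a ZIP archive.
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};
    // Signature of the OLE2 compound file used by legacy .doc/.xls/.ppt files.
    private static final byte[] OLE2_MAGIC = {(byte) 0xD0, (byte) 0xCF, 0x11,
            (byte) 0xE0, (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};

    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) return false;
        }
        return true;
    }

    // Classifies the container from the first bytes of the file.
    static String sniff(byte[] header) {
        if (startsWith(header, ZIP_MAGIC)) return "zip";
        if (startsWith(header, OLE2_MAGIC)) return "ole2";
        return "unknown";
    }

    // Reads up to the first 8 bytes of the file.
    static byte[] readHeader(File file) throws IOException {
        byte[] buf = new byte[8];
        int total = 0;
        try (InputStream in = new FileInputStream(file)) {
            int n;
            while (total < buf.length
                    && (n = in.read(buf, total, buf.length - total)) != -1) {
                total += n;
            }
        }
        return Arrays.copyOf(buf, total);
    }

    // Second pass for ZIP containers: look for well-known entries.
    static String zipSubtype(File file) throws IOException {
        try (ZipFile zip = new ZipFile(file)) {
            if (zip.getEntry("word/document.xml") != null) return "docx";
            if (zip.getEntry("xl/workbook.xml") != null) return "xlsx";
            if (zip.getEntry("ppt/presentation.xml") != null) return "pptx";
            // OpenDocument files carry their MIME type in a "mimetype" entry.
            ZipEntry mimetype = zip.getEntry("mimetype");
            if (mimetype != null) {
                try (InputStream in = zip.getInputStream(mimetype)) {
                    ByteArrayOutputStream out = new ByteArrayOutputStream();
                    byte[] chunk = new byte[256];
                    int n;
                    while ((n = in.read(chunk)) != -1) out.write(chunk, 0, n);
                    return out.toString("US-ASCII").trim();
                }
            }
            return "zip";
        }
    }
}
```

You would call `sniff(readHeader(file))` first and only open the archive with `zipSubtype(file)` when the answer is "zip". The OLE2 case would need its own second pass to find the sub-type further into the file, which I have left out here.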
So far, the most accurate tool I've found to determine a file's MIME type is Apache Tika. This is a slight modification of what I currently use (with Tika version 1.0):
import java.io.File;
import java.io.IOException;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MimeTypes;
private static final Detector DETECTOR = new DefaultDetector(
MimeTypes.getDefaultMimeTypes());
public static String detectMimeType(final File file) throws IOException {
TikaInputStream tikaIS = null;
try {
tikaIS = TikaInputStream.get(file);
/*
* You might not want to provide the file's name. If you provide an Excel
* document with a .xls extension, it will get it correct right away; but
* if you provide an Excel document with .doc extension, it will guess it
* to be a Word document
*/
final Metadata metadata = new Metadata();
// metadata.set(Metadata.RESOURCE_NAME_KEY, file.getName());
return DETECTOR.detect(tikaIS, metadata).toString();
} finally {
if (tikaIS != null) {
tikaIS.close();
}
}
}
Since Tika uses magic numbers but also looks at the contents of files when unsure, the process can be somewhat slow (it took 3.268 seconds on my PC to examine 15 files).
Also, don't make the same mistake I did at first. If you get the tika-core JAR, you should also get the tika-parsers JAR. Without tika-parsers you won't get any exceptions; the MIME type will simply not be detected accurately, so it is REALLY important to include it.
An alternative is to get the tika-app JAR, which contains tika-core, tika-parsers and all of their dependencies (there are a lot: poi, poi-ooxml, xmlbeans, commons-compress, just to name a few).