How to accurately determine mime data from a file?


谎友^ 2020-12-05 05:38

I'm adding some functionality to a program so that I can accurately determine a file's type by reading its MIME data. I've already tried a few methods:

Method 1:

3 answers

刺人心 (OP)

2020-12-05 06:28

So far, the most accurate tool I've found for determining a file's MIME type is Apache Tika. This is a slight modification of what I currently use (with Tika version 1.0):

import java.io.File;
import java.io.IOException;

import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MimeTypes;

// The class name is illustrative; put the method wherever it fits your code.
public final class MimeTypeDetector {

    // Shared detector built from Tika's default MIME type registry.
    private static final Detector DETECTOR = new DefaultDetector(
            MimeTypes.getDefaultMimeTypes());

    public static String detectMimeType(final File file) throws IOException {
        // try-with-resources (Java 7+) closes the stream even if detection throws.
        try (TikaInputStream tikaIS = TikaInputStream.get(file)) {
            /*
             * You might not want to provide the file's name. If you provide an Excel
             * document with a .xls extension, it will get it correct right away; but
             * if you provide an Excel document with a .doc extension, it will guess
             * it to be a Word document.
             */
            final Metadata metadata = new Metadata();
            // metadata.set(Metadata.RESOURCE_NAME_KEY, file.getName());

            return DETECTOR.detect(tikaIS, metadata).toString();
        }
    }
}

Because Tika checks magic numbers and, when it is unsure, also inspects the file's contents, detection can be a little slow (it took my PC 3.268 seconds to examine 15 files).
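
To make "magic numbers" concrete, here is a minimal sketch of signature-based sniffing in plain Java. It is not how Tika is implemented; the class name MagicSniffer and its methods are made up for illustration, and it only knows three signatures. It assumes Java 9+ for readNBytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper, for illustration only: matches a file's first bytes
// against a handful of well-known signatures ("magic numbers").
final class MagicSniffer {

    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
    private static final byte[] PDF = {'%', 'P', 'D', 'F', '-'};
    private static final byte[] ZIP = {'P', 'K', 0x03, 0x04};

    private MagicSniffer() {
    }

    // True if data begins with every byte of prefix, in order.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Guess a MIME type from the leading bytes of a file.
    static String sniff(byte[] header) {
        if (startsWith(header, PNG)) {
            return "image/png";
        }
        if (startsWith(header, PDF)) {
            return "application/pdf";
        }
        if (startsWith(header, ZIP)) {
            // Ambiguous: .docx, .xlsx and .jar files are ZIP containers too,
            // so the signature alone cannot tell them apart.
            return "application/zip";
        }
        return "application/octet-stream";
    }

    // Read at most the first 8 bytes of the file and sniff them.
    static String sniff(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8];
            int read = in.readNBytes(buffer, 0, buffer.length);
            return sniff(Arrays.copyOf(buffer, read));
        }
    }
}
```

The ZIP case shows why a signature check alone is not enough: telling a .docx from a plain archive means opening the container and looking inside, which is the slower content inspection mentioned above.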

Also, don't make the same mistake I did at first. If you get the tika-core JAR, you should also get the tika-parsers JAR. Without tika-parsers you won't get any exceptions; the MIME type will simply be detected less accurately, so it is really important to include it.

An alternative is to get the tika-app JAR, which contains tika-core, tika-parsers and all of their dependencies (there are a lot: poi, poi-ooxml, xmlbeans and commons-compress, just to name a few).
