How to accurately determine mime data from a file?

谎友^ 2020-12-05 05:38

I'm adding some functionality to a program so that I can accurately determine a file's type by reading its MIME data. I've already tried a few methods:

Method 1:

3 answers
  •  刺人心
    刺人心 (original poster)
    2020-12-05 06:28

    So far, the most accurate tool I've found for determining a file's MIME type is Apache Tika. This is a slight modification of what I currently use (with Tika version 1.0):

    import java.io.File;
    import java.io.IOException;

    import org.apache.tika.detect.DefaultDetector;
    import org.apache.tika.detect.Detector;
    import org.apache.tika.io.TikaInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.mime.MimeTypes;
    
    private static final Detector DETECTOR = new DefaultDetector(
            MimeTypes.getDefaultMimeTypes());
    
    public static String detectMimeType(final File file) throws IOException {
        TikaInputStream tikaIS = null;
        try {
            tikaIS = TikaInputStream.get(file);
    
            /*
             * You might not want to provide the file's name. If you provide an Excel
             * document with a .xls extension, it will get it correct right away; but
             * if you provide an Excel document with .doc extension, it will guess it
             * to be a Word document
             */
            final Metadata metadata = new Metadata();
            // metadata.set(Metadata.RESOURCE_NAME_KEY, file.getName());
    
            return DETECTOR.detect(tikaIS, metadata).toString();
        } finally {
            if (tikaIS != null) {
                tikaIS.close();
            }
        }
    }
    

    Since Tika uses magic numbers but also inspects the contents of a file when it is unsure, detection can be somewhat slow (examining 15 files took 3.268 seconds on my PC).
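
    To make the "magic numbers" part concrete, here is a minimal sketch of signature sniffing in plain Java, without Tika. It is not how Tika is implemented; the class name `MagicNumberSketch`, the method names, and the short signature table are assumptions made up for illustration. It also suggests why a second, slower pass over the contents can be needed: modern Office files are ZIP containers and legacy .doc/.xls files share one OLE2 signature, so the leading bytes alone cannot tell Word from Excel.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Tika.
class MagicNumberSketch {
    private static final byte[] PDF = {'%', 'P', 'D', 'F', '-'};
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
    // ZIP local file header; also the start of .docx/.xlsx/.jar files.
    private static final byte[] ZIP = {'P', 'K', 0x03, 0x04};
    // OLE2 compound document; shared by legacy .doc, .xls and .ppt files.
    private static final byte[] OLE2 = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
                                        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};

    // Guess a type from the leading bytes; the returned labels are illustrative.
    static String sniff(byte[] head) {
        if (startsWith(head, PDF)) return "application/pdf";
        if (startsWith(head, PNG)) return "image/png";
        if (startsWith(head, ZIP)) return "application/zip";
        if (startsWith(head, OLE2)) return "application/x-ole-storage";
        return "application/octet-stream"; // unknown signature
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) return false;
        }
        return true;
    }

    // Read at most the first 8 bytes of a file and sniff them.
    static String sniffFile(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = new byte[8];
            int total = 0;
            while (total < head.length) {
                int n = in.read(head, total, head.length - total);
                if (n < 0) break; // end of file reached early
                total += n;
            }
            return sniff(Arrays.copyOf(head, total));
        }
    }
}
```

    Reading a handful of bytes is cheap; opening the container to tell a spreadsheet from a text document is where the extra time would go.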

    Also, don't make the mistake I made at first: if you get the tika-core JAR, get the tika-parsers JAR as well. Without tika-parsers you won't get any exceptions; the MIME type will simply be detected less accurately, so it is really important to include it.
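
    Because a missing tika-parsers JAR fails silently, one option is to check for it at startup. The sketch below is only a suggestion: the class `TikaParsersCheck` is hypothetical, and the probe class name `org.apache.tika.parser.microsoft.OfficeParser` is an assumption about what tika-parsers ships, so verify it against the version you actually use.

```java
// Hypothetical startup check; not part of Tika.
class TikaParsersCheck {
    // Assumed to live in the tika-parsers JAR; adjust for your Tika version.
    static final String PROBE_CLASS = "org.apache.tika.parser.microsoft.OfficeParser";

    // Returns true if the named class can be located on the classpath.
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, TikaParsersCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        if (!isOnClasspath(PROBE_CLASS)) {
            System.err.println("Warning: tika-parsers does not appear to be on the "
                    + "classpath; MIME detection may be less accurate.");
        }
    }
}
```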

    An alternative is the tika-app JAR, which bundles tika-core, tika-parsers, and all of their dependencies (there are many: poi, poi-ooxml, xmlbeans, and commons-compress, to name a few).
