Getting MimeType subtype with Apache Tika


爱一瞬间的悲伤 2020-12-29 10:35

I'd need to get the iana.org MediaType rather than application/zip or application/x-tika-msoffice for documents such as odt, ppt, pptx, xlsx, etc.

If you look at mim

4 answers

爱一瞬间的悲伤 (original poster)

2020-12-29 11:16

For anyone else having a similar problem but using a newer Tika version, this should do the trick:

Use ZipContainerDetector, since newer versions may no longer have ContainerAwareDetector.
Pass a TikaInputStream to the detector's detect() method so that Tika can look inside the container and determine the correct MIME type.

My example code looks like this:

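// Imports (top of the file). Package names assume Tika 1.x and log4j 1.2;
// adjust them to the versions you actually use.
import java.util.ArrayList;
import java.util.List;

import org.apache.log4j.LogMF;
import org.apache.tika.detect.CompositeDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MimeTypes;
import org.apache.tika.parser.microsoft.POIFSContainerDetector;
import org.apache.tika.parser.pkg.ZipContainerDetector;

// Created lazily on first use and shared afterwards; `log` is the class's log4j Logger.
private static Detector detector;

public static String getMimeType(final Document p_document)
{
    try
    {
        Metadata metadata = new Metadata();
        // The file name serves as an additional hint for detection.
        metadata.add(Metadata.RESOURCE_NAME_KEY, p_document.getDocName());

        Detector detector = getDefaultDetector();

        LogMF.debug(log, "Trying to detect mime type with detector {0}.", detector);

        // A TikaInputStream lets the container detectors look inside the file;
        // try-with-resources makes sure it is closed again.
        try (TikaInputStream inputStream = TikaInputStream.get(p_document.getData(), metadata))
        {
            return detector.detect(inputStream, metadata).toString();
        }
    }
    catch (Throwable t)
    {
        // Pass the throwable along so the cause is not lost.
        log.error("Error while determining mime-type of " + p_document, t);
    }

    return null;
}

private static Detector getDefaultDetector()
{
    if (detector == null)
    {
        List<Detector> detectors = new ArrayList<>();

        // zip compressed container types
        detectors.add(new ZipContainerDetector());
        // Microsoft stuff
        detectors.add(new POIFSContainerDetector());
        // mime magic detection as fallback
        detectors.add(MimeTypes.getDefaultMimeTypes());

        detector = new CompositeDetector(detectors);
    }

    return detector;
}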

Note that the Document class is part of my domain model, so you will need to substitute your own equivalent wherever it is used.

I hope that someone can use this.
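
As background on why plain magic-byte detection stops at application/zip: formats such as odt are ZIP archives, so their first bytes only say "this is a ZIP file". A container-aware detector has to open the archive and look at its entries. The sketch below illustrates that idea with nothing but the JDK (Java 9+ for readAllBytes). It is not Tika's code, and the class and method names are made up for this example. It also simplifies things: a real ODF file stores the mimetype entry first and uncompressed, and OOXML files (pptx, xlsx) carry their type in a different entry, which this sketch does not handle.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class, for illustration only.
class ZipMimeTypeSketch {

    /** True if the data starts with the ZIP local file header magic "PK\3\4". */
    static boolean looksLikeZip(byte[] data) {
        return data.length >= 4
                && data[0] == 'P' && data[1] == 'K' && data[2] == 3 && data[3] == 4;
    }

    /** Returns the content of the "mimetype" entry, or null if the archive has none. */
    static String readDeclaredMimeType(byte[] data) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("mimetype".equals(entry.getName())) {
                    // readAllBytes() stops at the end of the current entry.
                    return new String(zip.readAllBytes(), StandardCharsets.US_ASCII).trim();
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a tiny in-memory archive that mimics the layout of an .odt file. */
    static byte[] buildFakeOdt() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("mimetype"));
            zip.write("application/vnd.oasis.opendocument.text"
                    .getBytes(StandardCharsets.US_ASCII));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("content.xml"));
            zip.write("<office:document-content/>".getBytes(StandardCharsets.US_ASCII));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = buildFakeOdt();
        System.out.println(looksLikeZip(data));         // prints "true"
        System.out.println(readDeclaredMimeType(data)); // prints "application/vnd.oasis.opendocument.text"
    }
}
```

The magic bytes alone can only justify application/zip; the specific subtype becomes visible once the entries are read. That is why the detector needs access to the whole container rather than just the first few bytes of the stream.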
