Getting MimeType subtype with Apache Tika

爱一瞬间的悲伤 2020-12-29 10:35

I need to get the iana.org MediaType rather than application/zip or application/x-tika-msoffice for documents such as odt, ppt, pptx, xlsx etc.

If you look at mim

4 answers
  •  暖寄归人
    2020-12-29 11:16

    Originally, Tika only supported detection by MIME magic or by file extension (glob), as that is all that most MIME detection before Tika did.
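
    To make the two classic approaches concrete, here is a minimal sketch in plain Java. It is not Tika's code: the class name MagicGlobSketch, its methods and the tiny extension table are made up for illustration. It shows why magic alone can only report application/zip for an odt or docx file, while the glob result depends entirely on the file name.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Map;

/** Illustrative only: a toy magic + glob detector, not Tika's implementation. */
final class MagicGlobSketch {

    // Hypothetical, deliberately tiny glob table (extension -> media type).
    private static final Map<String, String> GLOBS = Map.of(
            "odt", "application/vnd.oasis.opendocument.text",
            "docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
            "zip", "application/zip");

    /** Magic detection: look only at the leading bytes of the content. */
    static String byMagic(byte[] head) {
        // Zip-based files (odt, docx, xlsx, ...) normally start with the
        // local file header signature 'P' 'K' 0x03 0x04, so they all look alike here.
        if (head.length >= 4 && head[0] == 'P' && head[1] == 'K' && head[2] == 3 && head[3] == 4) {
            return "application/zip";
        }
        if (head.length >= 5 && new String(head, 0, 5, StandardCharsets.US_ASCII).equals("%PDF-")) {
            return "application/pdf";
        }
        return "application/octet-stream";
    }

    /** Glob detection: trust the file name extension and ignore the content. */
    static String byGlob(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return GLOBS.getOrDefault(ext, "application/octet-stream");
    }

    public static void main(String[] args) {
        byte[] zipHeader = {'P', 'K', 3, 4};
        System.out.println("magic: " + byMagic(zipHeader));
        System.out.println("glob:  " + byGlob("report.odt"));
    }
}
```

    The glob lookup gives the specific type only as long as the file name is present and honest, which is the gap the container detectors described next were meant to close.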

    Because of the problems with Mime Magic and globs when it comes to detecting container formats, it was decided to add some new detectors to Tika to handle these. The Container Aware Detectors took the whole file, opened and processed the container, and then worked out the exact file type based on the contents. Initially, you needed to call them explicitly, but then they were wrapped up in ContainerAwareDetector which you'll see in some of the answers.
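
    The idea of opening the container can be sketched with nothing but java.util.zip. This is an assumption-laden illustration, not how Tika's detectors are written: ContainerPeekSketch and its methods are hypothetical names, and the OOXML branches rely on the conventional main part names rather than reading [Content_Types].xml as a real detector would.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Illustrative only: peeks inside a zip container; not Tika's actual detector. */
final class ContainerPeekSketch {

    /** Walks the zip entries and guesses a specific type from what is inside. */
    static String detect(byte[] zipBytes) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                // OpenDocument files carry their media type in a plain-text "mimetype" entry.
                if (name.equals("mimetype")) {
                    return new String(zip.readAllBytes(), StandardCharsets.US_ASCII).trim();
                }
                // Simplification: conventional OOXML main part names. This cannot
                // tell docx from docm, for example.
                if (name.equals("word/document.xml")) {
                    return "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
                }
                if (name.equals("xl/workbook.xml")) {
                    return "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";
                }
                if (name.equals("ppt/presentation.xml")) {
                    return "application/vnd.openxmlformats-officedocument.presentationml.presentation";
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Nothing recognisable inside: fall back to the generic container type.
        return "application/zip";
    }

    /** Builds a one-entry zip in memory so the sketch runs without sample files. */
    static byte[] zipOf(String entryName, String content) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write(content.getBytes(StandardCharsets.US_ASCII));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] fakeOdt = zipOf("mimetype", "application/vnd.oasis.opendocument.text");
        System.out.println(detect(fakeOdt));
    }
}
```

    Note that this has to read through the archive rather than just its first few bytes, which is why the container detectors want the whole file rather than a short header.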

    Since then, Tika has added a service loader pattern, initially for Parsers. This allowed classes to be auto-loaded when present, with a general way to identify which ones were appropriate and use those. This support was then extended to cover Detectors too, at which point the old ContainerAwareDetector could be removed in favour of something cleaner.

    If you're on Tika 1.2 or later, and you want accurate detection of all formats, including container formats, you want to do something like:

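     import org.apache.tika.config.TikaConfig;
     import org.apache.tika.detect.Detector;
     import org.apache.tika.io.TikaInputStream;
     import org.apache.tika.metadata.Metadata;
     import org.apache.tika.mime.MediaType;

     TikaConfig config = TikaConfig.getDefaultConfig();
     Detector detector = config.getDetector();

     // fileOrStream is a File or an InputStream; close the stream when done
     TikaInputStream stream = TikaInputStream.get(fileOrStream);

     // The file name is an optional hint used for glob (extension) matching
     Metadata metadata = new Metadata();
     metadata.add(Metadata.RESOURCE_NAME_KEY, filenameWithExtension);
     MediaType mediaType = detector.detect(stream, metadata);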
    

    If you run this with only the Core Tika jar (tika-core-1.2-....), then the only detector present will be the MIME magic one, and you'll get the old-style detection based on magic + glob only. However, if you run this with both the Core and Parser Tika jars (plus their dependencies), or from Tika App (which bundles core + parsers + dependencies), then the DefaultDetector will also use all the Container Detectors that are available to process your file. If your file is zip-based, detection will include processing the zip structure to identify the file type from what's in there. This will give you the high-accuracy detection you're after, without needing to call lots of different parsers in turn.
