Define a MIME type for .TXT files for Tika

Posted by 浪尽此生 on 2019-12-13 20:15:21

Question


I want to define a dedicated MIME type, text/txt, for *.txt files so that Tika can apply a more specific parser than the one it uses for text/plain files.

The glob *.txt is already part of the definition of the text/plain type in tika-mimetypes.xml. Moreover, it seems to me that you cannot redefine an existing MIME type in custom-mimetypes.xml; you can only add new globs or magic patterns. Additionally, if I define the text/txt type in tika-mimetypes.xml as a subtype of text/plain with only the glob *.txt, Tika still detects a .txt file as text/plain.

Is it absurd to define a subtype of text/plain only for .txt files? If not, is it possible to define it using only custom-mimetypes.xml? If not, what is the easiest way to extend Tika so that it can parse .txt files differently from (let's say) STEP 3D CAD .stp files or .cfg files?

The use case in detail: I have a large data source composed of (recursive) archives. Some of the plain text files are huge and I don't want Tika to parse them. However, I want to keep all the .txt files.

Edit: to clarify, I don't want to keep .cfg files either (*.cfg is also a glob of text/plain).
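
To make the intended rule concrete, here is a minimal sketch of the decision described above: keep .txt files, skip other text/plain candidates such as .cfg or .stp. It is plain Java and deliberately does not call Tika; the class name `TxtOnlyFilter` and its methods are hypothetical, and deciding by file extension alone is an assumption made only for illustration.

```java
import java.util.Locale;

// Hypothetical helper, not part of Tika: decides from the file name alone
// whether an archive entry should be handed to a text parser.
final class TxtOnlyFilter {

    private TxtOnlyFilter() {}

    // Returns the lower-cased extension without the dot, or "" if there is none.
    static String extensionOf(String fileName) {
        // Strip any directory part so dots in folder names are ignored.
        int slash = Math.max(fileName.lastIndexOf('/'), fileName.lastIndexOf('\\'));
        String base = fileName.substring(slash + 1);
        int dot = base.lastIndexOf('.');
        if (dot < 0 || dot == base.length() - 1) {
            return "";
        }
        return base.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // Assumption: only *.txt entries are kept; *.cfg, *.stp and everything
    // else that would otherwise be detected as text/plain is skipped.
    static boolean shouldParse(String fileName) {
        return "txt".equals(extensionOf(fileName));
    }
}
```

A check like `TxtOnlyFilter.shouldParse(entryName)` could run before an entry is passed to a parser. This sidesteps MIME detection entirely, so it does not answer whether text/txt can be defined through custom-mimetypes.xml.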

Source: https://stackoverflow.com/questions/48411421/define-a-mime-type-for-txt-files-for-tika
