How to parse octet-stream files using Apache Tika?

我的梦境 提交于 2019-12-12 03:36:28

问题


I have stored all different types of files on Azure Blob storage, files are txt, doc, pdf,etc. However all the files are stored as 'octet-stream' there and when I open the files to extract the text from them using Tika, Tika cann't detect the character encoding. How can I get around this problem?

FileSystem fs = FileSystem.get(new Configuration());            
Path pt = new Path(Configs.BLOBSTORAGEPREFIX+fileAdd);          
InputStream stream = fs.open(pt);           


AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();   

parser.parse(stream, handler, metadata);       


spaceContentBuffer.append(handler.toString());

回答1:


If you are calling Azure Storage REST API directly, you can set header "x-ms-blob-content-type" via API Set Blob Properties.

If you are using Azure Storage Client Library, you can write similar code as below:

blockBlob.Properties.ContentType = "text/xml";
blockBlob.SetProperties();


来源:https://stackoverflow.com/questions/31011257/how-to-parse-octet-stream-files-using-apache-tika

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!