apache-tika

How to properly configure Apache Tika for a few document types?

和自甴很熟 submitted on 2019-12-23 03:13:11
Question: I've been using Tika for a while and I know that one is supposed to use the Tika facade with either the default or a custom TikaConfig, which represents the org/apache/tika/mime/tika-mimetypes.xml file. My application doesn't allow any document type other than html, doc, docx, odt, txt, rtf, srt, sub, pdf, odf, odp, xls, ppt and msg, while the default MediaTypes include tons of others. Are we supposed to modify tika-mimetypes.xml to remove the MimeTypes we don't need? Then, as I understand it, it will create…
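
A minimal sketch of an alternative that avoids editing tika-mimetypes.xml: keep the default detector and reject unwanted types after detection. The whitelist entries and the helper name below are illustrative, not taken from the question.

    import org.apache.tika.Tika;
    import java.io.File;
    import java.io.IOException;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class TypeWhitelist {
        // Illustrative whitelist; adjust the MIME strings to the formats you actually accept.
        private static final Set<String> ALLOWED = new HashSet<>(Arrays.asList(
                "text/html",
                "application/msword",
                "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
                "application/vnd.oasis.opendocument.text",
                "text/plain",
                "application/rtf",
                "application/pdf",
                "application/vnd.ms-excel",
                "application/vnd.ms-powerpoint",
                "application/vnd.ms-outlook"));

        public static boolean isAllowed(File f) throws IOException {
            Tika tika = new Tika();            // default TikaConfig, untouched tika-mimetypes.xml
            String detected = tika.detect(f);  // MIME detection only, no full parse
            return ALLOWED.contains(detected);
        }
    }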

Searching attachments from a Rails app (Word, PDF, Excel etc)

半世苍凉 submitted on 2019-12-22 08:51:57
Question: My first post to Stack Overflow, so please be gentle! I am about to start a new Ruby on Rails (3.1) project for a client. One of their requirements is a search engine that will index roughly 2,000 documents, which are a mixture of PDF, Word, Excel and HTML. I had hoped to use either thinking-sphinx or Texticle (the most popular options at https://www.ruby-toolbox.com/categories/rails_search.html), but as I understand it: Texticle requires PostgreSQL, and I'm on MySQL; thinking-sphinx doesn…

get embedded resources in doc files using apache tika

荒凉一梦 submitted on 2019-12-22 08:27:09
Question: I have MS Word documents containing text and images. I want to parse them into an XML structure. After researching, I ended up using Apache Tika to convert my documents, and I can parse a doc to XML. Here is my code:

    AutoDetectParser parser = new AutoDetectParser();
    InputStream input = new FileInputStream(new File("1.docx"));
    Metadata metadata = new Metadata();
    StringWriter sw = new StringWriter();
    SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
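
To also pull out the embedded images rather than only the text, one option (a sketch, not taken from the question; the output file naming is made up) is to register an EmbeddedDocumentExtractor in the ParseContext so that Tika hands every embedded resource to your own code:

    import org.apache.tika.extractor.EmbeddedDocumentExtractor;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.ContentHandler;

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    public class EmbeddedResources {
        public static void main(String[] args) throws Exception {
            AutoDetectParser parser = new AutoDetectParser();
            ParseContext context = new ParseContext();
            context.set(EmbeddedDocumentExtractor.class, new EmbeddedDocumentExtractor() {
                private int count = 0;

                @Override
                public boolean shouldParseEmbedded(Metadata metadata) {
                    return true;  // accept every embedded resource
                }

                @Override
                public void parseEmbedded(InputStream stream, ContentHandler handler,
                                          Metadata metadata, boolean outputHtml) throws IOException {
                    // Illustrative naming scheme; the original resource name, if any, is in the metadata.
                    Path out = Paths.get("embedded-" + (count++));
                    Files.copy(stream, out, StandardCopyOption.REPLACE_EXISTING);
                }
            });
            try (InputStream input = new FileInputStream("1.docx")) {
                parser.parse(input, new BodyContentHandler(-1), new Metadata(), context);
            }
        }
    }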

Handle ligatures in Apache Tika

痞子三分冷 submitted on 2019-12-22 05:15:45
Question: Tika doesn't seem to recognize ligatures (fi, ff, fl...) in PDF files and replaces them with question marks. Any idea (not only with Tika) how to extract PDF text while converting ligatures to separate characters?

    File file = new File("path/to/file.pdf");
    String text = new Tika().parseToString(file);

Edit: My PDF file is UTF-8 encoded (that's what InputStream.getEncoding() says), and my platform encoding is also UTF-8. Even with -Dfile.encoding=UTF8 it is not working. For instance, I'm…
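
If the ligatures actually survive extraction as their Unicode compatibility code points (U+FB00–U+FB06) rather than being lost as '?', a post-processing pass with NFKC normalization splits them into plain letters. This sketch shows only that normalization step; it cannot recover ligatures whose glyphs the PDF font maps to no usable Unicode value at all.

    import java.text.Normalizer;

    public class LigatureFix {
        public static void main(String[] args) {
            // "efficient file" written with the ffi (U+FB03) and fi (U+FB01) ligature code points
            String extracted = "e\uFB03cient \uFB01le";
            String plain = Normalizer.normalize(extracted, Normalizer.Form.NFKC);
            System.out.println(plain);  // prints: efficient file
        }
    }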

java.lang.IllegalArgumentException: protocol = http host = null

自古美人都是妖i submitted on 2019-12-22 03:58:54
Question: For the link http://bits.blogs.nytimes.com/2014/09/02/uber-banned-across-germany-by-frankfurt-court/?partner=rss&emc=rss this code doesn't work, but if I use another one, for example https://www.google.com, everything is OK:

    URL url = new URL("http://bits.blogs.nytimes.com/2014/09/02/uber-banned-across-germany-by-frankfurt-court/?partner=rss&emc=rss");
    URLConnection uc;
    uc = url.openConnection();
    uc.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_7; en-US)
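
The excerpt cuts off before the failing call, so the root cause isn't visible here. One neutral first step (an assumption, not the accepted answer) is to turn off automatic redirect handling and look at the status code and Location header the server sends back, since building a follow-up URL from a malformed or relative Location value is one common way to end up with a null host:

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class RedirectProbe {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://bits.blogs.nytimes.com/2014/09/02/uber-banned-across-germany-by-frankfurt-court/?partner=rss&emc=rss");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty("User-Agent", "Mozilla/5.0");
            conn.setInstanceFollowRedirects(false);  // inspect the redirect instead of following it
            System.out.println("Status: " + conn.getResponseCode());
            System.out.println("Location: " + conn.getHeaderField("Location"));
        }
    }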

How to read large files using Tika?

穿精又带淫゛_ submitted on 2019-12-21 07:12:01
Question: I'm parsing large PDF and Word documents using Tika, but I get the following error message: "Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available)." How can I increase the limit?
Answer 1: Assuming you're basically following the Tika example for extracting to plain text, all you need to do is create your BodyContentHandler with a write limit of…
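
Completing the cut-off answer with a minimal sketch (the file name is illustrative): the write limit is the int passed to the BodyContentHandler constructor, and -1 disables the default 100,000-character cap entirely.

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    import java.io.FileInputStream;
    import java.io.InputStream;

    public class LargeFileText {
        public static void main(String[] args) throws Exception {
            BodyContentHandler handler = new BodyContentHandler(-1);  // -1 = no write limit
            AutoDetectParser parser = new AutoDetectParser();
            Metadata metadata = new Metadata();
            try (InputStream stream = new FileInputStream("large.pdf")) {
                parser.parse(stream, handler, metadata);
                System.out.println(handler.toString());
            }
        }
    }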

how to extract main text from html using Tika

假装没事ソ submitted on 2019-12-21 05:11:26
Question: I just want to know how I can extract the main text and plain text from HTML using Tika. Maybe one possible solution is to use BoilerpipeContentHandler, but do you have some sample/demo code to show it? Thanks very much in advance.
Answer 1: Here is a sample:

    public String[] tika_autoParser() {
        String[] result = new String[3];
        try {
            InputStream input = new FileInputStream(new File("/Users/nazanin/Books/Web crawler.pdf"));
            ContentHandler textHandler = new BodyContentHandler();
            Metadata metadata =
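
A sketch of the Boilerpipe route (assuming Tika 1.x, where the class lives in org.apache.tika.parser.html and needs the tika-parsers artifact on the classpath; the file name is illustrative): wrap a Writer in a BoilerpipeContentHandler so only the main article text is kept, with navigation and other boilerplate removed.

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.html.BoilerpipeContentHandler;
    import org.apache.tika.parser.html.HtmlParser;

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.io.StringWriter;

    public class MainTextExtractor {
        public static void main(String[] args) throws Exception {
            StringWriter writer = new StringWriter();
            BoilerpipeContentHandler handler = new BoilerpipeContentHandler(writer);
            try (InputStream input = new FileInputStream("page.html")) {
                new HtmlParser().parse(input, handler, new Metadata(), new ParseContext());
            }
            System.out.println(writer.toString());  // main text only, boilerplate stripped
        }
    }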

How can I use the HTML parser with Apache Tika in Java to extract all HTML tags?

孤街浪徒 submitted on 2019-12-20 10:10:52
Question: I downloaded the tika-core and tika-parsers libraries, but I could not find example code for parsing HTML documents to a string. I need to get rid of all HTML tags in the source of a web page. What can I do? How do I code that using Apache Tika?
Answer 1: Do you want a plain-text version of an HTML file? If so, all you need is something like:

    InputStream input = new FileInputStream("myfile.html");
    ContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    new HtmlParser().parse
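
The answer is cut off at the parse call; a completed version of the same idea (a sketch) looks like this:

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.html.HtmlParser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.ContentHandler;

    import java.io.FileInputStream;
    import java.io.InputStream;

    public class HtmlToPlainText {
        public static void main(String[] args) throws Exception {
            try (InputStream input = new FileInputStream("myfile.html")) {
                ContentHandler handler = new BodyContentHandler();
                Metadata metadata = new Metadata();
                new HtmlParser().parse(input, handler, metadata, new ParseContext());
                System.out.println(handler.toString());  // body text with all HTML tags removed
            }
        }
    }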

Apache Tika how to extract html body without header and footer content

旧巷老猫 submitted on 2019-12-20 03:29:27
Question: I am looking to extract the entire body content of an HTML page except the header and footer; however, I am getting the exception org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared. Below is the code I have created, as mentioned at

    import org.apache.tika.exception.TikaException;
    import org.apache.tika.io.TikaInputStream;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.apache.tika.sax.ToHTMLContentHandler;
    import org.apache.tika
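
The excerpt stops at the imports, so the failing handler setup isn't shown. One way (a sketch, not the accepted answer) to select just the <body> subtree of Tika's XHTML output is the XPath matching support in org.apache.tika.sax.xpath; using a plain-text delegate here also sidesteps the XHTML namespace serialization that the quoted SAXException complains about. Skipping <header> and <footer> elements specifically would still need an extra filtering ContentHandler on top of this.

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.ToTextContentHandler;
    import org.apache.tika.sax.XHTMLContentHandler;
    import org.apache.tika.sax.xpath.Matcher;
    import org.apache.tika.sax.xpath.MatchingContentHandler;
    import org.apache.tika.sax.xpath.XPathParser;
    import org.xml.sax.ContentHandler;

    import java.io.FileInputStream;
    import java.io.InputStream;

    public class BodyOnly {
        public static void main(String[] args) throws Exception {
            // Match everything underneath <body> in the XHTML that Tika emits.
            XPathParser xhtmlParser = new XPathParser("xhtml", XHTMLContentHandler.XHTML);
            Matcher bodyMatcher = xhtmlParser.parse("/xhtml:html/xhtml:body/descendant::node()");
            ToTextContentHandler textHandler = new ToTextContentHandler();
            ContentHandler handler = new MatchingContentHandler(textHandler, bodyMatcher);

            try (InputStream stream = new FileInputStream("page.html")) {
                new AutoDetectParser().parse(stream, handler, new Metadata(), new ParseContext());
            }
            System.out.println(textHandler.toString());
        }
    }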

“java.lang.SecurityException: Prohibited package name: java.sql” error happens only when executing outside of Eclipse

北城余情 submitted on 2019-12-20 01:45:54
Question: I am writing a topic modeling program that uses Apache Tika to extract the text content from other file types. It actually runs perfectly in Eclipse, but when I export it to a JAR file to run from the Windows 10 command prompt, this error message appears when it reaches the line "parser.parse(stream, handler, metadata, parseContext);": "java.lang.SecurityException: Prohibited package name: java.sql". I didn't upload my Java code here because I don't think it is the root of the problem, since it run…
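
The code itself isn't shown, but this particular SecurityException is thrown when a classloader tries to define a class in a java.* package, which usually means something on the application classpath, often the exported runnable JAR itself, physically contains class files under java/. A small, neutral diagnostic sketch (the JAR path is passed as a command-line argument) that lists any such entries in the exported JAR:

    import java.util.Collections;
    import java.util.jar.JarEntry;
    import java.util.jar.JarFile;

    public class JarScan {
        public static void main(String[] args) throws Exception {
            try (JarFile jar = new JarFile(args[0])) {
                for (JarEntry e : Collections.list(jar.entries())) {
                    if (e.getName().startsWith("java/")) {
                        System.out.println(e.getName());  // any hit here would explain the SecurityException
                    }
                }
            }
        }
    }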