apache-tika

Can Solr retain the formatting of the HTML documents which were fed to it in its results?

时光毁灭记忆、已成空白 submitted on 2019-12-25 09:42:11
Question: How do I maintain the original formatting of the HTML documents in the results given by Solr? I am trying to provide search functionality on one of my company's websites, which has millions of documents, and they do not all share the same formatting, so it is hard to format each document individually. I am using the Solr 4.1 nightly builds from the Apache site, which have built-in support for solr-cell and Tika, i.e. I do not need to configure them separately. Does solr-cell or Tika retain these
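The truncated question above turns on whether markup survives extraction. In Tika itself this depends on the ContentHandler: BodyContentHandler strips the XHTML that Tika's parsers emit down to plain text, while ToXMLContentHandler keeps the markup. A minimal sketch (standalone Tika rather than Solr's solr-cell wiring, and the HTML string is a made-up sample):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.ToXMLContentHandler;

public class KeepMarkup {
    public static void main(String[] args) throws Exception {
        String html = "<html><body><p>Hello <b>world</b></p></body></html>";
        InputStream in = new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));

        // ToXMLContentHandler serializes the XHTML SAX events that Tika's
        // parsers emit; BodyContentHandler would reduce them to plain text.
        ToXMLContentHandler handler = new ToXMLContentHandler();
        new HtmlParser().parse(in, handler, new Metadata(), new ParseContext());
        System.out.println(handler.toString()); // XHTML with markup preserved
    }
}
```

Whether Solr then stores that markup is a separate matter of how the extracted field is configured, which the truncated question does not show.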

Split Documents into Paragraphs

人走茶凉 submitted on 2019-12-25 09:27:08
Question: I have a large stockpile of PDF documents. I use Apache Tika to convert them to text, and now I'd like to split them into paragraphs. I can't use a single regular expression because the text conversion makes the distinction between paragraphs inconsistent: some documents have the standard single \n between paragraphs, but some have a \n between lines in the same paragraph and then a double \n between paragraphs (using Tika's conversion to HTML instead of text does not help). Python's NLTK book
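One heuristic for the mixed conventions described above is to detect whether a document uses blank-line paragraph breaks and adapt the split accordingly. A rough sketch in Java (the class name and the heuristic itself are illustrative, not a robust solution):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ParagraphSplitter {

    /**
     * Heuristic paragraph split: if the text contains blank lines,
     * treat them as paragraph breaks (a single \n is then just a line
     * wrap); otherwise fall back to one paragraph per line.
     */
    static List<String> split(String text) {
        String delim = text.contains("\n\n") ? "\\n\\s*\\n" : "\n";
        return Arrays.stream(text.split(delim))
                .map(p -> p.replace('\n', ' ').trim())  // unwrap soft line breaks
                .filter(p -> !p.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String doubled = "First line\nstill first.\n\nSecond paragraph.";
        System.out.println(split(doubled));
        // [First line still first., Second paragraph.]
    }
}
```

The per-document detection matters: applying the blank-line rule to a document that uses single newlines would merge everything into one paragraph.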

Extract file content with ManifoldCF

倾然丶 夕夏残阳落幕 submitted on 2019-12-24 16:17:16
Question: I'm trying to use ManifoldCF with the File System Connector. It works like a charm: with the Tika content extractor implemented, I get all the expected metadata from my documents. But how do I configure ManifoldCF to get the equivalent of this command: java -jar tika-app-1.9.jar --text? I mean, I want to get the CONTENT of the file and push it to my Output Connections. How is this possible? Answer 1: You have to set up the transformer in the pipeline. Before you configure your output

Tika 1.13 RuntimeException

孤人 submitted on 2019-12-24 08:08:38
Question: I recently updated my existing Tika project to use Tika 1.13 instead of 1.10. The only thing I did was change the dependency version from 1.10 to 1.13. The project built successfully, yet whenever I try to run the application I get this exception:

java.lang.RuntimeException: Unable to parse the default media type registry
    at org.apache.tika.mime.MimeTypes.getDefaultMimeTypes(MimeTypes.java:580)
    at org.apache.tika.config.TikaConfig.getDefaultMimeTypes(TikaConfig.java:69)
    at org.apache
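One common cause of this exception, consistent with a partial version bump, is mixed Tika versions on the classpath (e.g. tika-parsers 1.13 resolving next to a stale tika-core 1.10, so the newer mimetypes definition file cannot be read by the older core). A hedged pom.xml sketch that pins both artifacts to the same version:

```xml
<dependencies>
  <!-- keep tika-core and tika-parsers at the exact same version -->
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>1.13</version>
  </dependency>
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.13</version>
  </dependency>
</dependencies>
```

Running `mvn dependency:tree` and looking for a second org.apache.tika version is a quick way to confirm or rule out this diagnosis.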

How to store the file path in Solr when using TikaEntityProcessor

狂风中的少年 submitted on 2019-12-24 06:42:57
Question: I am using DIH to index the local file system, but the file path, size, and lastModified fields were not stored. In schema.xml I defined:

<fields>
  <field name="title" type="string" indexed="true" stored="true"/>
  <field name="author" type="string" indexed="true" stored="true"/>
  <!--<field name="text" type="text" indexed="true" stored="true"/> liang added-->
  <field name="path" type="string" indexed="true" stored="true"/>
  <field name="size" type="long" indexed="true" stored="true"/>
  <field
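For the question above, DIH's FileListEntityProcessor exposes implicit columns (fileAbsolutePath, fileSize, fileLastModified) that can be mapped onto schema fields explicitly. A hedged data-config.xml sketch (the baseDir, fileName pattern, and Tika column names are assumptions for illustration, not the questioner's actual config):

```xml
<dataConfig>
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="/data/docs" fileName=".*\.(doc|pdf)" rootEntity="false">
      <!-- implicit columns provided by FileListEntityProcessor -->
      <field column="fileAbsolutePath" name="path"/>
      <field column="fileSize" name="size"/>
      <field column="fileLastModified" name="lastmodified"/>
      <entity name="tika" processor="TikaEntityProcessor"
              url="${files.fileAbsolutePath}" dataSource="bin" format="text">
        <field column="title" name="title"/>
        <field column="Author" name="author"/>
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

The key detail is mapping the implicit columns in the outer entity; fields that are never mapped are silently dropped even when the schema declares them.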

How to convert doc to docx using Tika or POI?

折月煮酒 submitted on 2019-12-24 05:06:07
Question: Can anyone help me convert a .doc file to .docx using Apache Tika or Apache POI? I have tried a lot of ways but am stuck converting the document to .docx format. Help is appreciated. Thanks and regards, Arun R S. Answer 1: I did this recently, and I used Aspose Words to convert .doc to .docx. Very convenient:

Document doc = new Document(filePath);
doc.save(descFilePath);

Just two lines. Source: https://stackoverflow.com/questions/23886205/how-to-convert-doc-to-docx-using-tika-or-poi

XPath application using tika parser

梦想的初衷 submitted on 2019-12-24 02:43:07
Question: I want to clean irregular web content (it may be HTML, PDF, image, etc.), mostly HTML. I am using the Tika parser for that, but I don't know how to apply XPath as I do in HtmlCleaner. The code I use is:

BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
URL u = new URL("http://stackoverflow.com/questions/9128696/is-there-any-way-to-reach-drop-moment-in-drag-and-drop");
new HtmlParser().parse(u.openStream(), handler,
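One approach to the question above: have Tika emit well-formed XHTML (e.g. with ToXMLContentHandler in place of BodyContentHandler) and then run standard javax.xml.xpath over it. A stdlib-only sketch that uses a hand-written XHTML string as a stand-in for Tika's output (real Tika output carries the XHTML namespace, which the XPath expressions would need to account for):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;

public class XPathOnXhtml {
    public static void main(String[] args) throws Exception {
        // Stand-in for Tika output: parsers emit XHTML when given a
        // structure-preserving handler such as ToXMLContentHandler.
        String xhtml = "<html><body><h1>Title</h1><p>Body text</p></body></html>";

        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        System.out.println(xpath.evaluate("/html/body/h1", doc)); // Title
    }
}
```

This keeps Tika responsible for normalizing messy input into parseable XHTML, while the XPath step stays plain JAXP.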

Alfresco Community 4.0 doesn't recognize the DITA file mimetype

99封情书 submitted on 2019-12-24 02:39:09
Question: So I've installed Community 4.0.a and extended the mimetype list using mimetype-map.xml as I did before in 3.4:

<alfresco-config area="mimetype-map">
  <config evaluator="string-compare" condition="Mimetype Map">
    <mimetypes>
      <mimetype mimetype="application/dita+xml" text="true" display="DITA">
        <extension default="true" display="DITA Topic">dita</extension>
        <extension default="true" display="DITA Map">ditamap</extension>
        <extension default="true" display="DITA Conditional Processing Profile"

Extract text from a large pdf with Tika

…衆ロ難τιáo~ submitted on 2019-12-23 03:19:42
Question: I am trying to extract text from a large PDF, but I only get the first pages; I need all of the text to be passed to a string variable. This is the code:

public class ParsePDF {
    public static void main(String args[]) throws Exception {
        try {
            File file = new File("C:/vlarge.pdf");
            String content = new Tika().parseToString(file);
            System.out.println("The Content: " + content);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Answer 1: From the Javadocs: To avoid unpredictable excess memory use, the returned
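The truncated Javadoc quote above refers to the string-length limit on the Tika facade's parseToString() methods, which defaults to 100,000 characters. A hedged sketch of one way around it, raising the limit before parsing ("C:/vlarge.pdf" is the questioner's path, assumed to exist):

```java
import java.io.File;

import org.apache.tika.Tika;

public class ParseLargePdf {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        // parseToString() truncates at getMaxStringLength() characters
        // (100,000 by default); a negative value disables the limit,
        // at the cost of holding the whole document text in memory.
        tika.setMaxStringLength(-1);
        String content = tika.parseToString(new File("C:/vlarge.pdf"));
        System.out.println(content.length());
    }
}
```

For documents too large to hold in one String, streaming the output through a parser with a Writer-backed handler avoids the memory cost entirely; the facade shortcut above is only reasonable when the full text must end up in a single variable, as the questioner wants.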