apache-tika

Get embedded resources in doc files using Apache Tika

核能气质少年 submitted on 2019-12-05 16:01:28
I have MS Word documents containing text and images, and I want to parse them into an XML structure. After some research I ended up using Apache Tika to convert my documents. I can parse my doc to XML. Here is my code: AutoDetectParser parser=new AutoDetectParser(); InputStream input=new FileInputStream(new File("1.docx")); Metadata metadata = new Metadata(); StringWriter sw = new StringWriter(); SAXTransformerFactory factory = (SAXTransformerFactory)SAXTransformerFactory.newInstance(); TransformerHandler handler = factory.newTransformerHandler(); handler.getTransformer()

Apache Tika and character limit when parsing documents

元气小坏坏 submitted on 2019-12-05 12:35:06
Question: Could anybody please help me sort this out? The character limit can be set like this: Tika tika = new Tika(); tika.setMaxStringLength(10*1024*1024); But what if you don't use Tika directly, like this: ContentHandler textHandler = new BodyContentHandler(); Metadata metadata = new Metadata(); Parser parser = new AutoDetectParser(); ParseContext ps = new ParseContext(); for (InputStream is : getInputStreams()) { parser.parse(is, textHandler, metadata, ps); is.close(); System.out.println("Title: " + metadata.get(

Is Apache Tika able to extract foreign languages like Chinese, Japanese?

蓝咒 submitted on 2019-12-05 11:04:10
Is Apache Tika able to extract foreign languages like Chinese, Japanese? I have the following code: Detector detector = new DefaultDetector(); Parser parser = new AutoDetectParser(detector); InputStream stream = new ByteArrayInputStream(bytes); OutputStream outputstream = new ByteArrayOutputStream(); ContentHandler textHandler = new BodyContentHandler(outputstream); Metadata metadata = new Metadata(); // Set<String> langs = LanguageIdentifier.getSupportedLanguages(); // metadata.set(Metadata.CONTENT_LANGUAGE, lang); // metadata.set(Metadata.FORMAT, hint); ParseContext context = new

How to create a custom model using OpenNLP?

寵の児 submitted on 2019-12-05 02:46:23
Question: I am trying to extract entities such as names and skills from a document using the OpenNLP Java API, but it is not extracting names correctly. I am using the model available at the OpenNLP SourceForge link. Here is a piece of Java code: public class tikaOpenIntro { public static void main(String[] args) throws IOException, SAXException, TikaException { tikaOpenIntro toi = new tikaOpenIntro(); toi.filest(""); String cnt = toi.contentEx(); toi.sentenceD(cnt); toi.tokenization(cnt); String names = toi.namefind(toi

java.lang.IllegalArgumentException: protocol = http host = null

混江龙づ霸主 submitted on 2019-12-05 01:11:57
For this link, http://bits.blogs.nytimes.com/2014/09/02/uber-banned-across-germany-by-frankfurt-court/?partner=rss&emc=rss, this code doesn't work, but if I use another one, for example https://www.google.com, everything is OK: URL url = new URL("http://bits.blogs.nytimes.com/2014/09/02/uber-banned-across-germany-by-frankfurt-court/?partner=rss&emc=rss"); URLConnection uc; uc = url.openConnection(); uc.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_7; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.205 Safari/534.16"); uc.addRequestProperty("referer",

Solr: DataImportHandler and Solr Cell

北战南征 submitted on 2019-12-04 22:55:52
Is it possible to index rich documents (PDF, Office, ...) with the DataImportHandler using Solr Cell? I use Solr 3.2. Thanks. Solr Cell, aka ExtractingRequestHandler, uses Apache Tika behind the scenes, and the latter can easily be integrated into a DataImportHandler: <dataConfig> <!-- use any of type DataSource<InputStream> --> <dataSource type="BinURLDataSource"/> <document> <!-- The value of format can be text|xml|html|none. This is the format in which the body is emitted (the 'text' field). The implicit field 'text' will have that format. The default value is 'text' (if not specified). format=

Retrieving extracted text with Apache Solr

好久不见. submitted on 2019-12-04 16:01:36
I'm new to Apache Solr, and I want to use it for indexing PDF files. I have managed to get it up and running, and I can now search for added PDF files. However, I need to be able to retrieve the searched text from the results. I found an XML snippet in the default solrconfig.xml concerning exactly that: <requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler" startup="lazy"> <lst name="defaults"> <!-- All the main content goes into "text"... if you need to return the extracted text or do highlighting, use a stored field. --> <str name="fmap

How to index text files using Apache Solr

假如想象 submitted on 2019-12-04 14:18:22
Question: I wanted to index text files. After a lot of searching I learned about Apache Tika. According to some sites I read, Apache Tika converts the text into XML format and then sends it to Solr. But while converting, it creates only one tag, for example ....... The text file I wish to index is a Tomcat localhost access file. This file is gigabytes in size. I cannot store it as a single index entry. I want each line to have a line-id ....... so that I can easily retrieve the
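The per-line requirement described above can be sketched in plain Java. This is only an illustration, assuming the log is read as a stream and each line is handed to a caller-supplied sink together with a 1-based line id. The names AccessLogLines, forEachLine, and sink are hypothetical, and the step that would send each record to Solr is deliberately left out.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.function.BiConsumer;

// Hypothetical helper: not part of Tika or Solr.
final class AccessLogLines {

    // Reads the input one line at a time, so a multi-gigabyte file is never
    // held in memory as a whole. Each line is passed to the sink with a
    // 1-based line id. Returns the number of lines read.
    static long forEachLine(BufferedReader reader, BiConsumer<Long, String> sink)
            throws IOException {
        long lineId = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            lineId++;
            // Assumption: the sink turns (lineId, line) into one index document.
            sink.accept(lineId, line);
        }
        return lineId;
    }
}
```

A caller would wrap the access file in a BufferedReader (for example via java.nio.file.Files.newBufferedReader) and supply a sink that builds one document per line, with the line id as its unique key.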

How to read large files using Tika?

自作多情 submitted on 2019-12-03 23:49:50
I'm parsing large PDF and Word documents using Tika, but I get the following error message: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available). How can I increase the limit? Assuming you're basically following the Tika example for extracting to plain text, all you need to do is create your BodyContentHandler with a write limit of -1 to disable the write limit, as explained in the javadocs. Your code would then look something like (
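The original snippet is cut off here, so the following is a minimal sketch of the approach just described rather than the answerer's actual code. It assumes Tika (including its parser modules) is on the classpath. The class name UnlimitedTextExtractor and the method extractText are hypothetical. The only essential detail is passing -1 to the BodyContentHandler constructor.

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical wrapper class, named here for illustration only.
final class UnlimitedTextExtractor {

    // Extracts the body text of a document without the default
    // 100000-character write limit.
    static String extractText(InputStream stream)
            throws IOException, SAXException, TikaException {
        // A write limit of -1 disables the limit entirely.
        BodyContentHandler handler = new BodyContentHandler(-1);
        new AutoDetectParser().parse(stream, handler, new Metadata(), new ParseContext());
        return handler.toString();
    }
}
```

With the limit disabled, the handler buffers the entire extracted text in memory, so very large documents may need a correspondingly larger heap.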

Index PDF file content using Apache Solr

核能气质少年 submitted on 2019-12-03 21:49:39
Question: I'm using Solr's PHP extension to interact with Apache Solr. I'm indexing data from the database. I want to index the contents of external files (like PDFs, PPTX) as well. The logic for indexing is: Suppose the schema.xml has the following fields defined: <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" /> <field name="created" type="tlong" indexed="true" stored="true" /> <field name="name" type="text_general" indexed="true" stored="true"/>