apache-tika | 易学教程

Get embedded resources in doc files using Apache Tika

I have MS Word documents containing text and images, and I want to parse them into an XML structure. After some research I ended up using Apache Tika to convert my documents. I can parse my doc to XML. Here is my code: AutoDetectParser parser=new AutoDetectParser(); InputStream input=new FileInputStream(new File("1.docx")); Metadata metadata = new Metadata(); StringWriter sw = new StringWriter(); SAXTransformerFactory factory = (SAXTransformerFactory)SAXTransformerFactory.newInstance(); TransformerHandler handler = factory.newTransformerHandler(); handler.getTransformer()
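The excerpt above is cut off while configuring the TransformerHandler. As a rough illustration of that part only, the following JDK-only sketch shows how a TransformerHandler turns SAX events into XML text in a StringWriter. The class name SaxToXmlSketch, the method render, and the hand-written `p` element are assumptions made for this example; with Tika, the SAX events would come from the parser's parse call rather than from manual calls.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

public class SaxToXmlSketch {

    // Serializes a tiny, hand-written stream of SAX events to an XML string.
    static String render(String text) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
        handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        // The handler writes whatever SAX events it receives into this writer.
        StringWriter sw = new StringWriter();
        handler.setResult(new StreamResult(sw));

        // Assumption: in the Tika case these events would be produced by the parser.
        handler.startDocument();
        handler.startElement("", "p", "p", new AttributesImpl());
        char[] chars = text.toCharArray();
        handler.characters(chars, 0, chars.length);
        handler.endElement("", "p", "p");
        handler.endDocument();
        return sw.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(render("Hello & welcome"));
    }
}
```

Because the handler is an ordinary SAX ContentHandler, the same object can be handed to any SAX event producer, which is why the question's code creates it before calling the parser.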

Apache Tika and character limit when parsing documents

Question: Could anybody please help me sort this out? The limit can be raised like this: Tika tika = new Tika(); tika.setMaxStringLength(10*1024*1024); But what if you don't use Tika directly, like this: ContentHandler textHandler = new BodyContentHandler(); Metadata metadata = new Metadata(); Parser parser = new AutoDetectParser(); ParseContext ps = new ParseContext(); for (InputStream is : getInputStreams()) { parser.parse(is, textHandler, metadata, ps); is.close(); System.out.println("Title: " + metadata.get(

Is Apache Tika able to extract foreign languages like Chinese, Japanese?

Is Apache Tika able to extract foreign languages like Chinese, Japanese? I have the following code: Detector detector = new DefaultDetector(); Parser parser = new AutoDetectParser(detector); InputStream stream = new ByteArrayInputStream(bytes); OutputStream outputstream = new ByteArrayOutputStream(); ContentHandler textHandler = new BodyContentHandler(outputstream); Metadata metadata = new Metadata(); // Set<String> langs = LanguageIdentifier.getSupportedLanguages(); // metadata.set(Metadata.CONTENT_LANGUAGE, lang); // metadata.set(Metadata.FORMAT, hint); ParseContext context = new

How to create Custom model using OpenNLP?

Question: I am trying to extract entities such as names and skills from a document using the OpenNLP Java API, but it is not extracting proper names. I am using the model available via the OpenNLP SourceForge link. Here is a piece of Java code: public class tikaOpenIntro { public static void main(String[] args) throws IOException, SAXException, TikaException { tikaOpenIntro toi = new tikaOpenIntro(); toi.filest(""); String cnt = toi.contentEx(); toi.sentenceD(cnt); toi.tokenization(cnt); String names = toi.namefind(toi

java.lang.IllegalArgumentException: protocol = http host = null

For this link, http://bits.blogs.nytimes.com/2014/09/02/uber-banned-across-germany-by-frankfurt-court/?partner=rss&emc=rss, this code doesn't work, but if I use another one, for example https://www.google.com, everything is OK: URL url = new URL("http://bits.blogs.nytimes.com/2014/09/02/uber-banned-across-germany-by-frankfurt-court/?partner=rss&emc=rss"); URLConnection uc; uc = url.openConnection(); uc.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_7; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.205 Safari/534.16"); uc.addRequestProperty("referer",

Solr: data import handler and Solr Cell

Is it possible to index rich documents (PDF, Office, ...) with the data import handler using Solr Cell? I use Solr 3.2. Thanks. Solr Cell, aka ExtractingRequestHandler, uses Apache Tika behind the scenes, and the latter can easily be integrated into a DataImportHandler: <dataConfig>  <dataSource type="BinURLDataSource"/> <document> <!-- The value of format can be text|xml|html|none. This is the format in which the body is emitted (the 'text' field). The implicit field 'text' will have that format. The default value is 'text' (if not specified). format=

Retrieving extracted text with Apache Solr

I'm new to Apache Solr, and I want to use it for indexing PDF files. I managed to get it up and running, and I can now search for added PDF files. However, I need to be able to retrieve the matched text from the results. I found an XML snippet in the default solrconfig.xml concerning exactly that: <requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler" startup="lazy"> <lst name="defaults">  <str name="fmap

How to index text files using apache solr

Question: I want to index text files. After a lot of searching I came across Apache Tika. Some of the sites where I read about Apache Tika say that it converts the text into XML format and then sends it to Solr. But while converting, it creates only one tag, for example ....... The text file I wish to index is a Tomcat localhost access log. This file is gigabytes in size, so I cannot store it as a single index entry. I want each line to have a line-id ....... so that I can easily retrieve the
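The question is truncated, but the idea of giving every log line its own id can be sketched without Tika at all. The following is a minimal JDK-only sketch, not a Solr integration: the class LineIdSketch, the method forEachLine, and the field name line_id are assumptions for illustration, and sending each line to Solr is left to whatever client is in use.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

public class LineIdSketch {

    // Streams the input line by line and passes each line, with a 1-based id,
    // to the sink. Only one line is held in memory at a time, which matters
    // for multi-gigabyte log files. Returns the number of lines seen.
    static long forEachLine(Reader source, BiConsumer<Long, String> sink) throws IOException {
        long lineId = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lineId++;
                sink.accept(lineId, line);
            }
        }
        return lineId;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical two-line access log standing in for the real file.
        String log = "GET /index.html 200\nGET /missing 404\n";
        List<String> docs = new ArrayList<>();
        long count = forEachLine(new StringReader(log),
                (id, text) -> docs.add("line_id=" + id + " text=" + text));
        System.out.println(count + " lines: " + docs);
    }
}
```

For a real file, the StringReader would be replaced by a reader over the log file, and the sink would build one index document per line instead of a string.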

How to read large files using Tika?

I'm parsing large PDF and Word documents using Tika, but I get the following error message: "Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available)." How can I increase the limit? Assuming you're basically following the Tika example for extracting to plain text, all you need to do is create your BodyContentHandler with a write limit of -1 to disable the write limit, as explained in the javadocs. Your code would then look something like (
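The answer's own code is cut off above. To illustrate what a write limit means for a SAX content handler, here is a JDK-only sketch. LimitedTextHandler is a hypothetical stand-in written for this example and is not Tika's BodyContentHandler; it only mirrors the behaviour described in the excerpt, where text up to the limit stays available and -1 disables the limit.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WriteLimitSketch {

    // Hypothetical handler with a character budget; a negative limit means unlimited.
    static class LimitedTextHandler extends DefaultHandler {
        private final int writeLimit;
        private final StringBuilder text = new StringBuilder();

        LimitedTextHandler(int writeLimit) {
            this.writeLimit = writeLimit;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (writeLimit < 0 || text.length() + length <= writeLimit) {
                text.append(ch, start, length);
                return;
            }
            // Keep the text up to the limit, then signal that the limit was hit.
            text.append(ch, start, writeLimit - text.length());
            throw new SAXException("write limit of " + writeLimit + " characters reached");
        }

        String text() {
            return text.toString();
        }
    }

    // Feeds chunks of text to the handler and returns whatever was collected.
    static String feed(int writeLimit, String... chunks) {
        LimitedTextHandler handler = new LimitedTextHandler(writeLimit);
        try {
            for (String chunk : chunks) {
                char[] chars = chunk.toCharArray();
                handler.characters(chars, 0, chars.length);
            }
        } catch (SAXException limitReached) {
            // The text collected before the limit is still usable.
        }
        return handler.text();
    }

    public static void main(String[] args) {
        System.out.println(feed(5, "abc", "defg"));
        System.out.println(feed(-1, "abc", "defg"));
    }
}
```

With a limit of 5 the second chunk is truncated, while a limit of -1 collects everything, which is the effect the answer describes for a BodyContentHandler created with -1.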

Index pdf file content using Apache Solr

Question: I'm using Solr's PHP extension for interacting with Apache Solr. I'm indexing data from the database, and I also want to index the contents of external files (such as PDF and PPTX). The logic for indexing is as follows. Suppose schema.xml has the following fields defined: <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" /> <field name="created" type="tlong" indexed="true" stored="true" /> <field name="name" type="text_general" indexed="true" stored="true"/>