apache-tika

How to create a custom model using OpenNLP?

别来无恙 submitted on 2019-12-03 17:16:55
I am trying to extract entities such as names and skills from documents using the OpenNLP Java API, but it is not extracting names properly. I am using the models available on the OpenNLP SourceForge page. Here is a piece of Java code:

    public class tikaOpenIntro {

        public static void main(String[] args) throws IOException, SAXException, TikaException {
            tikaOpenIntro toi = new tikaOpenIntro();
            toi.filest("");
            String cnt = toi.contentEx();
            toi.sentenceD(cnt);
            toi.tokenization(cnt);
            String names = toi.namefind(toi.Tokens);
            toi.files(names);
        }

        public String Tokens[];

        public String contentEx() throws IOException,
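For reference, a minimal sketch of running the pre-trained OpenNLP person-name model over text already extracted with Tika might look like the following; the model file names (en-token.bin, en-ner-person.bin), paths and the sample sentence are assumptions, not taken from the question:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;

public class NameFinderSketch {
    public static void main(String[] args) throws IOException {
        // Hypothetical model paths: download the pre-trained models from the OpenNLP model repository.
        try (InputStream tokenModelIn = new FileInputStream("en-token.bin");
             InputStream nerModelIn = new FileInputStream("en-ner-person.bin")) {

            // Tokenize the plain text (e.g. the content Tika extracted from the document).
            TokenizerME tokenizer = new TokenizerME(new TokenizerModel(tokenModelIn));
            String[] tokens = tokenizer.tokenize("John Smith has five years of Java experience.");

            // Run the person-name finder over the tokens and print the detected names.
            NameFinderME nameFinder = new NameFinderME(new TokenNameFinderModel(nerModelIn));
            Span[] spans = nameFinder.find(tokens);
            for (String name : Span.spansToStrings(spans, tokens)) {
                System.out.println(name);
            }
        }
    }
}
```

The pre-trained models only recognize the entity types they were trained on (person, location, organization, etc.); extracting something like skills would require training a custom model on annotated data, for example with NameFinderME.train, whose exact signature varies between OpenNLP versions.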

How to extract main text from HTML using Tika

依然范特西╮ submitted on 2019-12-03 16:22:20
I just want to know how I can extract the main text and plain text from HTML using Tika. Maybe one possible solution is to use BoilerpipeContentHandler, but do you have some sample/demo code to show it? Thanks very much in advance. Here is a sample:

    public String[] tika_autoParser() {
        String[] result = new String[3];
        try {
            InputStream input = new FileInputStream(new File("/Users/nazanin/Books/Web crawler.pdf"));
            ContentHandler textHandler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            AutoDetectParser parser = new AutoDetectParser();
            ParseContext context = new ParseContext();
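A minimal sketch of the Boilerpipe route, assuming tika-parsers (which pulls in the boilerpipe dependency) is on the classpath; the input file name is made up for this example:

```java
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.BoilerpipeContentHandler;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;

public class BoilerpipeSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical input path; Boilerpipe targets HTML pages, not PDFs.
        try (InputStream input = new FileInputStream("page.html")) {
            // BodyContentHandler collects plain text; BoilerpipeContentHandler filters out
            // navigation, ads and other boilerplate so only the main article text remains.
            BodyContentHandler textHandler = new BodyContentHandler(-1);
            BoilerpipeContentHandler boilerpipeHandler = new BoilerpipeContentHandler(textHandler);

            new HtmlParser().parse(input, boilerpipeHandler, new Metadata(), new ParseContext());
            System.out.println(textHandler.toString());
        }
    }
}
```

For plain text of the whole page (without Boilerpipe's main-content filtering), passing the BodyContentHandler directly to AutoDetectParser, as in the question's sample, is enough.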

How to index text files using Apache Solr

懵懂的女人 submitted on 2019-12-03 09:01:39
I wanted to index text files. After searching a lot I got to know about Apache Tika. On some sites where I studied Apache Tika, I learned that it converts the text into XML format and then sends it to Solr, but while converting it creates only one tag, for example ....... Now the text file I wish to index is a Tomcat localhost access log. This file is in GBs; I cannot store it as one single document in the index. I want each line to have a line id ....... so that I can easily retrieve the matching line. Can this be done in Apache Tika? Solr with Tika supports extraction of data from multiple
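Since an access log is already plain text, Tika is not strictly needed; one common approach is to skip it and index each line as its own Solr document, e.g. with SolrJ. A minimal sketch, assuming a local Solr core named "logs" and schema fields id, line_number and content (the URL, core, field names and log path are all assumptions; the HttpSolrClient builder shown is the pre-Solr-9 API):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class LinePerDocumentIndexer {
    public static void main(String[] args) throws IOException, SolrServerException {
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/logs").build();
             BufferedReader reader = Files.newBufferedReader(Paths.get("localhost_access_log.txt"))) {

            String line;
            long lineNo = 0;
            while ((line = reader.readLine()) != null) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "access-log-" + lineNo);  // per-line id, so matching lines can be retrieved
                doc.addField("line_number", lineNo);
                doc.addField("content", line);
                solr.add(doc);
                lineNo++;
            }
            solr.commit();
        }
    }
}
```

For multi-GB files you would batch the solr.add calls (e.g. a few thousand documents per request) rather than sending one document at a time, but the per-line-document layout is the same.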

Remove PDFont caching with Apache Tika

五迷三道 submitted on 2019-12-02 20:15:33
Question: I am trying to extract text only from a number of different documents (RTF, DOC, PDF). I naturally turned to Apache Tika because it can auto-detect the document type and extract text accordingly. I am only interested in the text, not formatting etc. My application ends up with a big memory leak, and on investigating it, this is coming from caching in the PDFont class from the PDFBox dependency. I am not interested in caching font metrics and other font formatting issues from PDFs as I want to only

Tika Parser: Exclude PDF Attachments

折月煮酒 submitted on 2019-12-02 12:36:58
Question: There is a PDF document that has attachments (here: joboptions) that should not be extracted by Tika. The contents should not be sent to Solr. Is there any way to exclude certain (or all) PDF attachments in the Tika config?

Answer 1: Implement a custom org.apache.tika.extractor.DocumentSelector and set it on the ParseContext. The DocumentSelector is called with the metadata of the embedded document to decide whether the embedded document should be parsed. Example DocumentSelector:

    public class
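The answer's example is cut off above; a minimal sketch of what such a selector could look like (the ".joboptions" file-name check and the class name are assumptions made for this sketch, not the answer's original code):

```java
import org.apache.tika.extractor.DocumentSelector;
import org.apache.tika.metadata.Metadata;

public class SkipAttachmentsSelector implements DocumentSelector {

    @Override
    public boolean select(Metadata metadata) {
        // Called once per embedded document; return false to skip parsing it.
        String name = metadata.get(Metadata.RESOURCE_NAME_KEY);
        if (name != null && name.endsWith(".joboptions")) {
            return false;  // exclude this attachment from extraction
        }
        return true;       // parse everything else
    }
}
```

It is registered before parsing with parseContext.set(DocumentSelector.class, new SkipAttachmentsSelector()); returning false unconditionally would skip all embedded documents.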

“java.lang.SecurityException: Prohibited package name: java.sql” error happens only when executing outside of Eclipse

萝らか妹 submitted on 2019-12-01 20:56:02
I am writing a topic modeling program that uses Apache Tika to extract the text contents from other file types. It actually runs perfectly in Eclipse, but when I export it to a JAR file to use from the command prompt on Windows 10, this error message appears when it reaches the line "parser.parse(stream, handler, metadata, parseContext);":

    java.lang.SecurityException: Prohibited package name: java.sql

I didn't upload my Java code here because I don't think it is the root of the problem, since it runs perfectly inside the Eclipse IDE. Does anyone know why it only happens when I try to run it from the command line?

Correct use of Apache Tika MediaType

痞子三分冷 submitted on 2019-12-01 14:23:55
I want to use Apache Tika's MediaType class to compare media types. I first use Tika to detect the MediaType, then I want to start an action according to the MediaType: if the MediaType is an XML type I want to do some action, and if it is a compressed file I want to start another action. My problem is that there are many XML types, so how do I check whether it is XML using the MediaType? Here is my previous (pre-Tika) implementation:

    if (contentType.contains("text/xml")
            || contentType.contains("application/xml")
            || contentType.contains("application/x-xml")
            || contentType.contains(
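One way to avoid enumerating every XML type is to combine Tika's MediaTypeRegistry with the "+xml" structured-syntax suffix. A minimal sketch; the helper name isXml is made up for this example, and which types the registry treats as specializations of application/xml depends on the tika-mimetypes definitions in your Tika version:

```java
import org.apache.tika.mime.MediaType;
import org.apache.tika.mime.MediaTypeRegistry;

public class MediaTypeChecks {

    private static final MediaTypeRegistry REGISTRY = MediaTypeRegistry.getDefaultRegistry();
    private static final MediaType TEXT_XML = MediaType.parse("text/xml");

    /** Hypothetical helper: true if the detected type is XML or an XML-based format. */
    public static boolean isXml(MediaType type) {
        MediaType base = type.getBaseType();   // drop charset and other parameters
        return base.equals(MediaType.APPLICATION_XML)
                || base.equals(TEXT_XML)
                || base.getSubtype().endsWith("+xml")                        // e.g. application/rss+xml
                || REGISTRY.isSpecializationOf(base, MediaType.APPLICATION_XML);
    }
}
```

After detection (for example MediaType mediaType = new AutoDetectParser().getDetector().detect(stream, metadata);) the check becomes a single if (isXml(mediaType)) branch instead of a chain of string comparisons.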

Apache Tika detects MIME type incorrectly for CSV

﹥>﹥吖頭↗ submitted on 2019-12-01 14:03:24
I've created a .csv file using Excel and wrote the following code using Apache Tika:

    public static boolean checkThatMimeTypeIsCsv(InputStream inputStream) throws IOException {
        BufferedInputStream bis = new BufferedInputStream(inputStream);
        AutoDetectParser parser = new AutoDetectParser();
        Detector detector = parser.getDetector();
        Metadata md = new Metadata();
        MediaType mediaType = detector.detect(bis, md);
        return "text/csv".equals(mediaType.toString());
    }

    public static void main(String[] args) throws IOException {
        System.out.println(checkThatMimeTypeIsCsv(new FileInputStream("Data.csv")));
    }

But
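A likely cause (the question is cut off here, so this is a hedged guess) is that content-based detection alone cannot reliably tell CSV apart from plain text, so Tika falls back to text/plain; giving the detector the original file name via the Metadata usually makes it report text/csv. A sketch of that variant, with the same hypothetical Data.csv input:

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.detect.Detector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AutoDetectParser;

public class CsvDetectionSketch {

    public static MediaType detectWithNameHint(InputStream in, String fileName) throws IOException {
        Detector detector = new AutoDetectParser().getDetector();
        Metadata md = new Metadata();
        // Pass the original file name as a hint; without it, CSV content usually
        // looks like plain text to the detector. (RESOURCE_NAME_KEY is the Tika 1.x
        // constant; newer versions use TikaCoreProperties.RESOURCE_NAME_KEY.)
        md.set(Metadata.RESOURCE_NAME_KEY, fileName);
        return detector.detect(new BufferedInputStream(in), md);
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = new FileInputStream("Data.csv")) {
            System.out.println(detectWithNameHint(in, "Data.csv"));  // typically text/csv
        }
    }
}
```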

Indexing all documents in a doc folder into Solr with FileListEntityProcessor

喜你入骨 submitted on 2019-12-01 11:39:57
http://wiki.apache.org/solr/ExtractingRequestHandler does not provide much information on how to configure this handler in a web application which has its own context and wants to use Solr's server features as embedded Solr. Can you please provide some information on how to upload the documents to Solr and search for some content from those documents? I have configured the DIH in solrConf.xml as

    <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
        <lst name="defaults">
            <str name="config">tika-data-config.xml</str>
        </lst>
    </requestHandler>

and tika-data