apache-tika

How to create a custom model using OpenNLP?

别来无恙 submitted on 2019-12-03 17:16:55
I am trying to extract entities such as names and skills from documents using the OpenNLP Java API, but it is not extracting names properly. I am using the models available on the OpenNLP SourceForge page. Here is a piece of Java code:

    public class tikaOpenIntro {

        public static void main(String[] args) throws IOException, SAXException, TikaException {
            tikaOpenIntro toi = new tikaOpenIntro();
            toi.filest("");
            String cnt = toi.contentEx();
            toi.sentenceD(cnt);
            toi.tokenization(cnt);
            String names = toi.namefind(toi.Tokens);
            toi.files(names);
        }

        public String Tokens[];

        public String contentEx() throws IOException,
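For reference, a minimal sketch of running the pre-trained OpenNLP person-name model over text already extracted with Tika might look like the following; the model file names (en-token.bin, en-ner-person.bin), paths and the sample sentence are assumptions, not taken from the question:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;

public class NameFinderSketch {
    public static void main(String[] args) throws IOException {
        // Hypothetical model paths: download the pre-trained models from the OpenNLP model repository.
        try (InputStream tokenModelIn = new FileInputStream("en-token.bin");
             InputStream nerModelIn = new FileInputStream("en-ner-person.bin")) {

            // Tokenize the plain text (e.g. the content Tika extracted from the document).
            TokenizerME tokenizer = new TokenizerME(new TokenizerModel(tokenModelIn));
            String[] tokens = tokenizer.tokenize("John Smith has five years of Java experience.");

            // Run the person-name finder over the tokens and print the detected names.
            NameFinderME nameFinder = new NameFinderME(new TokenNameFinderModel(nerModelIn));
            Span[] spans = nameFinder.find(tokens);
            for (String name : Span.spansToStrings(spans, tokens)) {
                System.out.println(name);
            }
        }
    }
}
```

The pre-trained models only recognize the entity types they were trained on (person, location, organization, etc.); extracting something like skills would require training a custom model on annotated data, for example with NameFinderME.train, whose exact signature varies between OpenNLP versions.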

How to extract main text from HTML using Tika

依然范特西╮ submitted on 2019-12-03 16:22:20
I just want to know how I can extract the main text and plain text from HTML using Tika. Maybe one possible solution is to use BoilerpipeContentHandler, but do you have some sample/demo code to show it? Thanks very much in advance. Here is a sample:

    public String[] tika_autoParser() {
        String[] result = new String[3];
        try {
            InputStream input = new FileInputStream(new File("/Users/nazanin/Books/Web crawler.pdf"));
            ContentHandler textHandler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            AutoDetectParser parser = new AutoDetectParser();
            ParseContext context = new ParseContext();
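A minimal sketch of the Boilerpipe route, assuming tika-parsers (which pulls in the boilerpipe dependency) is on the classpath; the input file name is made up for this example:

```java
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.BoilerpipeContentHandler;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;

public class BoilerpipeSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical input path; Boilerpipe targets HTML pages, not PDFs.
        try (InputStream input = new FileInputStream("page.html")) {
            // BodyContentHandler collects plain text; BoilerpipeContentHandler filters out
            // navigation, ads and other boilerplate so only the main article text remains.
            BodyContentHandler textHandler = new BodyContentHandler(-1);
            BoilerpipeContentHandler boilerpipeHandler = new BoilerpipeContentHandler(textHandler);

            new HtmlParser().parse(input, boilerpipeHandler, new Metadata(), new ParseContext());
            System.out.println(textHandler.toString());
        }
    }
}
```

For plain text of the whole page (without Boilerpipe's main-content filtering), passing the BodyContentHandler directly to AutoDetectParser, as in the question's sample, is enough.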

How to index text files using Apache Solr

懵懂的女人 submitted on 2019-12-03 09:01:39
I wanted to index text files. After searching a lot I got to know about Apache Tika. On some sites where I studied Apache Tika, I learned that it converts the text into XML format and then sends it to Solr, but while converting it creates only one tag, for example ....... Now the text file I wish to index is a Tomcat localhost access log. This file is in GBs; I cannot store it as one single document in the index. I want each line to have a line id ....... so that I can easily retrieve the matching line. Can this be done in Apache Tika? Solr with Tika supports extraction of data from multiple
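Since an access log is already plain text, Tika is not strictly needed; one common approach is to skip it and index each line as its own Solr document, e.g. with SolrJ. A minimal sketch, assuming a local Solr core named "logs" and schema fields id, line_number and content (the URL, core, field names and log path are all assumptions; the HttpSolrClient builder shown is the pre-Solr-9 API):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class LinePerDocumentIndexer {
    public static void main(String[] args) throws IOException, SolrServerException {
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/logs").build();
             BufferedReader reader = Files.newBufferedReader(Paths.get("localhost_access_log.txt"))) {

            String line;
            long lineNo = 0;
            while ((line = reader.readLine()) != null) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "access-log-" + lineNo);  // per-line id, so matching lines can be retrieved
                doc.addField("line_number", lineNo);
                doc.addField("content", line);
                solr.add(doc);
                lineNo++;
            }
            solr.commit();
        }
    }
}
```

For multi-GB files you would batch the solr.add calls (e.g. a few thousand documents per request) rather than sending one document at a time, but the per-line-document layout is the same.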

Remove PDFont caching with Apache Tika

五迷三道 submitted on 2019-12-02 20:15:33
Question: I am trying to extract text only from a number of different documents (RTF, DOC, PDF). I naturally turned to Apache Tika because it can auto-detect the document type and extract text accordingly. I am only interested in the text, not formatting etc. My application ends up with a big memory leak, and on investigating it, this is coming from caching in the PDFont class from the PDFBox dependency. I am not interested in caching font metrics and other font formatting issues from PDFs as I want to only

Tika Parser: Exclude PDF Attachments

折月煮酒 submitted on 2019-12-02 12:36:58
Question: There is a PDF document that has attachments (here: joboptions) that should not be extracted by Tika. The contents should not be sent to Solr. Is there any way to exclude certain (or all) PDF attachments in the Tika config?

Answer 1: Implement a custom org.apache.tika.extractor.DocumentSelector and set it on the ParseContext. The DocumentSelector is called with the metadata of the embedded document to decide whether the embedded document should be parsed. Example DocumentSelector:

    public class
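The answer's example is cut off above; a minimal sketch of what such a selector could look like (the ".joboptions" file-name check and the class name are assumptions made for this sketch, not the answer's original code):

```java
import org.apache.tika.extractor.DocumentSelector;
import org.apache.tika.metadata.Metadata;

public class SkipAttachmentsSelector implements DocumentSelector {

    @Override
    public boolean select(Metadata metadata) {
        // Called once per embedded document; return false to skip parsing it.
        String name = metadata.get(Metadata.RESOURCE_NAME_KEY);
        if (name != null && name.endsWith(".joboptions")) {
            return false;  // exclude this attachment from extraction
        }
        return true;       // parse everything else
    }
}
```

It is registered before parsing with parseContext.set(DocumentSelector.class, new SkipAttachmentsSelector()); returning false unconditionally would skip all embedded documents.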

“java.lang.SecurityException: Prohibited package name: java.sql” error happens only when executing outside of Eclipse

萝らか妹 submitted on 2019-12-01 20:56:02
I am writing a topic modeling program that uses Apache Tika to extract the text contents from other file types. It actually runs perfectly in Eclipse, but when I export it to a JAR file to use from the command prompt on Windows 10, this error message appears when it reaches the line "parser.parse(stream, handler, metadata, parseContext);":

    java.lang.SecurityException: Prohibited package name: java.sql

I didn't upload my Java code here because I don't think it is the root of the problem, since it runs perfectly inside the Eclipse IDE. Does anyone know why it only happens when I try to run it from the command line?

Correct use of Apache Tika MediaType

痞子三分冷 submitted on 2019-12-01 14:23:55
I want to use Apache Tika's MediaType class to compare media types. I first use Tika to detect the MediaType, then I want to start an action according to the MediaType: if the MediaType is an XML type I want to do some action, and if it is a compressed file I want to start another action. My problem is that there are many XML types, so how do I check whether it is XML using the MediaType? Here is my previous (pre-Tika) implementation:

    if (contentType.contains("text/xml")
            || contentType.contains("application/xml")
            || contentType.contains("application/x-xml")
            || contentType.contains(
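One way to avoid enumerating every XML type is to combine Tika's MediaTypeRegistry with the "+xml" structured-syntax suffix. A minimal sketch; the helper name isXml is made up for this example, and which types the registry treats as specializations of application/xml depends on the tika-mimetypes definitions in your Tika version:

```java
import org.apache.tika.mime.MediaType;
import org.apache.tika.mime.MediaTypeRegistry;

public class MediaTypeChecks {

    private static final MediaTypeRegistry REGISTRY = MediaTypeRegistry.getDefaultRegistry();
    private static final MediaType TEXT_XML = MediaType.parse("text/xml");

    /** Hypothetical helper: true if the detected type is XML or an XML-based format. */
    public static boolean isXml(MediaType type) {
        MediaType base = type.getBaseType();   // drop charset and other parameters
        return base.equals(MediaType.APPLICATION_XML)
                || base.equals(TEXT_XML)
                || base.getSubtype().endsWith("+xml")                        // e.g. application/rss+xml
                || REGISTRY.isSpecializationOf(base, MediaType.APPLICATION_XML);
    }
}
```

After detection (for example MediaType mediaType = new AutoDetectParser().getDetector().detect(stream, metadata);) the check becomes a single if (isXml(mediaType)) branch instead of a chain of string comparisons.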

Apache Tika detects MIME type incorrectly for CSV

﹥>﹥吖頭↗ submitted on 2019-12-01 14:03:24
I've created a .csv file using Excel and wrote the following code using Apache Tika:

    public static boolean checkThatMimeTypeIsCsv(InputStream inputStream) throws IOException {
        BufferedInputStream bis = new BufferedInputStream(inputStream);
        AutoDetectParser parser = new AutoDetectParser();
        Detector detector = parser.getDetector();
        Metadata md = new Metadata();
        MediaType mediaType = detector.detect(bis, md);
        return "text/csv".equals(mediaType.toString());
    }

    public static void main(String[] args) throws IOException {
        System.out.println(checkThatMimeTypeIsCsv(new FileInputStream("Data.csv")));
    }

But
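A likely cause (the question is cut off here, so this is a hedged guess) is that content-based detection alone cannot reliably tell CSV apart from plain text, so Tika falls back to text/plain; giving the detector the original file name via the Metadata usually makes it report text/csv. A sketch of that variant, with the same hypothetical Data.csv input:

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.detect.Detector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AutoDetectParser;

public class CsvDetectionSketch {

    public static MediaType detectWithNameHint(InputStream in, String fileName) throws IOException {
        Detector detector = new AutoDetectParser().getDetector();
        Metadata md = new Metadata();
        // Pass the original file name as a hint; without it, CSV content usually
        // looks like plain text to the detector. (RESOURCE_NAME_KEY is the Tika 1.x
        // constant; newer versions use TikaCoreProperties.RESOURCE_NAME_KEY.)
        md.set(Metadata.RESOURCE_NAME_KEY, fileName);
        return detector.detect(new BufferedInputStream(in), md);
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = new FileInputStream("Data.csv")) {
            System.out.println(detectWithNameHint(in, "Data.csv"));  // typically text/csv
        }
    }
}
```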

Indexing all documents in a doc folder into Solr with FileListEntityProcessor

喜你入骨 submitted on 2019-12-01 11:39:57
http://wiki.apache.org/solr/ExtractingRequestHandler does not provide much information on how to configure this handler in a web application which has its own context and wants to use Solr's server features as embedded Solr. Can you please provide some information on how to upload the documents to Solr and search for some content from those documents? I have configured the DIH in solrConf.xml as

    <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
        <lst name="defaults">
            <str name="config">tika-data-config.xml</str>
        </lst>
    </requestHandler>

and tika-data