apache-tika

Index pdf file content using Apache Solr

此生再无相见时 submitted on 2019-12-01 11:06:08
I'm using Solr's PHP extension to interact with Apache Solr. I'm indexing data from the database, and I want to index the contents of external files (such as PDFs and PPTX) as well. The logic for indexing is as follows. Suppose schema.xml has the following fields defined:
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="created" type="tlong" indexed="true" stored="true" />
<field name="name" type="text_general" indexed="true" stored="true"/>
<field name="filepath" type="text_general" indexed="false" stored="true"/>
<field name="filecontent" type=

Indexing all documents in doc folder into Solr with FileListEntityProcessor

人走茶凉 submitted on 2019-12-01 10:22:10
Question: http://wiki.apache.org/solr/ExtractingRequestHandler does not provide much information on how to configure this handler in a web application that has its own context and wants to use Solr's server features as embedded Solr. Can you please provide some information on how to upload documents to Solr and search for content in those documents? I have configured DIH in solrConf.xml as follows: <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler"> <lst

Unable to extract scanned pdf using TesseractOCRConfig Apache Tika

有些话、适合烂在心里 submitted on 2019-12-01 09:07:30
My PDF contains scanned images and I want to extract text from it. What I tried: I tried AutoDetectParser but got no output. I followed the solution provided in "Apache Tika extract scanned PDF files" and also the Apache Tika Jira issue at https://issues.apache.org/jira/browse/TIKA-1729, but I get an empty string without any error. My configuration: Windows 7 64-bit, JDK 1.8.0_45. Any kind of help is welcome. Steps to follow to solve this: Install Tesseract on your system using 'tesseract-ocr-setup-3.05.00dev.exe' for Windows from https://sourceforge.net/projects/tesseract-ocr-alt/files/ and set its

Mimetype check using Tika jars

时光总嘲笑我的痴心妄想 submitted on 2019-12-01 08:09:41
I am developing a standalone Java batch process and trying to determine the MIME type of a file attachment using the Tika jars (Tika 1.4). My code looks like this:
Parser parser = new AutoDetectParser();
InputStream stream = new FileInputStream(fileAttachment);
int writerHandler = -1;
ContentHandler contentHandler = new BodyContentHandler(writerHandler);
Metadata metadata = new Metadata();
parser.parse(stream, contentHandler, metadata, new ParseContext());
String mimeType = metadata.get(Metadata.CONTENT_TYPE);
logger.debug("File Attachment: " + fileAttachment.getName() + " MimeType is: " + mimeType);
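The Java snippet in this question obtains the content type as a by-product of a full parse. Content-based detection itself usually needs only the first few bytes of a file. The sketch below illustrates that idea in Python with a tiny signature table. It is not Tika's detector and covers only a handful of formats; the names `MAGIC_SIGNATURES`, `sniff_mime_type`, and `sniff_file` are hypothetical and chosen for illustration.

```python
# Minimal sketch of content-based MIME detection via leading "magic" bytes.
# Assumption: this hand-written table stands in for a real detector such as
# Tika's, which recognises far more formats and also inspects containers.
MAGIC_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    # ZIP is also the container for .docx/.pptx; telling those apart would
    # require looking inside the archive, which this sketch does not do.
    (b"PK\x03\x04", "application/zip"),
]


def sniff_mime_type(data: bytes, default: str = "application/octet-stream") -> str:
    """Return the MIME type whose signature prefixes `data`, else `default`."""
    for signature, mime_type in MAGIC_SIGNATURES:
        if data.startswith(signature):
            return mime_type
    return default


def sniff_file(path: str) -> str:
    """Read only the first bytes of the file at `path` and sniff its type."""
    with open(path, "rb") as handle:
        return sniff_mime_type(handle.read(16))
```

Reading a short prefix instead of parsing the whole document keeps the check cheap, at the cost of reporting only the outer container type for ZIP-based formats.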

Extract Images from PDF with Apache Tika

三世轮回 submitted on 2019-12-01 01:07:33
Apache Tika 1.6 has the ability to extract inline images from PDF documents. However, I've been struggling to get it to work. My use case is that I want some code that will extract the content and, separately, the images from any document (not necessarily PDFs). This then gets passed into an Apache UIMA pipeline. I've been able to extract images from other document types by using a custom parser (built on an AutoParser) to convert the documents to HTML and then save the images out separately. When I try with PDFs, though, the image tags don't even appear in the HTML, let alone give me access to the

unable to configure Tika1.2 with solr4

痴心易碎 submitted on 2019-11-30 20:41:50
Question: I am trying to use TikaEntityProcessor to index the content of .html files, but I cannot get it to work. I checked the error log and found the following error:
SEVERE: Full Import failed:java.lang.RuntimeException:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to load EntityProcessor implementation for entity:tika-test Processing Document # 1
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:273)
at org.apache.solr.handler

PDFBox adding white spaces within words

社会主义新天地 submitted on 2019-11-30 17:56:36
When I try to extract text from my PDF files, it seems to randomly insert white spaces between several words. I am using pdfbox-app-1.6.0.jar (latest version) on the following sample file in the Downloads section of this page: http://www.sheffield.gov.uk/roads/children/parents/6-11/pedestrian-training I've tried several other PDF files and it seems to do the same on several pages. I run the following: java -jar pdfbox-app-1.6.0.jar ExtractText -force -console ~/Desktop/ped training pdf.pdf on the downloaded file, and you will see spaces wrongly inserted in the following parts of the console output: "•

How can I use the Tika package (https://github.com/chrismattmann/tika-python) in Python (2.7) to parse PDF files?

ⅰ亾dé卋堺 submitted on 2019-11-30 16:32:48
Question: I'm trying to parse a few PDF files that contain engineering drawings to obtain the text data in them. I tried using Tika as a jar from Python via the jnius package (following this tutorial: http://www.hackzine.org/using-apache-tika-from-python-with-jnius.html), but the code throws an error. Using the tika package, however, I was able to pass files in and parse them, but Python can only extract metadata; when asked to parse content, it returns "none". It is able

Convert .docx to HTML using JAVA

拟墨画扇 submitted on 2019-11-30 15:16:40
Question: I tried converting .doc to HTML using WordToHtmlConverter and it worked perfectly. But when I tried to convert .docx to HTML, I got stuck. What I tried: I used the code below, taken from "How to use Tika's XWPFWordExtractorDecorator class?", to convert .docx to HTML: InputStream input = TikaInputStream.get(new File("C:\\Users\\Downloads\\filename.docx")); Parser parser = new AutoDetectParser(); StringWriter sw = new StringWriter(); SAXTransformerFactory factory =

Elasticsearch Parse Exception error when attempting to index PDF

旧巷老猫 submitted on 2019-11-30 11:25:17
I'm just getting started with Elasticsearch. Our requirement is to index thousands of PDF files, and I'm having a hard time getting just ONE of them to index successfully. I installed the Attachment Type plugin and got the response: Installed mapper-attachments. I followed the Attachment Type in Action tutorial, but the process hangs and I don't know how to interpret the error message. I also tried the gist, which hangs in the same place. $ curl -X POST "localhost:9200/test/attachment/" -d json.file {"error":"ElasticSearchParseException[Failed to derive xcontent from (offset=0, length=9):
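One plausible reading of the `length=9` in this error: curl's `-d` option sends its argument as the literal request body unless the argument is prefixed with `@`, and the string `json.file` is exactly nine bytes long, so the server may have received the file name rather than a JSON document. The Python sketch below checks that byte count and shows one way a JSON body carrying a base64-encoded file could be built. The field name `file` and the helper `build_attachment_body` are assumptions for illustration, not taken from the question or confirmed against the plugin.

```python
import base64
import json


def build_attachment_body(raw_bytes: bytes, field_name: str = "file") -> str:
    """Build a JSON body with the file content base64-encoded.

    Assumption: the attachment mapping expects base64 text under a field
    called "file"; adjust `field_name` to match the actual mapping.
    """
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return json.dumps({field_name: encoded})


# The literal argument passed to `curl -d` in the question.
literal_argument = "json.file"
print(len(literal_argument.encode("utf-8")))  # 9, matching "length=9"

# Hypothetical stand-in for the bytes of a real PDF read from disk.
sample_pdf_bytes = b"%PDF-1.4 sample bytes"
body = build_attachment_body(sample_pdf_bytes)
print(body)
```

With a body like this saved to disk, the file would be passed to curl as `-d @json.file` so that its contents, not its name, are sent.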