apache-tika | 易学教程

Index pdf file content using Apache Solr

I'm using Solr's PHP extension to interact with Apache Solr. I'm indexing data from the database, and I want to index the contents of external files (such as PDF and PPTX) as well. The logic for indexing is as follows. Suppose schema.xml has the following fields defined:
    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
    <field name="created" type="tlong" indexed="true" stored="true" />
    <field name="name" type="text_general" indexed="true" stored="true"/>
    <field name="filepath" type="text_general" indexed="false" stored="true"/>
    <field name="filecontent" type=

Indexing all documents in a doc folder into Solr with FileListEntityProcessor

Question: http://wiki.apache.org/solr/ExtractingRequestHandler does not provide much information on how to configure this handler in a web application that has its own context and wants to use Solr's server features through embedded Solr. Can you please provide some information on how to upload documents to Solr and search for content in those documents? I have configured DIH in solrConf.xml as follows: <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler"> <lst

Unable to extract scanned pdf using TesseractOCRConfig Apache Tika

My PDF contains scanned images and I want to extract text from it. What I tried: I used AutoDetectParser but got no output. I followed the solution provided in "Apache Tika extract scanned PDF files" and the Apache Tika Jira issue at https://issues.apache.org/jira/browse/TIKA-1729, but I get an empty string without any error. My configuration: Windows 7 64-bit, JDK 1.8.0_45. Any kind of help is welcome. Steps to solve this: Install Tesseract on your system using 'tesseract-ocr-setup-3.05.00dev.exe' for Windows from https://sourceforge.net/projects/tesseract-ocr-alt/files/ and set its

Mimetype check using Tika jars

I am developing a standalone Java batch process. I am trying to determine the MIME type of a file attachment using the Tika JARs (Tika 1.4). My code looks like this:
    Parser parser = new AutoDetectParser();
    InputStream stream = new FileInputStream(fileAttachment);
    int writerHandler = -1;
    ContentHandler contentHandler = new BodyContentHandler(writerHandler);
    Metadata metadata = new Metadata();
    parser.parse(stream, contentHandler, metadata, new ParseContext());
    String mimeType = metadata.get(Metadata.CONTENT_TYPE);
    logger.debug("File Attachment: " + fileAttachment.getName() + " MimeType is: " + mimeType);
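When only the MIME type is needed, content-based detection can work from the first few bytes of a file instead of a full parse. Tika offers this through its detection API (for example the `detect` method on the `Tika` facade). The sketch below does not use Tika at all: it hand-rolls the same idea with the JDK alone, so the class `MagicSniffer`, its methods, and the tiny signature table are illustrative assumptions, not Tika's implementation.

```java
// Hypothetical helper: a hand-rolled magic-byte check, NOT Tika's detector.
class MagicSniffer {

    // Guesses a MIME type from the leading bytes of a file.
    static String sniff(byte[] head) {
        if (startsWith(head, new byte[] {'%', 'P', 'D', 'F', '-'})) {
            return "application/pdf";
        }
        // ZIP container signature; .docx/.pptx/.xlsx share it, and telling
        // them apart would require looking inside the archive.
        if (startsWith(head, new byte[] {'P', 'K', 3, 4})) {
            return "application/zip";
        }
        if (startsWith(head, new byte[] {(byte) 0x89, 'P', 'N', 'G'})) {
            return "image/png";
        }
        // Fallback for anything this small table does not recognize.
        return "application/octet-stream";
    }

    // True if data begins with the given prefix.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Reads at most 8 bytes from the file and sniffs them.
    static String sniffFile(java.nio.file.Path file) throws java.io.IOException {
        try (java.io.InputStream in = java.nio.file.Files.newInputStream(file)) {
            byte[] buf = new byte[8];
            int n = 0;
            while (n < buf.length) {
                int r = in.read(buf, n, buf.length - n);
                if (r < 0) {
                    break; // end of file reached before 8 bytes
                }
                n += r;
            }
            return sniff(java.util.Arrays.copyOf(buf, n));
        }
    }

    public static void main(String[] args) throws java.io.IOException {
        if (args.length > 0) {
            System.out.println(sniffFile(java.nio.file.Paths.get(args[0])));
        } else {
            byte[] sample = "%PDF-1.4".getBytes(java.nio.charset.StandardCharsets.US_ASCII);
            System.out.println(sniff(sample));
        }
    }
}
```

A real detector carries a far larger signature database and also weighs the file name, which is why a library is the better choice in production; the sketch only shows why detection does not need the whole document.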

Extract Images from PDF with Apache Tika

Apache Tika 1.6 has the ability to extract inline images from PDF documents. However, I've been struggling to get it to work. My use case is that I want some code that will extract the content and, separately, the images from any document (not necessarily a PDF). The result is then passed into an Apache UIMA pipeline. I've been able to extract images from other document types by using a custom parser (built on an AutoParser) to convert the documents to HTML and then save the images out separately. When I try with PDFs, though, the tags don't even appear in the HTML, let alone give me access to the

Unable to configure Tika 1.2 with Solr 4

Question: I am trying to use TikaEntityProcessor to index the content of .html files, but I cannot get it to work. I checked the error log and found the following error:
    SEVERE: Full Import failed:java.lang.RuntimeException:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to load EntityProcessor implementation for entity:tika-test Processing Document # 1
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:273)
        at org.apache.solr.handler

PDFBox adding white spaces within words

When I try to extract text from my PDF files, PDFBox seems to insert white space within several words at random. I am using pdfbox-app-1.6.0.jar (the latest version) on the sample file in the Downloads section of this page: http://www.sheffield.gov.uk/roads/children/parents/6-11/pedestrian-training I've tried several other PDF files and the same thing happens on several pages. I run the following on the downloaded file:
    java -jar pdfbox-app-1.6.0.jar ExtractText -force -console ~/Desktop/ped training pdf.pdf
In the console output you will see spaces wrongly inserted in the following: "•

How can I use the Tika package (https://github.com/chrismattmann/tika-python) in Python (2.7) to parse PDF files?

Question: I'm trying to parse a few PDF files that contain engineering drawings in order to obtain the text data in them. I tried using Tika as a JAR from Python through the jnius package (following this tutorial: http://www.hackzine.org/using-apache-tika-from-python-with-jnius.html), but the code throws an error. With the Tika package, however, I was able to pass files in and parse them, but Python only extracts metadata; when asked to parse content, it returns "none". It is able

Convert .docx to HTML using Java

Question: I tried converting .doc to HTML using WordToHtmlConverter and it worked perfectly. But when I tried to convert .docx to HTML, I got stuck. What I tried: I used the code below to convert .docx to HTML. It comes from "How to use Tika's XWPFWordExtractorDecorator class?":
    InputStream input = TikaInputStream.get(new File("C:\\Users\\Downloads\\filename.docx"));
    Parser parser = new AutoDetectParser();
    StringWriter sw = new StringWriter();
    SAXTransformerFactory factory =

Elasticsearch Parse Exception error when attempting to index PDF

I'm just getting started with Elasticsearch. Our requirement is to index thousands of PDF files, and I'm having a hard time getting just ONE of them to index successfully. I installed the Attachment Type plugin and got the response: Installed mapper-attachments. I followed the "Attachment Type in Action" tutorial, but the process hangs and I don't know how to interpret the error message. I also tried the gist, which hangs in the same place.
    $ curl -X POST "localhost:9200/test/attachment/" -d json.file
    {"error":"ElasticSearchParseException[Failed to derive xcontent from (offset=0, length=9):
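One thing worth checking in the command above: `-d json.file` makes curl send the literal nine-character string `json.file` as the request body, which matches the `length=9` in the error. To send the contents of the file, curl needs `-d @json.file`. The attachment type also expects the document bytes as a Base64 string inside the JSON. A minimal Java sketch of building such a body with the JDK alone follows; the field name `file` and the class and method names are assumptions for illustration and must match whatever the index mapping actually declares.

```java
// Hypothetical helper for illustration; not part of Elasticsearch or Tika.
class AttachmentBody {

    // Builds {"file":"<base64>"}; the field name "file" is an assumption.
    static String build(byte[] content) {
        String encoded = java.util.Base64.getEncoder().encodeToString(content);
        // Standard Base64 output uses only A-Z, a-z, 0-9, '+', '/' and '=',
        // so it can be placed in a JSON string without escaping.
        return "{\"file\":\"" + encoded + "\"}";
    }

    public static void main(String[] args) throws java.io.IOException {
        // With a path argument, encode that file; otherwise encode a sample.
        byte[] content = args.length > 0
                ? java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(args[0]))
                : "hello".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        System.out.println(build(content));
    }
}
```

Redirecting the program's output into `json.file` and posting it with `curl -X POST "localhost:9200/test/attachment/" -d @json.file` would then send real JSON rather than the file name. Reading the whole file into memory is acceptable for a single test document but would need rethinking for thousands of large PDFs.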