apache-tika

Content extraction of a PDF file in Solr using Apache Tika

ⅰ亾dé卋堺 submitted on 2019-12-07 09:53:48
Question: I am trying to index PDF files in Solr following this tutorial: http://wiki.apache.org/solr/ExtractingRequestHandler But every time I run the command java -jar post.jar *.pdf it fails with org.apache.solr.common.SolrException: Invalid UTF-8 middle byte 0xe3. Kindly help me index the PDFs into the Solr server. Is there any other integration besides Tika that could help me?

Answer 1: post.jar is just a utility that uploads files to Solr. Solr uses the extract handler for rich documents, so you need to provide its URL, e.g. java -Durl=http://localhost:8983/solr/update/extract?literal.id=1 -Dtype=application/pdf -jar post.jar
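
The same upload can also be done directly over HTTP, without post.jar. Below is a minimal Python sketch of a request against the /update/extract endpoint mentioned in the answer; the local Solr URL, document id, and function name are assumptions for illustration:

```python
import urllib.parse
import urllib.request

def build_extract_request(solr_base, doc_id, pdf_bytes):
    # POST raw PDF bytes to Solr's ExtractingRequestHandler.
    # literal.id supplies the unique key; commit=true makes it visible.
    url = solr_base.rstrip("/") + "/update/extract?" + urllib.parse.urlencode(
        {"literal.id": doc_id, "commit": "true"})
    return urllib.request.Request(
        url, data=pdf_bytes, headers={"Content-Type": "application/pdf"})

# Sending (requires a running Solr):
# with open("doc.pdf", "rb") as f:
#     urllib.request.urlopen(build_extract_request(
#         "http://localhost:8983/solr", "1", f.read()))
```

Letting Solr receive the bytes with an explicit Content-Type sidesteps post.jar's guessing, which is often the source of the UTF-8 error when a binary PDF is sent as text.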

How do I index rich-format documents contained as database BLOBs with Solr 4.0+?

落爺英雄遲暮 submitted on 2019-12-07 08:47:34
Question: I've found a few related solutions to this problem, but as I'll explain, they won't work for me. (I'm using Solr 4.0 and indexing data stored in an Oracle 11g database.) Jonck van der Kogel's related solution (from 2009) is explained here. He describes creating a custom Transformer, somewhat like the ClobTransformer that ships with Solr. This goes down the elegant path but does not use Tika, which is now integrated with Solr. (He uses the external PDFBox and FontBox.) This

Is Apache Tika able to extract foreign languages like Chinese, Japanese?

◇◆丶佛笑我妖孽 submitted on 2019-12-07 06:42:55
Question: Is Apache Tika able to extract foreign languages like Chinese and Japanese? I have the following code: Detector detector = new DefaultDetector(); Parser parser = new AutoDetectParser(detector); InputStream stream = new ByteArrayInputStream(bytes); OutputStream outputstream = new ByteArrayOutputStream(); ContentHandler textHandler = new BodyContentHandler(outputstream); Metadata metadata = new Metadata(); // Set<String> langs = LanguageIdentifier.getSupportedLanguages(); // metadata.set(Metadata

Solr : data import handler and solr cell

柔情痞子 submitted on 2019-12-06 15:58:50
Question: Is it possible to index rich documents (PDF, Office, ...) with the data import handler using Solr Cell? I use Solr 3.2. Thanks. Answer 1: Solr Cell, aka ExtractingRequestHandler, uses Apache Tika behind the scenes, and the latter can easily be integrated into a DataImportHandler: <dataConfig> <!-- use any of type DataSource<InputStream> --> <dataSource type="BinURLDataSource"/> <document> <!-- The value of format can be text|xml|html|none. This is the format in which the body is emitted (the 'text' field)
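
For reference, a complete minimal dataConfig along these lines might look as follows; the entity name, url, and field mapping here are assumptions for illustration, not part of the original answer:

```xml
<dataConfig>
  <!-- use any DataSource of type DataSource<InputStream> -->
  <dataSource type="BinURLDataSource"/>
  <document>
    <!-- format="text" emits the extracted body into the 'text' column -->
    <entity name="doc" processor="TikaEntityProcessor"
            url="http://localhost/files/sample.pdf" format="text">
      <field column="text" name="content"/>
    </entity>
  </document>
</dataConfig>
```

The TikaEntityProcessor hands the binary stream from the dataSource to Tika and exposes the extracted body and metadata as columns that can be mapped to index fields.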

python send file to tika running as a service

梦想的初衷 submitted on 2019-12-06 12:38:52
Question: With reference to this question, I would like to send an MS Word (.doc) file to a Tika application running as a service. How can I do this? There is this link for running Tika: http://mimi.kaktusteam.de/blog-posts/2013/02/running-apache-tika-in-server-mode/ But for the Python code to access it, I am not sure whether I can use sockets or urllib or what exactly. Answer 1: For remote access to Tika, there are basically two methods available. One is the Tika JAXRS Server, which provides a full RESTful interface.
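
As a concrete illustration of the JAXRS route, here is a standard-library-only Python sketch; it assumes the Tika server is listening on its default port 9998 and uses the /tika resource, which returns extracted plain text for a PUT of the raw document bytes:

```python
import urllib.request

def build_tika_request(server_url, file_bytes, content_type="application/msword"):
    # Tika's JAXRS server extracts text from raw bytes PUT to /tika;
    # the Accept header asks for a plain-text rendition of the body.
    return urllib.request.Request(
        server_url.rstrip("/") + "/tika",
        data=file_bytes,
        method="PUT",
        headers={"Content-Type": content_type, "Accept": "text/plain"},
    )

# Sending the request (requires a running Tika server):
# with open("report.doc", "rb") as f:
#     req = build_tika_request("http://localhost:9998", f.read())
# text = urllib.request.urlopen(req).read().decode("utf-8")
```

So plain urllib is enough; no raw sockets are needed.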

Is Tika compatible with android?

孤者浪人 submitted on 2019-12-06 11:03:30
Question: I have seen the 1.0 release of Apache Tika, which greatly eases metadata extraction in Java, and I'm wondering if it can be used on Android. Answer 1: I'd suspect you should be fine to port the core of Tika to Android. However, you're likely to have issues with a lot of Tika's dependencies, so many of the parsers won't work. For example, one of the dependencies of Apache Tika is Apache POI. People have tried to compile POI for Android, but have hit issues with the method limit that Android

How to process/extract .pst using hadoop Map reduce

廉价感情. submitted on 2019-12-06 06:40:33
Question: I am using MAPI tools (a Microsoft library, in .NET) and then the Apache Tika libraries to process and extract PSTs from an Exchange server, which is not scalable. How can I process/extract PSTs the MapReduce way? Is there any tool or library available in Java which I can use in my MR jobs? Any help would be greatly appreciated. The JPST library internally uses: PstFile pstFile = new PstFile(java.io.File) And the problem is that in the Hadoop APIs we don't have anything close to java.io.File. The following option is always there but not efficient: File tempFile = File.createTempFile("myfile", ".tmp"); fs.moveToLocalFile(new

Classpath issues running Tika on Spark

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-06 02:50:00
Question: I am trying to process a bunch of files with Tika. The number of files is in the thousands, so I decided to build an RDD of files and let Spark distribute the workload. Unfortunately I get multiple NoClassDefFoundError exceptions. This is my sbt file:

name := "TikaFileParser"
version := "0.1"
scalaVersion := "2.11.7"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1" % "provided"
libraryDependencies += "org.apache.tika" % "tika-core" % "1.11"
libraryDependencies += "org.apache.tika" % "tika-parsers" % "1.11"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.7.1" % "provided"
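
A common cause of NoClassDefFoundError in this setup is that tika-parsers and its many transitive dependencies never reach the executors' classpath, and the usual remedy is to submit a fat jar built with sbt-assembly. A hedged sketch follows; the plugin version and merge rules are assumptions to adapt to the actual build:

```scala
// project/assembly.sbt — add the sbt-assembly plugin
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

// build.sbt additions — tika-parsers drags in many jars with
// overlapping META-INF entries, so assembly needs a merge strategy.
// Then pass the resulting fat jar to spark-submit instead of the thin jar.
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _                             => MergeStrategy.first
}
```

Keeping spark-core and hadoop-client marked "provided", as the sbt file above already does, prevents them from bloating the assembly since the cluster supplies them.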

Solr ExtractingRequestHandler giving empty content for pdf documents

风格不统一 submitted on 2019-12-05 18:54:05
Question: I am using the ExtractingRequestHandler in Solr to get document content and index it. It works fine for all Microsoft documents, but for PDFs the extracted content is empty. I have also tried extractOnly=true with curl, and that also returns just an empty body. I have used Tika independently on the same documents and it extracts the content just fine. The difference is that when running independently I am using the BodyContentHandler that comes with Tika instead of the SolrContentHandler used by Solr. Has anybody seen this? I would really rather let Solr handle it than me using Tika to

Searching attachments from a Rails app (Word, PDF, Excel etc)

匆匆过客 submitted on 2019-12-05 16:26:33
Question: My first post to Stack Overflow, so be gentle please! I am about to start a new Ruby on Rails (3.1) project for a client. One of their requirements is a search engine, which will be indexing roughly 2,000 documents that are a mixture of PDF, Word, Excel and HTML. I had hoped to use either thinking-sphinx or Texticle (the most popular at https://www.ruby-toolbox.com/categories/rails_search.html ), but as I understand it: Texticle requires PostgreSQL, and I'm on MySQL. thinking-sphinx doesn't index files on the file system. Even if I saved my attachments into the database, thinking-sphinx