apache-tika

Ignoring Header/Footer text when using TIKA

Question: I'm using IKVM in order to use the Tika library in a .NET application. I'm able to extract text, but now I want to tell Tika that I do NOT want the header/footer information. Tika issue TIKA-906 shows that the latest version now includes the header/footer text, but does not show how to exclude it. I'm pretty much using the same code outlined here. Any help would be greatly appreciated.
Source: https://stackoverflow.com/questions/16862346/ignoring-header-footer-text-when-using-tika
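A minimal sketch of one approach with a newer Tika (roughly 1.15 onward, so it postdates the question): pass an OfficeParserConfig through the ParseContext and switch headers/footers off. The setIncludeHeadersAndFooters setting is an assumption here and should be verified against the Tika version bundled through IKVM; it also only affects the Microsoft Office parsers, not every format.

    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.microsoft.OfficeParserConfig;
    import org.apache.tika.sax.BodyContentHandler;

    public class NoHeaderFooterExtractor {
        public static String extract(InputStream stream) throws Exception {
            OfficeParserConfig config = new OfficeParserConfig();
            // Assumed setting (verify in your Tika version): skip header/footer runs
            // when parsing Office documents.
            config.setIncludeHeadersAndFooters(false);

            ParseContext context = new ParseContext();
            context.set(OfficeParserConfig.class, config);

            BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no output limit
            new AutoDetectParser().parse(stream, handler, new Metadata(), context);
            return handler.toString();
        }
    }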

Get Filename from Byte Array

Question: We can extract the MIME type from a byte array, e.g., by using Apache Tika. Is it possible to get the filename from a byte array?
Answer 1: No. You can take a guess at a MIME type from the content data itself, but the filename is not in there.
Answer 2: The header field that you may be looking for is called Content-Disposition. If you're downloading an attachment, then there may be a file name in that field: Content-Disposition: attachment;filename=abc.txt. But there's no guarantee that you'll have such a file…
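To make the split concrete, here is a small sketch (not from the original answers): the MIME type really can be sniffed from the bytes with Tika, while the filename has to come from transport metadata such as a Content-Disposition header. The class name, method names, and regex below are illustrative.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.apache.tika.Tika;

    public class NameAndType {
        // The MIME type can be guessed from the bytes themselves.
        public static String detectMimeType(byte[] data) {
            return new Tika().detect(data);
        }

        // The filename cannot: it must come from transport metadata, e.g. an HTTP
        // header such as: Content-Disposition: attachment;filename=abc.txt
        public static String filenameFromContentDisposition(String headerValue) {
            Matcher m = Pattern.compile("filename=\"?([^\";]+)\"?").matcher(headerValue);
            return m.find() ? m.group(1) : null;
        }
    }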

Spark 2.x + Tika: java.lang.NoSuchMethodError: org.apache.commons.compress.archivers.ArchiveStreamFactory.detect

Question: I am trying to resolve a spark-submit classpath runtime issue for an Apache Tika (> v1.14) parsing job. The problem seems to involve the spark-submit classpath vs. my uber-jar.
Platforms: CDH 5.15 (Spark 2.3 added via the CDH docs) and CDH 6 (Spark 2.2 bundled in CDH 6).
I've tried / reviewed:
(Cloudera) Where does spark-submit look for Jar files?
(stackoverflow) resolving-dependency-problems-in-apache-spark
(stackoverflow) Apache Tika ArchiveStreamFactory.detect error
Highlights: Java 8 / Scala 2.11. I…
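A NoSuchMethodError on ArchiveStreamFactory.detect usually means an older commons-compress supplied by the cluster is shadowing the version Tika needs. A small diagnostic (hypothetical, not from the original post) run inside the Spark job shows which jar actually wins at runtime:

    import org.apache.commons.compress.archivers.ArchiveStreamFactory;

    public class ClasspathProbe {
        public static void main(String[] args) {
            // Prints the jar the JVM loaded commons-compress from.
            System.out.println(ArchiveStreamFactory.class
                    .getProtectionDomain()
                    .getCodeSource()
                    .getLocation());
        }
    }

If it points at a cluster-provided jar rather than the uber-jar, the usual remedies are shading/relocating commons-compress inside the uber-jar or setting spark.driver.userClassPathFirst=true and spark.executor.userClassPathFirst=true, though those options are worth verifying against the specific CDH version.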

Trying to port Tika 1.0 to Android in Eclipse: error messages referencing pom.xml

Question: I am trying to port the Tika 1.0 core and parsers source code to Android in Eclipse and am having problems. Here's what I did:
Downloaded the Tika 1.0 source
Opened the core and parsers sub-projects in Eclipse using the Maven plugin
Exported both into their respective JARs
Copied the JAR files into the libs folder of a "wrapper" Android project that I want to use to test Tika's capabilities on a 4.0 device
Cleaned and rebuilt the project
When I tried to launch it on a device, I got this error: Error generating…

Tika in Action book examples: Lucene StandardAnalyzer does not work

Question: First of all, I am a total noob when it comes to Tika and Lucene. I am working through the Tika in Action book, trying out the examples. In chapter 5 this example is given:

    package tikatest01;

    import java.io.File;
    import org.apache.tika.Tika;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.Field.Index;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.index.IndexWriter;

    public class LuceneIndexer {
    …
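The book targets Lucene 3.x; Field.Index and the Field(String, String, Store, Index) constructor were removed in later Lucene releases, which is the usual reason this example no longer compiles. Below is a minimal sketch of the same indexer against a newer Lucene (5+ assumed) with Tika doing the text extraction; the class name and the "filename"/"contents" field names are illustrative, not from the book.

    import java.io.File;
    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.tika.Tika;

    public class LuceneIndexerModern {
        public static void main(String[] args) throws Exception {
            Tika tika = new Tika();
            // args[0] = index directory, remaining args = files to index
            try (IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get(args[0])),
                    new IndexWriterConfig(new StandardAnalyzer()))) {
                for (int i = 1; i < args.length; i++) {
                    File file = new File(args[i]);
                    Document doc = new Document();
                    // StringField: stored as-is; TextField: tokenized full text
                    doc.add(new StringField("filename", file.getName(), Store.YES));
                    doc.add(new TextField("contents", tika.parseToString(file), Store.NO));
                    writer.addDocument(doc);
                }
            }
        }
    }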

Using POI or Tika to extract text, stream-to-stream without loading the entire file in memory

Question: I'm trying to use either Apache POI and PDFBox by themselves, or within the context of Apache Tika, to extract and process plain text from MASSIVE Microsoft Office and PDF files (i.e., hundreds of megabytes in some cases). Also, my application is multi-threaded, so I will be parsing many of these large files concurrently. At that scale, I MUST work with the files in a streaming manner. It's not an option to hold an entire file in main memory at any step along the way. I have seen many source code…
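On the Tika side, the output half of the pipeline can be kept streaming by handing the parser a ContentHandler backed by a Writer, so extracted characters are written out as SAX events arrive instead of being collected into a String. A minimal sketch (class name illustrative); note, as a caveat, that some underlying parsers (POI for OLE2 formats, PDFBox for PDFs) may still buffer large parts of the source document internally, so this only bounds the extracted-text side.

    import java.io.InputStream;
    import java.io.Writer;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class StreamingTextExtractor {
        // Extracted text is streamed to 'out' rather than accumulated in memory.
        public static void extract(InputStream in, Writer out) throws Exception {
            AutoDetectParser parser = new AutoDetectParser();
            parser.parse(in, new BodyContentHandler(out), new Metadata(), new ParseContext());
        }
    }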

How to parse octet-stream files using Apache Tika?

Question: I have stored all different types of files on Azure Blob storage; the files are txt, doc, pdf, etc. However, all the files are stored as 'octet-stream' there, and when I open the files to extract the text from them using Tika, Tika can't detect the character encoding. How can I get around this problem?

    FileSystem fs = FileSystem.get(new Configuration());
    Path pt = new Path(Configs.BLOBSTORAGEPREFIX + fileAdd);
    InputStream stream = fs.open(pt);
    AutoDetectParser parser = new AutoDetectParser();
    …
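One sketch of a workaround (not from the original post): pass the blob's original filename to Tika as a detection hint through the Metadata object, so detection is not stuck on the generic octet-stream label reported by the storage layer. Metadata.RESOURCE_NAME_KEY is the long-standing constant for this; newer Tika versions prefer TikaCoreProperties.RESOURCE_NAME_KEY. The class and parameter names below are illustrative.

    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class HintedExtractor {
        public static String extract(InputStream stream, String originalName) throws Exception {
            Metadata metadata = new Metadata();
            // Filename hint so the detector can use the extension, not just the bytes.
            metadata.set(Metadata.RESOURCE_NAME_KEY, originalName);
            BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no output limit
            new AutoDetectParser().parse(stream, handler, metadata, new ParseContext());
            return handler.toString();
        }
    }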

Gradle, Tika - Exclude some dependency packages making a “fat jar” too fat

Question: I'm making an app which creates Lucene indices on a handful of well-known document formats (.docx, .odt, .txt, etc.). Tika is ideal for extracting the text, but it appears to be the culprit in making my fat jar balloon to 62 MB. To make the fat jar I'm doing this in my build.gradle:

    buildscript {
        repositories { jcenter() }
        dependencies {
            // fatjar
            classpath 'com.github.jengelman.gradle.plugins:shadow:1.2.4'
        }
    }

    apply plugin: 'com.github.johnrengelman.shadow'

    shadowJar {
        baseName = project…

422 Tika server response? Tika-Python

Question: I have been trying to get Apache Tika to work with this Python package: https://github.com/chrismattmann/tika-python
I have the following code in my Python program:

    #!/usr/bin/env python
    import tika
    tika.initVM()
    from tika import parser
    parsed = parser.from_file('pdf/myPdf.pdf')

But I get a 422 response every time:

    [MainThread ] [WARNI] Failed to see startup log message; retrying...
    [MainThread ] [WARNI] Tika server returned status: 422

Apache Tika does work when I use the following command: …

java.lang.NoClassDefFoundError: Could not initialize class org.apache.pdfbox.pdmodel.font.PDFont

Question: I am using Apache Tika (tika-app 1.17) in WildFly modules. When I start extracting a PDF, it always throws the error:

    java.lang.NoClassDefFoundError: Could not initialize class org.apache.pdfbox.pdmodel.font.PDFont

For other file extensions it works fine. Things I have tried:
Added dependencies in the apache-tika module.xml to PDFBox
Explicitly loaded org.apache.pdfbox from standalone.xml
I have also tried with the below structure: app1.war->(WEB-INF)lib-->app.jar->lib-->tika-app-1.17.jar
I have also…
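"Could not initialize class" means the static initializer of PDFont already failed once earlier, typically with an ExceptionInInitializerError caused by a missing or conflicting PDFBox/fontbox dependency on the WildFly module path, and later uses only see this masked error. A small diagnostic (hypothetical, not from the post) forces the class to initialize in isolation so the original root cause shows up in the log:

    public class PdfontProbe {
        public static void main(String[] args) {
            try {
                Class.forName("org.apache.pdfbox.pdmodel.font.PDFont");
                System.out.println("PDFont initialized fine");
            } catch (Throwable t) {
                // Usually points at a missing fontbox/commons-logging jar or a
                // duplicate PDFBox loaded from a different module.
                t.printStackTrace();
            }
        }
    }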