apache-tika

Ignoring Header/Footer text when using TIKA

Question: I'm using IKVM in order to use the Tika library in a .NET application. I'm able to extract text, but now I want to tell Tika that I do NOT want the header/footer information. Tika issue TIKA-906 shows that the latest version now includes the header/footer text, but does not show how to exclude it. I'm pretty much using the same code outlined here. Any help would be greatly appreciated.
Source: https://stackoverflow.com/questions/16862346/ignoring-header-footer-text-when-using-tika
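A minimal sketch of one approach with a newer Tika (roughly 1.15 onward, so it postdates the question): pass an OfficeParserConfig through the ParseContext and switch headers/footers off. The setIncludeHeadersAndFooters setting is an assumption here and should be verified against the Tika version bundled through IKVM; it also only affects the Microsoft Office parsers, not every format.

    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.microsoft.OfficeParserConfig;
    import org.apache.tika.sax.BodyContentHandler;

    public class NoHeaderFooterExtractor {
        public static String extract(InputStream stream) throws Exception {
            OfficeParserConfig config = new OfficeParserConfig();
            // Assumed setting (verify in your Tika version): skip header/footer runs
            // when parsing Office documents.
            config.setIncludeHeadersAndFooters(false);

            ParseContext context = new ParseContext();
            context.set(OfficeParserConfig.class, config);

            BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no output limit
            new AutoDetectParser().parse(stream, handler, new Metadata(), context);
            return handler.toString();
        }
    }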

Get Filename from Byte Array

Question: We can extract the MIME type from a byte array, e.g., by using Apache Tika. Is it possible to get the filename from a byte array?
Answer 1: No. You can take a guess at a MIME type from the content data itself, but the filename is not in there.
Answer 2: The header field that you may be looking for is called Content-Disposition. If you're downloading an attachment, then there may be a file name in that field: Content-Disposition: attachment;filename=abc.txt. But there's no guarantee that you'll have such a file…
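To make the split concrete, here is a small sketch (not from the original answers): the MIME type really can be sniffed from the bytes with Tika, while the filename has to come from transport metadata such as a Content-Disposition header. The class name, method names, and regex below are illustrative.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.apache.tika.Tika;

    public class NameAndType {
        // The MIME type can be guessed from the bytes themselves.
        public static String detectMimeType(byte[] data) {
            return new Tika().detect(data);
        }

        // The filename cannot: it must come from transport metadata, e.g. an HTTP
        // header such as: Content-Disposition: attachment;filename=abc.txt
        public static String filenameFromContentDisposition(String headerValue) {
            Matcher m = Pattern.compile("filename=\"?([^\";]+)\"?").matcher(headerValue);
            return m.find() ? m.group(1) : null;
        }
    }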

Spark 2.x + Tika: java.lang.NoSuchMethodError: org.apache.commons.compress.archivers.ArchiveStreamFactory.detect

Question: I am trying to resolve a spark-submit classpath runtime issue for an Apache Tika (> v1.14) parsing job. The problem seems to involve the spark-submit classpath vs. my uber-jar.
Platforms: CDH 5.15 (Spark 2.3 added via the CDH docs) and CDH 6 (Spark 2.2 bundled in CDH 6).
I've tried / reviewed:
(Cloudera) Where does spark-submit look for Jar files?
(stackoverflow) resolving-dependency-problems-in-apache-spark
(stackoverflow) Apache Tika ArchiveStreamFactory.detect error
Highlights: Java 8 / Scala 2.11. I…
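A NoSuchMethodError on ArchiveStreamFactory.detect usually means an older commons-compress supplied by the cluster is shadowing the version Tika needs. A small diagnostic (hypothetical, not from the original post) run inside the Spark job shows which jar actually wins at runtime:

    import org.apache.commons.compress.archivers.ArchiveStreamFactory;

    public class ClasspathProbe {
        public static void main(String[] args) {
            // Prints the jar the JVM loaded commons-compress from.
            System.out.println(ArchiveStreamFactory.class
                    .getProtectionDomain()
                    .getCodeSource()
                    .getLocation());
        }
    }

If it points at a cluster-provided jar rather than the uber-jar, the usual remedies are shading/relocating commons-compress inside the uber-jar or setting spark.driver.userClassPathFirst=true and spark.executor.userClassPathFirst=true, though those options are worth verifying against the specific CDH version.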

Trying to port Tika 1.0 to Android in Eclipse: error messages referencing pom.xml

Question: I am trying to port the Tika 1.0 core and parsers source code to Android in Eclipse and am having problems. Here's what I did:
Downloaded the Tika 1.0 source
Opened the core and parsers sub-projects in Eclipse using the Maven plugin
Exported both into their respective JARs
Copied the JAR files into the libs folder of a "wrapper" Android project that I want to use to test Tika's capabilities on a 4.0 device
Cleaned and rebuilt the project
When I tried to launch it on a device, I got this error: Error generating…

Tika in Action book examples: Lucene StandardAnalyzer does not work

Question: First of all, I am a total noob when it comes to Tika and Lucene. I am working through the Tika in Action book, trying out the examples. In chapter 5 this example is given:

    package tikatest01;

    import java.io.File;
    import org.apache.tika.Tika;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.Field.Index;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.index.IndexWriter;

    public class LuceneIndexer {
    …
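The book targets Lucene 3.x; Field.Index and the Field(String, String, Store, Index) constructor were removed in later Lucene releases, which is the usual reason this example no longer compiles. Below is a minimal sketch of the same indexer against a newer Lucene (5+ assumed) with Tika doing the text extraction; the class name and the "filename"/"contents" field names are illustrative, not from the book.

    import java.io.File;
    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.tika.Tika;

    public class LuceneIndexerModern {
        public static void main(String[] args) throws Exception {
            Tika tika = new Tika();
            // args[0] = index directory, remaining args = files to index
            try (IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get(args[0])),
                    new IndexWriterConfig(new StandardAnalyzer()))) {
                for (int i = 1; i < args.length; i++) {
                    File file = new File(args[i]);
                    Document doc = new Document();
                    // StringField: stored as-is; TextField: tokenized full text
                    doc.add(new StringField("filename", file.getName(), Store.YES));
                    doc.add(new TextField("contents", tika.parseToString(file), Store.NO));
                    writer.addDocument(doc);
                }
            }
        }
    }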

Using POI or Tika to extract text, stream-to-stream without loading the entire file in memory

Question: I'm trying to use either Apache POI and PDFBox by themselves, or within the context of Apache Tika, to extract and process plain text from MASSIVE Microsoft Office and PDF files (i.e., hundreds of megabytes in some cases). Also, my application is multi-threaded, so I will be parsing many of these large files concurrently. At that scale, I MUST work with the files in a streaming manner. It's not an option to hold an entire file in main memory at any step along the way. I have seen many source code…
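On the Tika side, the output half of the pipeline can be kept streaming by handing the parser a ContentHandler backed by a Writer, so extracted characters are written out as SAX events arrive instead of being collected into a String. A minimal sketch (class name illustrative); note, as a caveat, that some underlying parsers (POI for OLE2 formats, PDFBox for PDFs) may still buffer large parts of the source document internally, so this only bounds the extracted-text side.

    import java.io.InputStream;
    import java.io.Writer;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class StreamingTextExtractor {
        // Extracted text is streamed to 'out' rather than accumulated in memory.
        public static void extract(InputStream in, Writer out) throws Exception {
            AutoDetectParser parser = new AutoDetectParser();
            parser.parse(in, new BodyContentHandler(out), new Metadata(), new ParseContext());
        }
    }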

How to parse octet-stream files using Apache Tika?

Question: I have stored all different types of files on Azure Blob storage; the files are txt, doc, pdf, etc. However, all the files are stored as 'octet-stream' there, and when I open the files to extract the text from them using Tika, Tika can't detect the character encoding. How can I get around this problem?

    FileSystem fs = FileSystem.get(new Configuration());
    Path pt = new Path(Configs.BLOBSTORAGEPREFIX + fileAdd);
    InputStream stream = fs.open(pt);
    AutoDetectParser parser = new AutoDetectParser();
    …
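One sketch of a workaround (not from the original post): pass the blob's original filename to Tika as a detection hint through the Metadata object, so detection is not stuck on the generic octet-stream label reported by the storage layer. Metadata.RESOURCE_NAME_KEY is the long-standing constant for this; newer Tika versions prefer TikaCoreProperties.RESOURCE_NAME_KEY. The class and parameter names below are illustrative.

    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class HintedExtractor {
        public static String extract(InputStream stream, String originalName) throws Exception {
            Metadata metadata = new Metadata();
            // Filename hint so the detector can use the extension, not just the bytes.
            metadata.set(Metadata.RESOURCE_NAME_KEY, originalName);
            BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no output limit
            new AutoDetectParser().parse(stream, handler, metadata, new ParseContext());
            return handler.toString();
        }
    }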

Gradle, Tika - Exclude some dependency packages making a “fat jar” too fat

Question: I'm making an app which creates Lucene indices on a handful of well-known document formats (.docx, .odt, .txt, etc.). Tika is ideal for extracting the text, but it appears to be the culprit in making my fat jar balloon to 62 MB. To make the fat jar I'm doing this in my build.gradle:

    buildscript {
        repositories { jcenter() }
        dependencies {
            // fatjar
            classpath 'com.github.jengelman.gradle.plugins:shadow:1.2.4'
        }
    }

    apply plugin: 'com.github.johnrengelman.shadow'

    shadowJar {
        baseName = project…

422 Tika server response? Tika-Python

Question: I have been trying to get Apache Tika to work with this Python package: https://github.com/chrismattmann/tika-python
I have the following code in my Python program:

    #!/usr/bin/env python
    import tika
    tika.initVM()
    from tika import parser
    parsed = parser.from_file('pdf/myPdf.pdf')

But I get a 422 response every time:

    [MainThread ] [WARNI] Failed to see startup log message; retrying...
    [MainThread ] [WARNI] Tika server returned status: 422

Apache Tika does work when I use the following command: …

java.lang.NoClassDefFoundError: Could not initialize class org.apache.pdfbox.pdmodel.font.PDFont

Question: I am using Apache Tika (tika-app 1.17) in WildFly modules. When I start extracting a PDF, it always throws the error:

    java.lang.NoClassDefFoundError: Could not initialize class org.apache.pdfbox.pdmodel.font.PDFont

For other file extensions it works fine. Things I have tried:
Added dependencies in the apache-tika module.xml to PDFBox
Explicitly loaded org.apache.pdfbox from standalone.xml
I have also tried with the below structure: app1.war->(WEB-INF)lib-->app.jar->lib-->tika-app-1.17.jar
I have also…
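"Could not initialize class" means the static initializer of PDFont already failed once earlier, typically with an ExceptionInInitializerError caused by a missing or conflicting PDFBox/fontbox dependency on the WildFly module path, and later uses only see this masked error. A small diagnostic (hypothetical, not from the post) forces the class to initialize in isolation so the original root cause shows up in the log:

    public class PdfontProbe {
        public static void main(String[] args) {
            try {
                Class.forName("org.apache.pdfbox.pdmodel.font.PDFont");
                System.out.println("PDFont initialized fine");
            } catch (Throwable t) {
                // Usually points at a missing fontbox/commons-logging jar or a
                // duplicate PDFBox loaded from a different module.
                t.printStackTrace();
            }
        }
    }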