apache-tika

How to get the text content of files with Tika 1.6?

 ̄綄美尐妖づ posted on 2019-12-11 13:29:58
Question: Hi, I am trying to get the text content from files of any type in this list: pdf, txt, doc, docx and odt. The implementation with Tika previously worked fine, but now it is broken. The code is:

```
public void uploadFile(FileUploadEvent event) throws Exception {
    UploadedFile file = event.getUploadedFile();
    byte[] data = file.getData();
    Tika tika = new Tika();
    string = tika.parseToString(new ByteArrayInputStream(data));
    ...
}
```

Any ideas? Is this a bad implementation?

Answer 1: You need to add tika-parsers. For example with

How to parse arabic pdf with Tika

二次信任 posted on 2019-12-11 08:23:32
Question: I've installed Tika with Solr, and it's working well for arabic pdf. Is there any tutorial to make this happen? I've seen a similar question to this, and the solution was to include ICU4J.jar, but I don't know what that means.

Answer 1: ICU4J can be downloaded here: http://site.icu-project.org/download

Source: https://stackoverflow.com/questions/10076959/how-to-parse-arabic-pdf-with-tika

Using Apache Tika 1.9 in Netbeans 8.0.2 and Java 8 produces HUGE executable. What to do to reduce size?

我们两清 posted on 2019-12-11 05:17:11
Question: I haven't had much luck with external libraries, so I've just included the source for utilities in any project that uses utilities. Now I have a project that requires Apache Tika, so I have to have a library setup something like this: But to run the program from outside NetBeans, I apparently (per readme.txt in the dist folder) need to zip the .jar and lib folder, unzip that zipped file, extract the contents, and run from wherever it's extracted to. But the Tika lib is 45MB. I only use 5 objects

Solr Index PDF documents and post them to a remote server

笑着哭i posted on 2019-12-11 05:06:50
Question: Hi, I am a naive user when it comes to Solr. Please guide me on the following hurdles.

1) Solr Index PDF documents. Solution tried: I used tika-app 0.9.jar to extract the content from the input PDF files to a text file. Now I am trying to write Java code to index the documents to Solr.

2) Post them to a remote server. I need to post either the documents or the index to a central remote server. Can the curl command be used for this?

Regards, Balaji.

Answer 1: 1) Solr Index PDF documents - I believe Solr does

unable to run java command from cgi

倾然丶 夕夏残阳落幕 posted on 2019-12-11 03:44:08
Question: I have this function to read a doc file using Tika on Linux:

    def read_doc(doc_path):
        output_path=doc_path+'.txt'
        java_path='/home/jdk1.7.0_17/jre/bin/'
        environ = os.environ.copy()
        environ['JAVA_HOME'] =java_path
        environ['PATH'] =java_path
        tika_path=java_path+'tika-app-1.3.jar'
        shell_command='java -jar %s --text --encoding=utf-8 "%s" >"%s"'%(tika_path,doc_path,output_path)
        proc=subprocess.Popen(shell_command,shell=True, env=environ,cwd=java_path)
        proc.wait()

This function works fine when I run

How to detect image in a document

白昼怎懂夜的黑 posted on 2019-12-10 21:27:15
Question: How can I detect images in a document, say doc, xls, ppt or pdf? I came across Apache Tika, and I am trying its command line option: http://tika.apache.org/1.2/gettingstarted.html But I am not quite sure how it will detect images. Any help is appreciated. Thanks.

Answer 1: You've said you want to use a command line solution, and not write any Java code, so it's not going to be the prettiest way to do it... If you are happy to write a little bit of Java, and create a new program to call from Python, then

Solr's TikaEntityProcessor not working

≯℡__Kan透↙ posted on 2019-12-10 18:13:35
Question: I'm trying to get Solr to index a database in which one column is a filename of a PDF document I'd like to index. My configuration looks like this:

    <dataConfig>
      <dataSource name="ds-db" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/document_db" user="user" password="password" readOnly="true"/>
      <dataSource name="ds-file" type="BinFileDataSource"/>
      <document name="documents">
        <entity name="document" dataSource="ds-db" query="select * from documents">
          <entity processor=

Apache tika: remove extra line breaks in result string

左心房为你撑大大i posted on 2019-12-10 17:21:46
Question: I have an HTML file:

    <html><head></head><body><div style="font-family: Verdana;font-size: 12.0px;">
    <div>Test message.</div>
    <div> </div>
    <div>More content here...</div>
    <div> </div>
    <div>Best regards,</div>
    <div>Mr. Crowley</div></div></body></html>

I am trying to get the content of the file above using Apache Tika...

    final InputStream input = new FileInputStream("file.html");
    final ContentHandler handler = new BodyContentHandler();
    final Metadata metadata = new Metadata();
    final HtmlParser htmlParser =
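
A possible post-processing step for the line-break question above, offered as a sketch rather than the accepted answer (which is not shown here): once Tika has returned the extracted string, runs of blank or whitespace-only lines can be collapsed with plain Java. The class name `BlankLineCollapser`, the method `collapse`, and the sample string are hypothetical, and the sample text only approximates what Tika might return for that HTML. No Tika API is used in this block.

```java
import java.util.regex.Pattern;

public class BlankLineCollapser {
    // Two or more line breaks in a row, where each break may be preceded by
    // horizontal whitespace (spaces, tabs, non-breaking spaces).
    private static final Pattern BLANK_RUN = Pattern.compile("(?:\\h*\\n){2,}");

    // Hypothetical helper: normalizes line endings to LF, replaces every run
    // of blank lines with a single line break, and trims both ends.
    public static String collapse(String text) {
        String normalized = text.replace("\r\n", "\n").replace('\r', '\n');
        return BLANK_RUN.matcher(normalized).replaceAll("\n").trim();
    }

    public static void main(String[] args) {
        // Assumed sample input, standing in for the string a parser returned.
        String extracted =
            "Test message.\n\n \n\nMore content here...\n\n\nBest regards,\nMr. Crowley\n";
        System.out.println(collapse(extracted));
    }
}
```

Doing the cleanup on the final string keeps it independent of which parser or content handler produced the text, at the cost of also removing blank lines that were intentional paragraph breaks.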

How to add new mime type to apache tika

两盒软妹~` posted on 2019-12-10 15:42:13
Question: This is my class for reading mime types. I am trying to add a new mime type (properties file) and read it. This is my class file:

    /*
     * To change this license header, choose License Headers in Project Properties.
     * To change this template file, choose Tools | Templates
     * and open the template in the editor.
     */
    package check_mime;

    import java.io.IOException;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import org.apache.tika.Tika;
    import org.apache.tika.mime.MimeTypes;

    public class

Is it possible to extract table information using Apache Tika?

守給你的承諾、 posted on 2019-12-10 14:14:37
Question: I am looking for a parser for PDF and MS Office document formats to extract tabular information from files. I was thinking of writing separate implementations when I saw Apache Tika. I am able to extract full text from any of these file formats, but my requirement is to extract tabular data where I am expecting 2 columns in a key-value format. I checked most of the material available on the net for a solution but could not find any. Any pointers for this?

Answer 1: Well I went ahead and implemented it