apache-tika

Stopping a Tika server properly

Submitted by 瘦欲@ on 2019-12-10 13:23:49
Question: In order to start a Tika server that can be accessed from hosts other than localhost, we know that the way to go is (say I have version 1.7 and want to run on port 9998): java -jar tika-server-1.7-SNAPSHOT.jar -host 0.0.0.0. My question is: is there a proper way to stop this server with a command, or is killing the process the only way?

Answer 1: As of October 2019 there is no programmatic way to shut it down. The documentation notes: In the future, we may implement a gentler shutdown than we
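Since the server offers no shutdown command, one hedged workaround is to launch it as a child process from your own code and terminate that process when you are done. The sketch below assumes the jar name and host flag from the question; the sleep stands in for whatever work you actually do while the server is up.

    import java.io.IOException;
    import java.util.concurrent.TimeUnit;

    // Hedged sketch: tika-server 1.7 exposes no shutdown endpoint, so start it as a
    // child process and stop it by terminating that process. Jar path is an example.
    public class TikaServerRunner {
        public static void main(String[] args) throws IOException, InterruptedException {
            Process server = new ProcessBuilder(
                    "java", "-jar", "tika-server-1.7-SNAPSHOT.jar", "-host", "0.0.0.0")
                    .inheritIO()
                    .start();

            TimeUnit.MINUTES.sleep(5);   // ... use the server from other hosts ...

            server.destroy();                        // polite termination request first
            if (!server.waitFor(10, TimeUnit.SECONDS)) {
                server.destroyForcibly();            // hard kill as a fallback
            }
        }
    }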

Files locked after indexing

Submitted by 不羁的心 on 2019-12-10 13:15:50
Question: I have the following workflow in my (web) application: download a PDF file from an archive, index the file, delete the file. My problem is that after indexing the file it remains locked, and the delete step throws an exception. Here is my code snippet for indexing the file:

    try {
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(file, type);
        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
        NamedList<Object> result = server.request(req);
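A common cause of this symptom is a file handle that stays open after the update request. One hedged workaround is to read the file fully into memory and hand SolrJ a byte-array content stream instead of the File itself, so nothing keeps the file open when it is deleted. The sketch below reuses the question's names (server, file, type) and assumes the SolrJ 4.x API that the snippet is written against.

    import java.io.File;
    import java.nio.file.Files;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
    import org.apache.solr.common.util.ContentStreamBase;

    // Hedged sketch: index from an in-memory stream so no handle remains on the file.
    public class IndexThenDelete {
        void indexAndDelete(SolrServer server, File file, String type) throws Exception {
            byte[] bytes = Files.readAllBytes(file.toPath());

            ContentStreamBase.ByteArrayStream stream =
                    new ContentStreamBase.ByteArrayStream(bytes, file.getName());
            stream.setContentType(type);

            ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
            req.addContentStream(stream);
            req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
            server.request(req);

            // The request never opened the file itself, so deleting it should succeed.
            Files.delete(file.toPath());
        }
    }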

Adding language profile to Apache Tika

Submitted by 喜夏-厌秋 on 2019-12-10 13:05:40
Question: Could anybody who has managed to do this please explain how :-) Do I need to get n-gram files for the language I want to add? Is it a matter of creating tika.language.override.properties, adding some other language codes, and putting a lang-code.ngp n-gram file on the classpath? In that case, where do I get such a file, and if it is just a matter of this, why doesn't Tika support more languages? The following languages are currently supported for language detection: da, de, et, el, en, es, fi, fr, hu, is, it, lt, nl, no
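As a do-it-yourself alternative to wiring a .ngp file into tika.language.override.properties, a profile can also be built at runtime from your own corpus and compared against unknown text by n-gram distance. This is only a hedged sketch: farsi-corpus.txt is a hypothetical plain-text corpus you would have to supply, and the distance threshold is something you would have to calibrate yourself.

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.tika.language.LanguageProfile;
    import org.apache.tika.language.ProfilingWriter;

    // Hedged sketch: build an n-gram profile from a corpus and compare distances manually.
    public class CustomLanguageCheck {
        public static void main(String[] args) throws Exception {
            String corpus = new String(
                    Files.readAllBytes(Paths.get("farsi-corpus.txt")), StandardCharsets.UTF_8);

            ProfilingWriter corpusWriter = new ProfilingWriter();
            corpusWriter.append(corpus);
            LanguageProfile customProfile = corpusWriter.getProfile();

            ProfilingWriter textWriter = new ProfilingWriter();
            textWriter.append("...text whose language you want to test...");
            LanguageProfile candidate = textWriter.getProfile();

            // Smaller distance means more similar n-gram statistics.
            System.out.println("distance to custom profile: " + customProfile.distance(candidate));
        }
    }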

Word, PDF document parsing - Hadoop / Java in general

Submitted by 我的未来我决定 on 2019-12-10 11:36:53
Question: My objective is to load MS Word, PDF, etc. documents onto HDFS, extract certain 'content' out of each document, and use it further for some analysis. Instead of beginning to fiddle with InputFormat etc., I thought that libraries like Tika could be used and incorporated into MR. The partial content of one of the Word documents is as follows: 6. Statement of Strategy We have 4 strategic interventions that will deliver a competitive advantage. Innovate upstream and downstream 1. Biopulp. We will execute
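The Tika side of this is small: AutoDetectParser only needs an InputStream, which is exactly what a mapper reading whole files from HDFS can provide. The sketch below shows just that extraction call; the mapper and InputFormat that feed it are assumed, not shown.

    import java.io.InputStream;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    // Hedged sketch: the extraction step a mapper could delegate to for Word/PDF content.
    public class TikaTextExtractor {
        public String extract(InputStream in) throws Exception {
            BodyContentHandler handler = new BodyContentHandler(-1);   // -1 = no write limit
            Metadata metadata = new Metadata();
            new AutoDetectParser().parse(in, handler, metadata, new ParseContext());
            return handler.toString();   // plain text, ready for further analysis
        }
    }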

How to process/extract .pst files using Hadoop MapReduce

Submitted by 痴心易碎 on 2019-12-10 11:30:25
Question: I am using MAPI tools (a Microsoft library, in .NET) and then the Apache Tika libraries to process and extract PST files from an Exchange server, which is not scalable. How can I process/extract PSTs the MR way? Is there any tool or library available in Java which I can use in my MR jobs? Any help would be greatly appreciated. The JPst library internally uses PstFile pstFile = new PstFile(java.io.File), and the problem is that for the Hadoop APIs we don't have anything close to java.io.File. The following option is always
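One hedged way to bridge the gap between HDFS and a library that insists on java.io.File is to copy the .pst to the task's local disk inside the mapper and open the local copy. The sketch below shows only that localization step; the PST-library call is left as a placeholder comment because the exact class depends on which library is used, and the /tmp path is an example.

    import java.io.File;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Hedged sketch: copy a PST from HDFS to local disk so a File-based library can open it.
    public class PstLocalizer {
        public File localize(Configuration conf, Path hdfsPst) throws Exception {
            FileSystem fs = FileSystem.get(conf);
            Path local = new Path("/tmp/" + hdfsPst.getName());
            fs.copyToLocalFile(hdfsPst, local);

            File localFile = new File(local.toUri().getPath());
            // new PstFile(localFile);  // hand the local copy to the PST library here
            return localFile;
        }
    }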

How to extract metatags from HTML files and index them in Solr with Tika

Submitted by 筅森魡賤 on 2019-12-10 00:24:50
Question: I am trying to extract the metatags of HTML files and index them into Solr with Tika integration. I am not able to extract those metatags with Tika or display them in Solr. My HTML file looks like this:

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <meta name="product_id" content="11"/>
    <meta name="assetid" content="10001"/>
    <meta name="title" content="title of the article"/>
    <meta name="type" content="0xyzb"/>
    <meta name="category" content="article category
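Before debugging the Solr field mapping, it can help to confirm that Tika itself sees the meta tags: Tika's HTML parser copies <meta name="..." content="..."> pairs into the Metadata object, keyed by the meta name. A hedged standalone check, with page.html as a placeholder path:

    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.html.HtmlParser;
    import org.apache.tika.sax.BodyContentHandler;

    // Hedged sketch: verify outside Solr that the meta tags are extractable with Tika.
    public class MetaTagCheck {
        public static void main(String[] args) throws Exception {
            Metadata metadata = new Metadata();
            try (InputStream in = new FileInputStream("page.html")) {
                new HtmlParser().parse(in, new BodyContentHandler(), metadata, new ParseContext());
            }
            System.out.println("product_id = " + metadata.get("product_id"));
            System.out.println("title      = " + metadata.get("title"));
        }
    }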

Retrieving extracted text with Apache Solr

Submitted by 我的梦境 on 2019-12-09 22:02:48
Question: I'm new to Apache Solr, and I want to use it for indexing PDF files. I have managed to get it up and running so far, and I can now search for added PDF files. However, I need to be able to retrieve the searched text from the results. I found an XML snippet in the default solrconfig.xml concerning exactly that:

    <requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler" startup="lazy">
      <lst name="defaults">
        <!-- All the main content goes into "text"...
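For the extracted body to come back in query results, the field it is mapped into (the default config routes it to "text") must be stored in the schema, and the query must ask for it in the field list. A hedged SolrJ sketch follows; the core URL and the stored field name "content" are assumptions, not values taken from the question's configuration.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrDocument;

    // Hedged sketch: request the stored extracted-text field back with each hit.
    public class RetrieveExtractedText {
        public static void main(String[] args) throws Exception {
            SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

            SolrQuery query = new SolrQuery("your search terms");
            query.setFields("id", "content");   // "content" must be stored="true" in the schema

            for (SolrDocument doc : server.query(query).getResults()) {
                System.out.println(doc.getFieldValue("id") + ": " + doc.getFieldValue("content"));
            }
        }
    }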

How can I detect Farsi web pages with Tika?

Submitted by 微笑、不失礼 on 2019-12-09 06:59:23
Question: I need sample code to help me detect Farsi-language web pages with the Apache Tika toolkit.

    LanguageIdentifier identifier = new LanguageIdentifier("فارسی");
    String language = identifier.getLanguage();

I have downloaded the Apache Tika jar files and added them to the classpath, but this code gives an error for Farsi while it works for English. How can I add Farsi to the LanguageIdentifier package of Tika?

Answer 1: Tika doesn't ship with a language profile for the Farsi language yet. As of version 1.0, 27
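One alternative to adding a profile to the classic LanguageIdentifier is the newer tika-langdetect module (available in later 1.x releases), whose bundled Optimaize models cover many more languages, Persian ("fa") among them. A hedged sketch, assuming tika-langdetect and its dependencies are on the classpath:

    import org.apache.tika.langdetect.OptimaizeLangDetector;
    import org.apache.tika.language.detect.LanguageDetector;
    import org.apache.tika.language.detect.LanguageResult;

    // Hedged sketch: detect Farsi via the Optimaize-backed detector instead of LanguageIdentifier.
    public class FarsiDetection {
        public static void main(String[] args) throws Exception {
            LanguageDetector detector = new OptimaizeLangDetector().loadModels();
            LanguageResult result = detector.detect("این یک متن فارسی است");
            System.out.println(result.getLanguage() + " (score: " + result.getRawScore() + ")");
        }
    }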

How to get file extension from content type?

Submitted by 邮差的信 on 2019-12-08 15:07:44
Question: I'm using Apache Tika, and I have files (without extensions) of a particular content type that need to be renamed so that their extensions reflect the content type. Any idea if there is something I could use instead of programming that from scratch based on content-type names?

Answer 1: The two key classes for you are MediaTypeRegistry and MimeTypes. Using these, you can do mime-type magic-based detection and get information on the mime types and their relationships. TikaConfig config = TikaConfig
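Along the lines the (truncated) answer starts to describe, a hedged sketch of the lookup: fetch the MimeTypes registry from the default TikaConfig, resolve a content-type name, and ask the MimeType for its preferred extension. The "application/pdf" literal is just an example input.

    import org.apache.tika.config.TikaConfig;
    import org.apache.tika.mime.MimeType;
    import org.apache.tika.mime.MimeTypes;

    // Hedged sketch: map a content-type name to its preferred file extension.
    public class ExtensionLookup {
        public static void main(String[] args) throws Exception {
            TikaConfig config = TikaConfig.getDefaultConfig();
            MimeTypes mimeTypes = config.getMimeRepository();

            MimeType pdf = mimeTypes.forName("application/pdf");
            System.out.println(pdf.getExtension());   // prints ".pdf"
        }
    }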

Compare two PDF files (approach) using Java [closed]

Submitted by 两盒软妹~` on 2019-12-08 12:23:07
Question: [Closed: this question needs to be more focused and is not currently accepting answers.] I need to write a Java class that compares two PDF files and points out the differences (differences in text/position/font) using some sort of highlighting. My initial approach was to use PDFBox to parse the files and store the extracted text in some data structure
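For the text half of that approach, a hedged starting point (assuming PDFBox 2.x) is to extract each page's text with PDFTextStripper and compare the pages pairwise. Position and font differences would need the stripper's per-character TextPosition data and are not covered by this sketch; the file names are placeholders.

    import java.io.File;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    // Hedged sketch: report pages whose extracted text differs between two PDFs.
    public class PdfTextDiff {
        public static void main(String[] args) throws Exception {
            try (PDDocument a = PDDocument.load(new File("a.pdf"));
                 PDDocument b = PDDocument.load(new File("b.pdf"))) {

                PDFTextStripper stripper = new PDFTextStripper();
                int pages = Math.min(a.getNumberOfPages(), b.getNumberOfPages());
                for (int page = 1; page <= pages; page++) {
                    stripper.setStartPage(page);
                    stripper.setEndPage(page);
                    if (!stripper.getText(a).equals(stripper.getText(b))) {
                        System.out.println("Page " + page + " differs");
                    }
                }
            }
        }
    }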