apache-tika

Stopping a Tika server properly

Submitted by 瘦欲@ on 2019-12-10 13:23:49
Question: In order to start a Tika server that can be accessed from hosts other than localhost, we know that the way to go is (say I have version 1.7 and want to run on port 9998): java -jar tika-server-1.7-SNAPSHOT.jar -host 0.0.0.0. My question is: is there a proper way to stop this server with a command, or is killing the process the only way?

Answer 1: As of October 2019 there is no programmatic way to shut it down. The documentation notes: In the future, we may implement a gentler shutdown than we
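Since the server offers no shutdown command, one hedged workaround is to launch it as a child process from your own code and terminate that process when you are done. The sketch below assumes the jar name and host flag from the question; the sleep stands in for whatever work you actually do while the server is up.

    import java.io.IOException;
    import java.util.concurrent.TimeUnit;

    // Hedged sketch: tika-server 1.7 exposes no shutdown endpoint, so start it as a
    // child process and stop it by terminating that process. Jar path is an example.
    public class TikaServerRunner {
        public static void main(String[] args) throws IOException, InterruptedException {
            Process server = new ProcessBuilder(
                    "java", "-jar", "tika-server-1.7-SNAPSHOT.jar", "-host", "0.0.0.0")
                    .inheritIO()
                    .start();

            TimeUnit.MINUTES.sleep(5);   // ... use the server from other hosts ...

            server.destroy();                        // polite termination request first
            if (!server.waitFor(10, TimeUnit.SECONDS)) {
                server.destroyForcibly();            // hard kill as a fallback
            }
        }
    }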

Files locked after indexing

Submitted by 不羁的心 on 2019-12-10 13:15:50
Question: I have the following workflow in my (web) application: download a PDF file from an archive, index the file, delete the file. My problem is that after indexing the file it remains locked, and the delete step throws an exception. Here is my code snippet for indexing the file:

    try {
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(file, type);
        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
        NamedList<Object> result = server.request(req);
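A common cause of this symptom is a file handle that stays open after the update request. One hedged workaround is to read the file fully into memory and hand SolrJ a byte-array content stream instead of the File itself, so nothing keeps the file open when it is deleted. The sketch below reuses the question's names (server, file, type) and assumes the SolrJ 4.x API that the snippet is written against.

    import java.io.File;
    import java.nio.file.Files;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
    import org.apache.solr.common.util.ContentStreamBase;

    // Hedged sketch: index from an in-memory stream so no handle remains on the file.
    public class IndexThenDelete {
        void indexAndDelete(SolrServer server, File file, String type) throws Exception {
            byte[] bytes = Files.readAllBytes(file.toPath());

            ContentStreamBase.ByteArrayStream stream =
                    new ContentStreamBase.ByteArrayStream(bytes, file.getName());
            stream.setContentType(type);

            ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
            req.addContentStream(stream);
            req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
            server.request(req);

            // The request never opened the file itself, so deleting it should succeed.
            Files.delete(file.toPath());
        }
    }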

Adding language profile to Apache Tika

Submitted by 喜夏-厌秋 on 2019-12-10 13:05:40
Question: Could anybody who has managed to do this please explain how :-) Do I need to get n-gram files for the language I want to add? Is it a matter of creating tika.language.override.properties, adding some other language codes, and putting a lang-code.ngp n-gram file on the classpath? In that case, where do I get such a file, and if it is just a matter of this, why doesn't Tika support more languages? The following languages are currently supported for language detection: da, de, et, el, en, es, fi, fr, hu, is, it, lt, nl, no
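As a do-it-yourself alternative to wiring a .ngp file into tika.language.override.properties, a profile can also be built at runtime from your own corpus and compared against unknown text by n-gram distance. This is only a hedged sketch: farsi-corpus.txt is a hypothetical plain-text corpus you would have to supply, and the distance threshold is something you would have to calibrate yourself.

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.tika.language.LanguageProfile;
    import org.apache.tika.language.ProfilingWriter;

    // Hedged sketch: build an n-gram profile from a corpus and compare distances manually.
    public class CustomLanguageCheck {
        public static void main(String[] args) throws Exception {
            String corpus = new String(
                    Files.readAllBytes(Paths.get("farsi-corpus.txt")), StandardCharsets.UTF_8);

            ProfilingWriter corpusWriter = new ProfilingWriter();
            corpusWriter.append(corpus);
            LanguageProfile customProfile = corpusWriter.getProfile();

            ProfilingWriter textWriter = new ProfilingWriter();
            textWriter.append("...text whose language you want to test...");
            LanguageProfile candidate = textWriter.getProfile();

            // Smaller distance means more similar n-gram statistics.
            System.out.println("distance to custom profile: " + customProfile.distance(candidate));
        }
    }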

Word, PDF document parsing - Hadoop / Java in general

Submitted by 我的未来我决定 on 2019-12-10 11:36:53
Question: My objective is to load MS Word, PDF, etc. documents onto HDFS, extract certain 'content' out of each document, and use it further for some analysis. Instead of beginning to fiddle with InputFormat etc., I thought that libraries like Tika could be used and incorporated into MR. The partial content of one of the Word documents is as follows: 6. Statement of Strategy We have 4 strategic interventions that will deliver a competitive advantage. Innovate upstream and downstream 1. Biopulp. We will execute
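The Tika side of this is small: AutoDetectParser only needs an InputStream, which is exactly what a mapper reading whole files from HDFS can provide. The sketch below shows just that extraction call; the mapper and InputFormat that feed it are assumed, not shown.

    import java.io.InputStream;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    // Hedged sketch: the extraction step a mapper could delegate to for Word/PDF content.
    public class TikaTextExtractor {
        public String extract(InputStream in) throws Exception {
            BodyContentHandler handler = new BodyContentHandler(-1);   // -1 = no write limit
            Metadata metadata = new Metadata();
            new AutoDetectParser().parse(in, handler, metadata, new ParseContext());
            return handler.toString();   // plain text, ready for further analysis
        }
    }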

How to process/extract .pst files using Hadoop MapReduce

Submitted by 痴心易碎 on 2019-12-10 11:30:25
Question: I am using MAPI tools (a Microsoft library, in .NET) and then the Apache Tika libraries to process and extract PST files from an Exchange server, which is not scalable. How can I process/extract PSTs the MR way? Is there any tool or library available in Java which I can use in my MR jobs? Any help would be greatly appreciated. The JPst library internally uses PstFile pstFile = new PstFile(java.io.File), and the problem is that for the Hadoop APIs we don't have anything close to java.io.File. The following option is always
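One hedged way to bridge the gap between HDFS and a library that insists on java.io.File is to copy the .pst to the task's local disk inside the mapper and open the local copy. The sketch below shows only that localization step; the PST-library call is left as a placeholder comment because the exact class depends on which library is used, and the /tmp path is an example.

    import java.io.File;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Hedged sketch: copy a PST from HDFS to local disk so a File-based library can open it.
    public class PstLocalizer {
        public File localize(Configuration conf, Path hdfsPst) throws Exception {
            FileSystem fs = FileSystem.get(conf);
            Path local = new Path("/tmp/" + hdfsPst.getName());
            fs.copyToLocalFile(hdfsPst, local);

            File localFile = new File(local.toUri().getPath());
            // new PstFile(localFile);  // hand the local copy to the PST library here
            return localFile;
        }
    }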

How to extract metatags from HTML files and index them in Solr with Tika

Submitted by 筅森魡賤 on 2019-12-10 00:24:50
Question: I am trying to extract the metatags of HTML files and index them into Solr with Tika integration. I am not able to extract those metatags with Tika or display them in Solr. My HTML file looks like this:

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <meta name="product_id" content="11"/>
    <meta name="assetid" content="10001"/>
    <meta name="title" content="title of the article"/>
    <meta name="type" content="0xyzb"/>
    <meta name="category" content="article category
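Before debugging the Solr field mapping, it can help to confirm that Tika itself sees the meta tags: Tika's HTML parser copies <meta name="..." content="..."> pairs into the Metadata object, keyed by the meta name. A hedged standalone check, with page.html as a placeholder path:

    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.html.HtmlParser;
    import org.apache.tika.sax.BodyContentHandler;

    // Hedged sketch: verify outside Solr that the meta tags are extractable with Tika.
    public class MetaTagCheck {
        public static void main(String[] args) throws Exception {
            Metadata metadata = new Metadata();
            try (InputStream in = new FileInputStream("page.html")) {
                new HtmlParser().parse(in, new BodyContentHandler(), metadata, new ParseContext());
            }
            System.out.println("product_id = " + metadata.get("product_id"));
            System.out.println("title      = " + metadata.get("title"));
        }
    }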

Retrieving extracted text with Apache Solr

Submitted by 我的梦境 on 2019-12-09 22:02:48
Question: I'm new to Apache Solr, and I want to use it for indexing PDF files. I have managed to get it up and running so far, and I can now search for added PDF files. However, I need to be able to retrieve the searched text from the results. I found an XML snippet in the default solrconfig.xml concerning exactly that:

    <requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler" startup="lazy">
      <lst name="defaults">
        <!-- All the main content goes into "text"...
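For the extracted body to come back in query results, the field it is mapped into (the default config routes it to "text") must be stored in the schema, and the query must ask for it in the field list. A hedged SolrJ sketch follows; the core URL and the stored field name "content" are assumptions, not values taken from the question's configuration.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrDocument;

    // Hedged sketch: request the stored extracted-text field back with each hit.
    public class RetrieveExtractedText {
        public static void main(String[] args) throws Exception {
            SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

            SolrQuery query = new SolrQuery("your search terms");
            query.setFields("id", "content");   // "content" must be stored="true" in the schema

            for (SolrDocument doc : server.query(query).getResults()) {
                System.out.println(doc.getFieldValue("id") + ": " + doc.getFieldValue("content"));
            }
        }
    }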

How can I detect Farsi web pages with Tika?

Submitted by 微笑、不失礼 on 2019-12-09 06:59:23
Question: I need sample code to help me detect Farsi-language web pages with the Apache Tika toolkit.

    LanguageIdentifier identifier = new LanguageIdentifier("فارسی");
    String language = identifier.getLanguage();

I have downloaded the Apache Tika jar files and added them to the classpath, but this code gives an error for Farsi while it works for English. How can I add Farsi to the LanguageIdentifier package of Tika?

Answer 1: Tika doesn't ship with a language profile for the Farsi language yet. As of version 1.0, 27
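One alternative to adding a profile to the classic LanguageIdentifier is the newer tika-langdetect module (available in later 1.x releases), whose bundled Optimaize models cover many more languages, Persian ("fa") among them. A hedged sketch, assuming tika-langdetect and its dependencies are on the classpath:

    import org.apache.tika.langdetect.OptimaizeLangDetector;
    import org.apache.tika.language.detect.LanguageDetector;
    import org.apache.tika.language.detect.LanguageResult;

    // Hedged sketch: detect Farsi via the Optimaize-backed detector instead of LanguageIdentifier.
    public class FarsiDetection {
        public static void main(String[] args) throws Exception {
            LanguageDetector detector = new OptimaizeLangDetector().loadModels();
            LanguageResult result = detector.detect("این یک متن فارسی است");
            System.out.println(result.getLanguage() + " (score: " + result.getRawScore() + ")");
        }
    }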

How to get file extension from content type?

Submitted by 邮差的信 on 2019-12-08 15:07:44
Question: I'm using Apache Tika, and I have files (without extensions) of a particular content type that need to be renamed so that their extensions reflect the content type. Any idea if there is something I could use instead of programming that from scratch based on content-type names?

Answer 1: The two key classes for you are MediaTypeRegistry and MimeTypes. Using these, you can do mime-type magic-based detection and get information on the mime types and their relationships. TikaConfig config = TikaConfig
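Along the lines the (truncated) answer starts to describe, a hedged sketch of the lookup: fetch the MimeTypes registry from the default TikaConfig, resolve a content-type name, and ask the MimeType for its preferred extension. The "application/pdf" literal is just an example input.

    import org.apache.tika.config.TikaConfig;
    import org.apache.tika.mime.MimeType;
    import org.apache.tika.mime.MimeTypes;

    // Hedged sketch: map a content-type name to its preferred file extension.
    public class ExtensionLookup {
        public static void main(String[] args) throws Exception {
            TikaConfig config = TikaConfig.getDefaultConfig();
            MimeTypes mimeTypes = config.getMimeRepository();

            MimeType pdf = mimeTypes.forName("application/pdf");
            System.out.println(pdf.getExtension());   // prints ".pdf"
        }
    }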

Compare two PDF files (approach) using Java [closed]

Submitted by 两盒软妹~` on 2019-12-08 12:23:07
Question: [Closed: this question needs to be more focused and is not currently accepting answers.] I need to write a Java class that compares two PDF files and points out the differences (differences in text/position/font) using some sort of highlighting. My initial approach was to use PDFBox to parse the files and store the extracted text in some data structure
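For the text half of that approach, a hedged starting point (assuming PDFBox 2.x) is to extract each page's text with PDFTextStripper and compare the pages pairwise. Position and font differences would need the stripper's per-character TextPosition data and are not covered by this sketch; the file names are placeholders.

    import java.io.File;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    // Hedged sketch: report pages whose extracted text differs between two PDFs.
    public class PdfTextDiff {
        public static void main(String[] args) throws Exception {
            try (PDDocument a = PDDocument.load(new File("a.pdf"));
                 PDDocument b = PDDocument.load(new File("b.pdf"))) {

                PDFTextStripper stripper = new PDFTextStripper();
                int pages = Math.min(a.getNumberOfPages(), b.getNumberOfPages());
                for (int page = 1; page <= pages; page++) {
                    stripper.setStartPage(page);
                    stripper.setEndPage(page);
                    if (!stripper.getText(a).equals(stripper.getText(b))) {
                        System.out.println("Page " + page + " differs");
                    }
                }
            }
        }
    }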