How can I use the Tika package (https://github.com/chrismattmann/tika-python) in Python 2.7 to parse PDF files?

南旧 2021-01-01 07:48

I'm trying to parse a few PDF files that contain engineering drawings to obtain the text data in the files. I tried using Tika as a jar with Python and using it with the jnius

5 answers

暗喜 (original poster)

2021-01-01 08:12
Can you please share the file you are looking at? The easiest way would probably be to attach it to a GitHub issue in my repository.

That said, if you are trying to use OCR with Tika, you need to work through the Tika OCR guide (http://wiki.apache.org/tika/TikaOCR) and get Tesseract installed. Once Tesseract is installed, check whether an instance of tika-server is already running (e.g., ps aux | grep tika). If one is, kill it: tika-python runs the Tika REST server in the background as its main interface to Tika, and starting a fresh instance after Tesseract OCR is installed helps rule out odd behavior from a stale server.
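For anyone who prefers to do the "is tika-server already running?" check from Python rather than the shell, here is a minimal sketch. It assumes a Unix-like system where `ps aux` is available and that the server's command line contains the string `tika-server`; the function names are made up for this example and are not part of tika-python.

```python
import subprocess


def find_tika_server_lines(ps_output):
    """Return the lines of `ps aux` output that mention tika-server.

    Lines belonging to a `grep` command are skipped so that a search
    process does not count as a running server.
    """
    return [
        line
        for line in ps_output.splitlines()
        if "tika-server" in line and "grep" not in line
    ]


def running_tika_servers():
    """Run `ps aux` and return the lines that look like tika-server processes."""
    # Assumption: a Unix-like system with `ps` on the PATH.
    output = subprocess.check_output(["ps", "aux"]).decode("utf-8", "replace")
    return find_tika_server_lines(output)
```

Calling `running_tika_servers()` returns a list of matching process lines; an empty list suggests no stale server is around. The second column of each line is the PID you would pass to `kill`.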

Once Tesseract OCR is installed and no tika-server is running, start your Python 2.7 interpreter (or script) and do something like:
```
from tika import parser

# Starts the background Tika server if needed, then parses the file
parsed = parser.from_file('/path/to/file')
print(parsed["content"])  # should be the text returned from OCR
```
HTH! --Chris
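
A small addition to the snippet above, as a sketch rather than anything official: as far as I can tell, the dict returned by `parser.from_file` can have `None` under `"content"` when Tika extracts no text, which is plausible for scanned engineering drawings if OCR is not set up. The helper names below (`extract_text`, `parse_pdf_text`) are hypothetical and not part of tika-python.

```python
def extract_text(parsed):
    """Return the stripped text from a tika-python result dict.

    Returns an empty string when the result is missing or its
    "content" entry is None or empty (i.e. no text was extracted).
    """
    content = (parsed or {}).get("content")
    return content.strip() if content else ""


def parse_pdf_text(path):
    """Parse a file with tika-python and return its text, or "" if none."""
    # Imported here so that extract_text stays usable without tika installed.
    from tika import parser

    return extract_text(parser.from_file(path))
```

If `parse_pdf_text('/path/to/file')` comes back empty for a PDF that clearly shows text, that points to OCR not being picked up, so it is worth rechecking the Tesseract install and restarting tika-server.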