Extract table from a PDF

Submitted by 独自空忆成欢 on 2019-12-21 02:54:26

Question


I am trying to extract a table from a PDF document.

I tried the route PDF -> HTML -> extract table. The PDF mentioned above produces garbage when converted to HTML, maybe because of the fonts; the document is not in English.

Extracting from the PDF by x and y coordinates is not an option, because this solution needs to work for future PDFs from the URL mentioned above, which will contain the table but not always in the same position.

Please help,

Thanks in advance.


Answer 1:


The PDF does not contain explicit table data. It only contains lines and character glyphs, which we tend to interpret as tables. Thus your task involves putting our human table-recognition capabilities into code, which is quite a task.
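As a rough illustration of what putting table recognition into code can involve, here is a minimal sketch in Python. It assumes some PDF library has already given you text chunks with page coordinates. The `Word` class, the tolerances, and the function names are hypothetical and not taken from any particular library. It also assumes that y grows downward and that each cell is a single left-aligned chunk. It does not help when the text itself cannot be extracted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """A text chunk with its position (hypothetical structure)."""
    text: str
    x: float  # left edge of the chunk
    y: float  # vertical position, assumed to grow downward


def group_rows(words: List[Word], y_tol: float = 2.0) -> List[List[Word]]:
    """Group chunks whose y positions lie within y_tol of the row's first chunk."""
    rows: List[List[Word]] = []
    for w in sorted(words, key=lambda w: (w.y, w.x)):
        if rows and abs(rows[-1][0].y - w.y) <= y_tol:
            rows[-1].append(w)
        else:
            rows.append([w])
    return [sorted(row, key=lambda w: w.x) for row in rows]


def column_starts(rows: List[List[Word]], x_tol: float = 5.0) -> List[float]:
    """Collect distinct left edges; edges within x_tol of a start are merged into it."""
    starts: List[float] = []
    for x in sorted(w.x for row in rows for w in row):
        if not starts or x - starts[-1] > x_tol:
            starts.append(x)
    return starts


def to_table(words: List[Word], y_tol: float = 2.0, x_tol: float = 5.0) -> List[List[str]]:
    """Place every chunk in the column whose start is nearest to its left edge."""
    rows = group_rows(words, y_tol)
    starts = column_starts(rows, x_tol)
    table: List[List[str]] = []
    for row in rows:
        cells = [""] * len(starts)
        for w in row:
            col = min(range(len(starts)), key=lambda i: abs(starts[i] - w.x))
            cells[col] = (cells[col] + " " + w.text).strip()
        table.append(cells)
    return table
```

Because columns are inferred from the chunks themselves rather than from fixed coordinates, the table may sit anywhere on the page. Real documents with multi-word cells, right-aligned numbers, or merged cells need considerably more logic than this.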

Generally speaking, if you are sure enough that future PDFs will be generated by the same software in a very similar manner, it might be worth the time to investigate the file for some easy-to-follow hints that help recognize the contents of individual fields.

Your specific document, though, has an additional shortcoming: It does not contain the required information for direct text extraction! You can try copying & pasting from Adobe Reader and you'll get (at least I do) semi-random characters from the WinAnsi range.

This is due to the fact that all fonts in the document claim to use WinAnsiEncoding even though the characters referenced this way definitely are not from the WinAnsi character selection.

Thus reliable text extraction from your document without OCR is impossible after all!

(Trying copy & paste from Adobe Reader generally is a good first test of whether text extraction is feasible at all; the text extraction methods of the Reader have been developed for many, many years and have therefore become quite good. If you cannot extract anything sensible with Adobe Reader, text extraction will be a very difficult task indeed.)
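The copy-and-paste test can be roughly automated. A minimal sketch, assuming you already have the extracted text as a string and know which script the document should be written in: it measures what share of the letters carry the expected Unicode script name. The function names, the 0.5 threshold, and the Cyrillic sample are illustrative assumptions and are not taken from the document in question.

```python
import unicodedata


def script_fraction(text: str, script_prefix: str) -> float:
    """Return the share of letters whose Unicode name starts with script_prefix."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(
        1 for ch in letters
        if unicodedata.name(ch, "").startswith(script_prefix)
    )
    return hits / len(letters)


def looks_extractable(text: str, script_prefix: str, threshold: float = 0.5) -> bool:
    """Heuristic: extraction is plausible if enough letters are in the expected script."""
    return script_fraction(text, script_prefix) >= threshold


# Hypothetical samples: a proper Cyrillic word versus the same word
# rendered as characters from the WinAnsi range.
print(looks_extractable("Привет", "CYRILLIC"))
print(looks_extractable("Ïðèâåò", "CYRILLIC"))
```

A low score does not prove the fonts are mis-encoded, but it is a cheap signal that OCR may be needed before any table detection is attempted.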




Answer 2:


You could use Tabula: http://tabula.nerdpower.org. It's free and kinda easy to use.




Answer 3:


One option is to use pdf-table-extract: https://github.com/ashima/pdf-table-extract.



Source: https://stackoverflow.com/questions/17591426/extract-table-from-a-pdf
