Extract text from PDF in code

Posted by 拥有回忆 on 2019-12-04 15:14:17

Unfortunately, I have not worked with Java, so you will have to implement this in Java yourself. Here is how I finally did it:

1) I fetched the file from your link. In PHP this is done with @fopen("http://...").

2) I opened it in binary mode (this is important) and extracted two parts:

2.1) The data of 3 0 obj, which holds the creation and modification dates. I did it with a regex. It was simple, and the object itself is quoted further down.

2.2) The data stream of 5 0 obj, which holds the deflated data. IMPORTANT! Microsoft Excel inserts the two bytes 0D 0A as a line break. Do not forget this when you filter the content with a regexp: these bytes at the start and at the end must not be included in the extracted string.

3) I inflated the compressed data with $uncompressed = @gzuncompress($compressed) and wrote it to an external file. You can see the results there.

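4) The funniest part. The raw data inside the file is in textual format. It looks like [(V)-4(RI)16(J)] TJ and means VRIJ: the parenthesized pieces are the text, and the numbers between them are spacing adjustments, not characters. You can read about text in PDF in the PDF Reference v1.7, part 5.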

5) I believe regular expressions can help you extract and/or transform the data.
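Steps 2 to 4 can be sketched roughly as follows. This is Python rather than PHP or Java, only as an illustration; zlib.decompress plays the role of gzuncompress here, since both handle zlib-format data. The function names inflate_object_stream and tj_texts are made up for this example, and the regexes are deliberately naive: they assume an uncompressed object layout, a single FlateDecode filter, and no "]" inside a text string. A more careful version would use the /Length entry of the stream dictionary instead of searching for "endstream".

```python
import re
import zlib

# "stream" + line break, then the payload, then line break + "endstream".
# The optional \r?\n on both sides keeps the 0D 0A bytes out of the capture.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)
# One TJ operand array, e.g. [(V)-4(RI)16(J)] TJ
TJ_RE = re.compile(rb"\[(.*?)\]\s*TJ", re.DOTALL)
# One literal string inside the array; tolerates backslash escapes such as \)
STRING_RE = re.compile(rb"\(((?:\\.|[^\\()])*)\)", re.DOTALL)


def inflate_object_stream(pdf: bytes, obj_num: int) -> bytes:
    """Find 'obj_num G obj ... endobj' and inflate its stream (hypothetical helper)."""
    obj = re.search(rb"(?<!\d)%d\s+\d+\s+obj(.*?)endobj" % obj_num, pdf, re.DOTALL)
    if obj is None:
        raise ValueError(f"object {obj_num} not found")
    stream = STREAM_RE.search(obj.group(1))
    if stream is None:
        raise ValueError(f"object {obj_num} has no stream")
    return zlib.decompress(stream.group(1))


def tj_texts(content: bytes) -> list[str]:
    """Join the string pieces of every TJ array, dropping the spacing numbers."""
    texts = []
    for array in TJ_RE.findall(content):
        pieces = [
            # Only \( \) and \\ are unescaped; octal and \n escapes are left as-is.
            re.sub(rb"\\([()\\])", rb"\1", piece)
            for piece in STRING_RE.findall(array)
        ]
        # latin-1 is an assumption that fits simple single-byte fonts only.
        texts.append(b"".join(pieces).decode("latin-1"))
    return texts


if __name__ == "__main__":
    # Synthetic stand-in for the real file, which must be read in binary mode.
    content = b"BT\r\n[(V)-4(RI)16(J)] TJ\r\nET"
    pdf = (
        b"5 0 obj\r\n<</Filter/FlateDecode>>\r\nstream\r\n"
        + zlib.compress(content)
        + b"\r\nendstream\r\nendobj\r\n"
    )
    print(tj_texts(inflate_object_stream(pdf, 5)))  # prints ['VRIJ']
```

For a real file, read it with open(path, "rb").read() and pass the content object number; 5 is only the number that happened to apply to this particular PDF.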

IMPORTANT: I said "data stream from 5 0 obj", but the object number is subject to change. You must follow the references to the object through the dictionary->pages->page->content chain. A description of these "bread crumbs" can be found in the manual I mentioned above.
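A rough sketch of following that chain, again in Python and again with made-up helper names (get_object, first_ref, all_refs, content_object_numbers). It assumes a classic plain-text trailer with a /Root entry, a flat page tree (every entry in /Kids is a page, not a nested /Pages node), and no object streams. If /Contents is an array, only its first entry is returned. Real files can break any of these assumptions.

```python
import re

# An indirect reference such as "5 0 R"; group 1 is the object number.
REF = rb"(\d+)\s+\d+\s+R"


def get_object(pdf: bytes, num: int) -> bytes:
    """Return the raw bytes between 'num G obj' and 'endobj'."""
    m = re.search(rb"(?<!\d)%d\s+\d+\s+obj(.*?)endobj" % num, pdf, re.DOTALL)
    if m is None:
        raise KeyError(num)
    return m.group(1)


def first_ref(body: bytes, key: bytes) -> int:
    """Object number of the first reference stored under /key."""
    m = re.search(rb"/" + key + rb"\s*\[?\s*" + REF, body)
    if m is None:
        raise KeyError(key)
    return int(m.group(1))


def all_refs(body: bytes, key: bytes) -> list[int]:
    """Object numbers of every reference in the array stored under /key."""
    m = re.search(rb"/" + key + rb"\s*\[(.*?)\]", body, re.DOTALL)
    return [int(n) for n in re.findall(REF, m.group(1))] if m else []


def content_object_numbers(pdf: bytes) -> list[int]:
    """Walk trailer /Root -> catalog /Pages -> /Kids -> each page's /Contents."""
    catalog = get_object(pdf, first_ref(pdf, b"Root"))
    pages = get_object(pdf, first_ref(catalog, b"Pages"))
    return [
        first_ref(get_object(pdf, kid), b"Contents")
        for kid in all_refs(pages, b"Kids")
    ]


if __name__ == "__main__":
    # Tiny synthetic document, only to show the chain; not the real file.
    pdf = (
        b"%PDF-1.4\n"
        b"1 0 obj\n<</Type/Catalog/Pages 2 0 R>>\nendobj\n"
        b"2 0 obj\n<</Type/Pages/Count 1/Kids[4 0 R]>>\nendobj\n"
        b"4 0 obj\n<</Type/Page/Parent 2 0 R/Contents 5 0 R>>\nendobj\n"
        b"5 0 obj\n<</Length 0>>\nstream\n\nendstream\nendobj\n"
        b"trailer\n<</Size 6/Root 1 0 R>>\n"
    )
    print(content_object_numbers(pdf))  # prints [5]
```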

Unfortunately, Excel does not embed any table structure in the PDF, but you can find the coordinates of the text fragments and interpret them. Either way, it is a mess.

Do you think, dear Merlin, that it is hard? No, dear, it is not. It is not hard, because there are no Unicode symbols. Unicode in a PDF REALLY SUCKS!

Good luck!

This PDF was made by Microsoft Excel and has these date stamps:

3 0 obj
<</Author(Janszen, Jan) 
/CreationDate(D:20120613153635+02'00') 
/ModDate(D:20120613153635+02'00') 
/Producer(Microsoft® Excel® 2010) 
/Creator(Microsoft® Excel® 2010)>>
endobj

You can use almost any programming language to fetch the file by URL and extract the "ModDate" value. A new ModDate means the information was updated. You do not need any libraries to extract it: it is plain text in the file, on lines 9, 10 and 11.
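As a minimal sketch of that check, here in Python with only the standard library: the function name mod_date and the SAMPLE bytes are made up for this example, and the regex only covers dates of the form D:YYYYMMDDHHmmSS with an optional +HH'mm' or -HH'mm' offset, as in the object above. Shorter dates and the "Z" suffix allowed by the PDF format are not handled.

```python
import re
from datetime import datetime
from typing import Optional

# /ModDate(D:20120613153635+02'00') -> digits, sign, offset hours, offset minutes
MODDATE_RE = re.compile(
    rb"/ModDate\s*\(D:(\d{14})(?:([+-])(\d{2})'(\d{2})'?)?"
)


def mod_date(pdf: bytes) -> Optional[datetime]:
    """Return the ModDate found in the raw file bytes, or None if absent."""
    m = MODDATE_RE.search(pdf)
    if m is None:
        return None
    stamp = m.group(1).decode("ascii")
    if m.group(2) is None:
        # No UTC offset in the file: return a naive datetime.
        return datetime.strptime(stamp, "%Y%m%d%H%M%S")
    offset = (m.group(2) + m.group(3) + m.group(4)).decode("ascii")
    return datetime.strptime(stamp + offset, "%Y%m%d%H%M%S%z")


if __name__ == "__main__":
    # Stand-in for the downloaded bytes, e.g. urllib.request.urlopen(url).read()
    # with your own url.
    SAMPLE = (
        b"3 0 obj\r\n<</Author(Janszen, Jan) \r\n"
        b"/CreationDate(D:20120613153635+02'00') \r\n"
        b"/ModDate(D:20120613153635+02'00') \r\n>>\r\nendobj\r\n"
    )
    print(mod_date(SAMPLE).isoformat())  # prints 2012-06-13T15:36:35+02:00
```

To detect an update, store the value from the previous run and compare it with the one you get now.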

Ask Jan Janszen to add you to the distribution list. The data in the file is encoded, so you would have to use a lot of programming techniques to reach the source and restore the information.
