Get text layer of a PDF as is and pass it to another PDF

对着背影说爱祢 posted on 2019-12-12 01:36:10

Question


Good afternoon. I have a problem in my PDF compression project. The process is as follows:

1. Extract the images from the PDF.
2. Run OCR on them.
3. Compress the images.
4. Merge the OCR output with each compressed image and convert the result to a PDF, one per page.
5. Combine all the generated PDFs with OCR into one final PDF.

My original file is 11 MB and the compressed result is 4.2 MB. The whole process works, but the OCR step is slow. Searching the web, I saw a way to skip OCR: take the text layer of the original PDF and pass it to the final compressed PDF. I tried some code that deletes all the images from the PDF, leaving only the text layer, and then inserts my compressed images. Compared with the normal process above, though, the resulting file is larger than 4.2 MB, which does not work for me.

Looking for another solution, I found that PDF content is built from operators, which PDFBox handles through PDFStreamParser, PDStream, and COSDictionary. The operators are TJ, Tw, Tz, Tc, etc. My question: does anyone know how to pass the TJ operator, which is the one that contains the text of a PDF, to another PDF, so that the text layer of the original can be carried over to the final compressed PDF without pushing its size above 4.2 MB? The idea is not to pass the other operators, because they could increase the size of the final PDF. Or am I mistaken? If you have any other solution, I would be very grateful.

Sorry if my English is bad; if anyone knows Spanish, tell me and I will explain myself better.

Thanks.


Answer 1:


You could use our open source tool pdf2json to get the text layer from your PDF. If you want the text from OCR-scanned documents, make sure you pass "-hidden" as a parameter to the tool. It supports exporting the data to JSON and XML. Have a look at it here:

http://code.google.com/p/pdf2json/
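
As for the operator-level idea in the question, here is a minimal sketch of what "keep only the text layer" could look like on a content stream that has already been split into tokens. It does not use PDFBox: the class name TextLayerFilter, the method keepTextObjects, and the sample tokens are all hypothetical, and a real stream would first have to be tokenized (for example with the PDFStreamParser mentioned in the question). It keeps whole text objects (BT ... ET) rather than TJ alone, because TJ depends on the font and position set by operators such as Tf and Td, so copying TJ by itself would most likely not produce usable text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model: keeps only the text objects (BT ... ET) of a tokenized content stream. */
final class TextLayerFilter {

    /** Returns the tokens that lie inside BT ... ET, including both delimiters. */
    static List<String> keepTextObjects(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        boolean inText = false;
        for (String token : tokens) {
            if (token.equals("BT")) {
                inText = true;          // a text object starts here
            }
            if (inText) {
                kept.add(token);        // operands and operators inside the text object
            }
            if (token.equals("ET")) {
                inText = false;         // the text object ends here
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical page: one image painted with Do, then one invisible text run.
        // Assumption: each operand, including the TJ array, is a single token.
        List<String> page = Arrays.asList(
            "q", "612", "0", "0", "792", "0", "0", "cm", "/Im0", "Do", "Q",
            "BT", "3", "Tr", "/F1", "12", "Tf", "72", "700", "Td", "[(Hello)]", "TJ", "ET");
        // prints: BT 3 Tr /F1 12 Tf 72 700 Td [(Hello)] TJ ET
        System.out.println(String.join(" ", keepTextObjects(page)));
    }
}
```

The image drawing sequence (q ... cm /Im0 Do Q) is dropped and the text object survives intact. Note that the kept operators still refer to a font by name (/F1 here), so the target page would also need that font in its resources, which is one place where the file size can grow.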



Source: https://stackoverflow.com/questions/23685946/get-text-layer-of-a-pdf-as-is-and-pass-it-to-another-pdf
