Get text layer of a PDF as is and pass it to another PDF

Question

Good afternoon. I have a problem in my project, which does PDF compression. The process is as follows:

1. Extract the images from the PDF.
2. Run OCR on them.
3. Compress the images.
4. Merge the OCR text with each compressed image and convert the result to one PDF per page.
5. Combine all the generated OCR'd PDFs into a single PDF as the final product.

My original file is 11 MB and the compressed result is 4.2 MB. The whole process works perfectly, but the OCR step is slow.

Searching the web, I saw a way to skip that step: take the text layer of the original PDF and pass it to the final compressed PDF. I tried some code that deletes all images from the original PDF, leaving only the text layer, and then inserts my compressed images. The problem is that, compared with the normal process described above, the resulting file ends up larger than 4.2 MB, which is not acceptable for me.

Looking for another solution, I found that PDF content stream operators can be handled with PDFBox through PDFStreamParser, PDStream and COSDictionary. The operators are TJ, Tw, Tz, Tc, etc.

My question: does anyone know how to copy the TJ operator, which is the one that holds the text of a PDF, into another PDF, so that the text layer of the original can be passed to the final compressed PDF without pushing its size above 4.2 MB? The idea is not to copy the other operators, because they could increase the size of the final PDF. Or am I mistaken? If you have any other solution that would help, I would be very grateful.

Sorry if my English is bad. If anyone knows Spanish, tell me and I can express myself better.

Thanks.

Answer 1:

You could use our open source tool pdf2json to get the text layer from your pdf. Just make sure you pass "-hidden" as a parameter to the tool when using it if you want to get text from OCR scanned documents. It supports exporting your data to JSON and XML. Have a look at it here:

http://code.google.com/p/pdf2json/
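
On the operator-based idea from the question, here is a rough, stdlib-only Java sketch of what "keep only the text layer" could look like for a single, already decoded page content stream. It is not PDFBox code: `TextObjectFilter`, `tokenize` and `keepTextObjects` are made-up names for illustration, and the tokenizer is deliberately simplified (it assumes no comments and no inline images). In a real project the tokenizing would more likely be done by PDFBox's PDFStreamParser, which the question already mentions. Note that it keeps whole `BT ... ET` text objects rather than `TJ` alone, because `TJ` depends on the font and position set by operators such as `Tf` and `Td`, and the font named by `Tf` must also exist in the target page's resources.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: filters a decoded content stream down to its text objects.
class TextObjectFilter {

    // Splits a content stream into tokens, keeping (...) and <...> strings whole.
    // Simplifying assumption: no comments (%) and no inline images (BI ... EI).
    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        int i = 0, n = s.length();
        while (i < n) {
            char c = s.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            int start = i;
            if (c == '(') {
                // Literal string: track nested parentheses and backslash escapes.
                int depth = 0;
                do {
                    char d = s.charAt(i);
                    if (d == '\\') i++;            // skip the escaped character
                    else if (d == '(') depth++;
                    else if (d == ')') depth--;
                    i++;
                } while (i < n && depth > 0);
            } else if (c == '<' && !(i + 1 < n && s.charAt(i + 1) == '<')) {
                // Hex string: runs until the closing '>'.
                while (i < n && s.charAt(i) != '>') i++;
                i++;
            } else if (c == '[' || c == ']') {
                i++;                               // array delimiters are their own tokens
            } else if (c == '<' || c == '>') {
                i += 2;                            // dictionary delimiters << and >>
            } else {
                // Number, name (/F1) or operator (Tf, TJ, ...): read up to the next delimiter.
                i++;
                while (i < n && !Character.isWhitespace(s.charAt(i))
                        && "()<>[]/".indexOf(s.charAt(i)) < 0) i++;
            }
            tokens.add(s.substring(start, Math.min(i, n)));
        }
        return tokens;
    }

    // Keeps only the tokens between BT and ET (inclusive); images and paths are dropped.
    static String keepTextObjects(String content) {
        StringBuilder out = new StringBuilder();
        boolean inText = false;
        for (String t : tokenize(content)) {
            if (t.equals("BT")) inText = true;
            if (inText) out.append(t).append(t.equals("ET") ? "\n" : " ");
            if (t.equals("ET")) inText = false;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // One image draw followed by an invisible (3 Tr) OCR-style text object.
        String page = "q 612 0 0 792 0 0 cm /Im0 Do Q\n"
                    + "BT 3 Tr /F1 12 Tf 72 700 Td [(Hello) -250 (ET BT)] TJ ET\n";
        System.out.print(keepTextObjects(page));
    }
}
```

Run on the sample above, the image draw (`/Im0 Do`) is dropped and only the `BT ... ET` block is printed, including the string `(ET BT)`, which is left alone because it is a string token and not an operator. Whether this keeps the final file under 4.2 MB depends mostly on the fonts that have to be carried over with the text, not on the operators themselves.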

Source: https://stackoverflow.com/questions/23685946/get-text-layer-of-a-pdf-as-is-and-pass-it-to-another-pdf

Tags

pdfbox