PDFBox retrieve text from overlapping boxes

♀尐吖头ヾ 提交于 2019-12-23 05:01:49

问题


I've had some success using the PDFTextStripperByArea class to retrieve text contained within a specified rectangle. However, some of the PDFs I an scraping have text that is in slightly different places from page to page. I'm looking for help in how to deal with this.

In the example below, I can open the PDF in Acrobat Edit mode and see multiple text boxes (outlines with thin grey lines). I have indicated two regions (purple and red) that I would like to extract text from. However, instead of just getting the text physically inside the rectangle, I'd like all the text from the overlapping text boxes.

Is there a way to do this?

来源:https://stackoverflow.com/questions/46719503/pdfbox-retrieve-text-from-overlapping-boxes

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!