Heading and sub-heading extraction from PDF

浪子不回头ぞ posted on 2019-12-12 18:41:51

Question


I am currently working on extracting text from PDF files. My current issue is distinguishing the headings and sub-headings in the extracted text. I am working with iTextSharp and using the bold text information to detect headings. The font size cannot always be trusted. I have also tried PDFBox.

1) Is there any method to identify headings and sub-headings in a PDF?

2) Does Adobe or PDF-XChange Editor provide an API for this?

For example:

I need to extract

"Tourism in 2040: Bringing an additional one million visitors per year to paradise" as heading

"Executive Summary" as sub-heading

Even though these can be extracted using the bold text information, that approach fails in a lot of cases. That is why I am looking for an API.
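
For reference, the bold-text approach described in the question can be sketched roughly as follows. This is only a minimal illustration, not iTextSharp or PDFBox code: it assumes the text has already been extracted into spans that carry a font size and a bold flag. The `Span` class, the `classify` function, the `size_ratio` threshold, and the sample font sizes and body sentences are all hypothetical. It combines the bold flag with font size relative to the body text instead of relying on either signal alone.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Span:
    """One extracted line of text; a hypothetical shape, not a library type."""
    text: str
    font_size: float
    bold: bool


def classify(spans, size_ratio=1.15):
    """Return one label per span: 'heading', 'sub-heading', or 'body'."""
    if not spans:
        return []

    # Assume the font size that covers the most characters is the body size.
    chars_per_size = Counter()
    for span in spans:
        chars_per_size[round(span.font_size, 1)] += len(span.text)
    body_size = chars_per_size.most_common(1)[0][0]

    def is_candidate(span):
        # A span is a heading candidate if it is bold or clearly larger than
        # body text. size_ratio is an arbitrary, illustrative threshold.
        return span.bold or span.font_size >= body_size * size_ratio

    candidate_sizes = {round(s.font_size, 1) for s in spans if is_candidate(s)}
    top_size = max(candidate_sizes, default=None)

    labels = []
    for span in spans:
        if not is_candidate(span):
            labels.append("body")
        elif round(span.font_size, 1) == top_size:
            # The largest candidate size is treated as the heading level.
            labels.append("heading")
        else:
            labels.append("sub-heading")
    return labels


# Sample input: the two titles come from the question; the font sizes and
# the body sentences are made up for illustration.
sample = [
    Span("Tourism in 2040: Bringing an additional one million visitors "
         "per year to paradise", 20.0, True),
    Span("Executive Summary", 14.0, True),
    Span("This report outlines how the islands could attract more visitors "
         "over the next two decades.", 10.0, False),
    Span("It also lists the main risks.", 10.0, False),
]

for span, label in zip(sample, classify(sample)):
    print(f"{label}: {span.text}")
```

A heuristic like this still mislabels bold phrases inside body text as sub-headings, which is the kind of failure that motivates looking for a structure-aware API.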

Source: https://stackoverflow.com/questions/53043279/heading-and-sub-heading-extraction-from-pdf
