How to extract data from a PDF file while keeping track of its structure?

醉话见心 2020-12-12 22:52

My objective is to extract the text and images from a PDF file while parsing its structure. The scope for parsing the structure is not exhaustive; I only need to be able to

6 answers
  •  青春惊慌失措
    2020-12-12 23:36

    Parsing a PDF into headings and their sub-content is very difficult (though not impossible), because PDFs come in many different layouts. I recently came across a tool named GROBID that can help in this scenario. It is not perfect, but with proper training it can accomplish this goal.

    GROBID is available as open source on GitHub.

    https://github.com/kermitt2/grobid
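
To give a rough idea of how this could look in practice, here is a minimal Python sketch, not an official client. It assumes a GROBID server is already running locally on port 8070 and exposes the `/api/processFulltextDocument` endpoint with the PDF sent in a multipart field named `input`; check the GROBID documentation for your version. The function names `fetch_tei` and `sections_from_tei` are made up for this illustration, and `requests` is a third-party package. The sketch only recovers section headings and paragraph text from the TEI XML that GROBID returns; it does not handle images.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI XML documents.
TEI = "http://www.tei-c.org/ns/1.0"


def fetch_tei(pdf_path, server="http://localhost:8070"):
    """Send a PDF to a GROBID server and return the TEI XML as bytes.

    Assumption: the server address and endpoint match your GROBID setup.
    """
    import requests  # third-party; imported here so parsing works without it

    with open(pdf_path, "rb") as f:
        resp = requests.post(
            f"{server}/api/processFulltextDocument",
            files={"input": f},
            timeout=120,
        )
    resp.raise_for_status()
    return resp.content


def sections_from_tei(tei_xml):
    """Return a list of {'title', 'paragraphs'} dicts, one per <div> in the body."""
    root = ET.fromstring(tei_xml)
    body = root.find(f".//{{{TEI}}}body")
    if body is None:
        return []

    sections = []
    for div in body.iter(f"{{{TEI}}}div"):
        head = div.find(f"{{{TEI}}}head")
        title = "".join(head.itertext()).strip() if head is not None else ""
        # Flatten inline markup (references, formulas) and normalise whitespace.
        paragraphs = [
            " ".join("".join(p.itertext()).split())
            for p in div.findall(f"{{{TEI}}}p")
        ]
        sections.append({"title": title, "paragraphs": paragraphs})
    return sections
```

Usage would be along the lines of `sections_from_tei(fetch_tei("paper.pdf"))`, where `paper.pdf` is a placeholder path. Since GROBID is trained mostly on scholarly articles, the quality of the detected headings on other kinds of PDFs may vary.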
