Ignore all data after References - Python

依然范特西╮ submitted on 2021-01-29 17:01:02

Question


I am working on a Python project in which I need to process data from PDF research papers. I can parse the papers, extract data from them, and identify sections using PyPDF2.

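import PyPDF2

# Read every page and concatenate the extracted text.
# These are the PyPDF2 1.x names; PyPDF2 3.x renames them to
# PdfReader, reader.pages and page.extract_text().
with open('fileName.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    pageCount = pdfReader.numPages
    text = ''

    for count in range(pageCount):
        pageObj = pdfReader.getPage(count)
        text += pageObj.extractText()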

Every paper has a References section at the end, which I am able to extract, but some papers also have content after References. It could be anything (text, images, or tables) and may or may not start with a heading. See this and this paper for reference.

Here is a portion of how I get the References and parse them, but the references now contain all sorts of unrelated data, and I'm stuck on how to separate the references from the extra content that follows them.

Any kind of help will be appreciated.
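
One possible starting point, offered as a rough sketch rather than a tested solution: split the extracted text on the last line that looks like a References heading, then stop at the first later line that looks like a new section. The function name `split_references` and both heading patterns are hypothetical and would need tuning per paper. The sketch also assumes the extracted text keeps headings on their own lines, which PyPDF2 output does not always do, and it does not handle trailing content that has no heading at all.

```python
import re

# Hypothetical heading patterns; real papers vary, so tune these lists.
REFERENCES_HEADING = re.compile(r'^\s*(references|bibliography)\s*$', re.IGNORECASE)
TRAILING_HEADING = re.compile(
    r'^\s*(appendix|appendices|supplementary|acknowledge?ments?|author biograph)',
    re.IGNORECASE,
)


def split_references(text):
    """Return (body, references, trailing) using simple line-based heuristics."""
    lines = text.splitlines()

    # Use the LAST line that looks like a References heading, since the word
    # can also appear earlier (for example in a table of contents).
    start = None
    for i, line in enumerate(lines):
        if REFERENCES_HEADING.match(line):
            start = i
    if start is None:
        return text, '', ''

    # The references end at the first later line that looks like a new section.
    end = len(lines)
    for j in range(start + 1, len(lines)):
        if TRAILING_HEADING.match(lines[j]):
            end = j
            break

    body = '\n'.join(lines[:start])
    references = '\n'.join(lines[start + 1:end])
    trailing = '\n'.join(lines[end:])
    return body, references, trailing
```

With the `text` variable built in the question, `body, references, trailing = split_references(text)` would leave the material after the references in `trailing`, where it can be ignored. For papers whose trailing content has no heading, an additional check (for example, dropping lines that do not look like a numbered or author-year citation) would still be needed.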

Source: https://stackoverflow.com/questions/62542857/ignore-all-data-after-references-python
