Converting a PDF to text/HTML in Python so I can parse it

梦毁少年i 2021-02-06 14:28

I have the following sample code, where I download a PDF from the European Parliament website for a given legislative proposal:

EDIT: I ended up just getting the link and

2 answers

遇见更好的自我 (OP)

2021-02-06 14:38
It's not exactly magic. I suggest
- downloading the PDF file to a temp directory,
- calling out to an external program to extract the text into a (temp) text file,
- reading the text file.
For command-line text-extraction utilities you have a number of possibilities, and there may be others not mentioned in the link (perhaps Java-based). Try them first to see whether they fit your needs. That is, try each step separately (finding the links, downloading the files, extracting the text) and then piece them together. To call the external program, use subprocess.Popen or subprocess.call().
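For what it's worth, here is a rough sketch of those three steps put together. It is only an illustration under some assumptions: it assumes the `pdftotext` utility (shipped with Poppler or Xpdf) is installed and on your PATH, and the function names and the `document.pdf` file name are made up for the example. Any other extractor that is invoked as `program <pdf> <txt>` could be passed in through `command` instead.

```python
import shutil
import subprocess
import tempfile
import urllib.request
from pathlib import Path


def download_pdf(url, dest_dir):
    """Download the file at `url` into `dest_dir` and return its path."""
    # "document.pdf" is an arbitrary name chosen for this sketch.
    pdf_path = Path(dest_dir) / "document.pdf"
    with urllib.request.urlopen(url) as response, open(pdf_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return pdf_path


def extract_text(pdf_path, command=("pdftotext", "-layout")):
    """Run an external extractor as `command <pdf> <txt>` and return the text."""
    txt_path = Path(pdf_path).with_suffix(".txt")
    # subprocess.call() blocks until the program exits and returns its exit code.
    exit_code = subprocess.call([*command, str(pdf_path), str(txt_path)])
    if exit_code != 0:
        raise RuntimeError(f"extractor exited with status {exit_code}")
    # Assumption: the extractor writes UTF-8; undecodable bytes are replaced.
    return txt_path.read_text(encoding="utf-8", errors="replace")


def pdf_url_to_text(url, command=("pdftotext", "-layout")):
    """Download a PDF to a temp directory, extract its text, and return it."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        pdf_path = download_pdf(url, tmp_dir)
        return extract_text(pdf_path, command)
```

You would call it as `text = pdf_url_to_text(pdf_url)`, where `pdf_url` is whatever link you scraped. The temp directory (with both the PDF and the text file) is deleted as soon as the text has been read, so keep a copy if you need the files later. Testing `download_pdf` and `extract_text` on their own first makes it easier to see which step is failing.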