I have the following sample code where I download a pdf from the European Parliament website on a given legislative proposal:
EDIT: I ended up just getting the link and
It's not exactly magic. I suggest
For text extraction command-line utilities you have a number of possibilities and there may be others not mentioned in the link (perhaps Java-based). Try them first to see if they fit your needs. That is, try each step separately (finding the links, downloading the files, extracting the text) and then piece them together. For calling out, use subprocess.Popen
or subprocess.call()
.