How can I get href links from HTML using Python?

自闭症患者 2020-11-27 03:25
import urllib2

website = "WEBSITE"
openwebsite = urllib2.urlopen(website)
# Read from the handle returned by urlopen (the original called an undefined name)
html = openwebsite.read()

print html

So far so good.

But I wa

10 Answers
  •  不知归路
    2020-11-27 03:51

    Using BS4 for this specific task seems overkill.

    Try instead:

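    import re
    import urllib2

    website = urllib2.urlopen('http://10.123.123.5/foo_images/Repo/')
    html = website.read()
    # Capture href values ending in .tgz or .tar.gz; [^"]* stays inside one attribute
    files = re.findall(r'href="([^"]*\.(?:tgz|tar\.gz))"', html)
    print sorted(files)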
    

    I found this nifty piece of code at http://www.pythonforbeginners.com/code/regular-expression-re-findall and it works quite well for me.

    I tested it only on my own scenario, extracting a list of files from a web folder that exposes its files and folders, and I got a sorted list of the files and folders under the URL.
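
For readers on Python 3, here is a minimal sketch of the same regex approach run against an inline sample instead of a live server. The sample HTML, the `ARCHIVE_HREF` pattern name, and the `find_archive_links` helper are all hypothetical and only illustrate the idea; a regex like this is only likely to hold up on simple, predictable listings such as the one described above.

```python
import re

# Hypothetical directory-listing HTML, standing in for the response body.
SAMPLE_HTML = """
<html><body>
<a href="../">Parent Directory</a>
<a href="beta-2.0.tar.gz">beta-2.0.tar.gz</a> <a href="alpha-1.0.tgz">alpha-1.0.tgz</a>
<a href="notes.txt">notes.txt</a>
</body></html>
"""

# [^"]* keeps each match inside one attribute value; \. matches a literal dot.
ARCHIVE_HREF = re.compile(r'href="([^"]*\.(?:tgz|tar\.gz))"')


def find_archive_links(html):
    """Return the sorted href values that end in .tgz or .tar.gz."""
    return sorted(ARCHIVE_HREF.findall(html))


print(find_archive_links(SAMPLE_HTML))
# → ['alpha-1.0.tgz', 'beta-2.0.tar.gz']
```

In Python 3, `urllib2` no longer exists; the page could instead be fetched with `urllib.request.urlopen(url).read().decode()` and the resulting string passed to the helper.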
