How can I get href links from HTML using Python?

自闭症患者 2020-11-27 03:25

import urllib2

website = "WEBSITE"
openwebsite = urllib2.urlopen(website)
html = openwebsite.read()

print html

So far so good.

But I wa

10 Answers

不知归路 (OP)

2020-11-27 03:51
Using BS4 for this specific task seems overkill.

Try instead:
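```
import re
import urllib2  # Python 2; in Python 3 this module is urllib.request

website = urllib2.urlopen('http://10.123.123.5/foo_images/Repo/')
html = website.read()
# Capture href values ending in .tgz or .tar.gz; [^"]* stays inside one attribute
files = re.findall(r'href="([^"]*\.(?:tgz|tar\.gz))"', html)
print sorted(files)
```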
I found this nifty piece of code on http://www.pythonforbeginners.com/code/regular-expression-re-findall and it works quite well for me.

I tested it only in my own scenario: extracting a list of files from a web folder that exposes the files and folders in it. The result was a sorted list of the files and folders under the URL.
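
For anyone who wants to try the pattern without a server to point it at, here is a minimal sketch that runs the same `re.findall` idea against an inline string. It assumes Python 3, and the sample HTML, the `ARCHIVE_HREF` name and the `extract_archive_links` helper are all made up for illustration; they are not from my actual setup.

```python
import re

# Hypothetical directory-listing HTML, standing in for the fetched page.
SAMPLE_HTML = '''
<a href="b-2.0.tar.gz">b-2.0.tar.gz</a> <a href="a-1.0.tgz">a-1.0.tgz</a>
<a href="notes.txt">notes.txt</a>
<a href="subdir/">subdir/</a>
'''

# [^"]* keeps each match inside a single attribute value,
# so two links on the same line are captured separately.
ARCHIVE_HREF = re.compile(r'href="([^"]*\.(?:tgz|tar\.gz))"')


def extract_archive_links(html):
    """Return the sorted href values that end in .tgz or .tar.gz."""
    return sorted(ARCHIVE_HREF.findall(html))


print(extract_archive_links(SAMPLE_HTML))
# → ['a-1.0.tgz', 'b-2.0.tar.gz']
```

This only handles double-quoted `href` attributes, which was enough for a plain directory listing; for arbitrary HTML a real parser is probably the safer choice.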