How can I get href links from HTML using Python?

自闭症患者 2020-11-27 03:25
import urllib2

website = "WEBSITE"
openwebsite = urllib2.urlopen(website)
html = openwebsite.read()  # read from the handle opened above

print html

So far so good.

But I wa

10 answers
  •  情歌与酒
    2020-11-27 03:56

    Try BeautifulSoup. This version targets Python 2 with BeautifulSoup 3:

    from BeautifulSoup import BeautifulSoup
    import urllib2
    import re
    
    html_page = urllib2.urlopen("http://www.yourwebsite.com")
    soup = BeautifulSoup(html_page)
    for link in soup.findAll('a'):
        print link.get('href')
    

    If you only want links that start with http://, filter on the href attribute with a compiled pattern (note that this pattern does not match https:// links):

    soup.findAll('a', attrs={'href': re.compile("^http://")})
    

    In Python 3 with BeautifulSoup 4 (bs4), the equivalent is:

    from bs4 import BeautifulSoup
    import urllib.request
    
    html_page = urllib.request.urlopen("http://www.yourwebsite.com")
    soup = BeautifulSoup(html_page, "html.parser")
    for link in soup.find_all('a'):
        print(link.get('href'))
    

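If installing BeautifulSoup is not an option, the same idea can be sketched with only the Python 3 standard library. This is a minimal sketch, not the answer's method: `LinkCollector`, `extract_links`, and `SAMPLE_HTML` are hypothetical names introduced for illustration, and the sample markup stands in for a page fetched with `urllib.request.urlopen`. The filter uses `^https?://` so that https links are kept as well, which the `^http://` pattern above would skip.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    """Return all href values found in <a> tags, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Hypothetical sample markup, used here instead of a live page.
SAMPLE_HTML = """
<p><a href="http://example.com/a">A</a>
<a href="https://example.com/b">B</a>
<a href="/relative/c">C</a>
<a name="anchor-only">no href</a></p>
"""

links = extract_links(SAMPLE_HTML)

# Keep only absolute links; this pattern accepts both http and https.
absolute = [link for link in links if re.match(r"^https?://", link)]

print(links)
print(absolute)
```

For a real page, the bytes returned by `urlopen(...).read()` would need to be decoded to `str` (for example with `.decode("utf-8", errors="replace")`, assuming the page is UTF-8) before being passed to `extract_links`.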