Extract all links from a web page using Python

死守一世寂寞 2020-12-28 11:03

Following the Introduction to Computer Science track at Udacity, I'm trying to make a Python script to extract links from a page; below is the code I used:
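
(A sketch reconstructed from context, not the asker's verbatim code: the string-scanning extractor taught in that course, with page never assigned before the final call, which is the bug the first answer identifies.)

    def get_next_target(page):
        # Find the next '<a href=' and return the URL between the quotes
        start_link = page.find('<a href=')
        if start_link == -1:
            return None, 0
        start_quote = page.find('"', start_link)
        end_quote = page.find('"', start_quote + 1)
        return page[start_quote + 1:end_quote], end_quote

    def print_all_links(page):
        while True:
            url, endpos = get_next_target(page)
            if url:
                print(url)
                page = page[endpos:]
            else:
                break

    print_all_links(page)  # fails: page was never assigned at this point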

I got the following error:

    NameError: name 'page' is not defined

3 Answers
  • 2020-12-28 11:37

    page is undefined, and that is the cause of the error.

    For web scraping like this, you can simply use BeautifulSoup:

    from bs4 import BeautifulSoup, SoupStrainer
    import requests

    url = "http://stackoverflow.com/"

    page = requests.get(url)
    data = page.text

    # Name the parser explicitly and, via SoupStrainer, parse only the <a> tags
    soup = BeautifulSoup(data, 'html.parser', parse_only=SoupStrainer('a'))

    for link in soup.find_all('a'):
        print(link.get('href'))
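
    If you want to skip anchors that have no href attribute at all, find_all can filter on the attribute's presence, so the loop never prints None:

    # href=True keeps only <a> tags that actually carry the attribute
    for link in soup.find_all('a', href=True):
        print(link['href'])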
    
  • 2020-12-28 11:39

    I'm a bit late here, but here's one way to get the links off a given page:

    from html.parser import HTMLParser
    import urllib.request


    class LinkScrape(HTMLParser):

        def handle_starttag(self, tag, attrs):
            # Called for every opening tag; only anchor tags matter here
            if tag == 'a':
                for attr in attrs:
                    if attr[0] == 'href':
                        link = attr[1]
                        # Crude absolute-URL filter: keep hrefs that mention 'http'
                        if link.find('http') >= 0:
                            print('- ' + link)


    if __name__ == '__main__':
        url = input('Enter URL > ')
        request_object = urllib.request.Request(url)
        page_object = urllib.request.urlopen(request_object)
        link_parser = LinkScrape()
        # Decode the raw bytes and run the parser over the whole document
        link_parser.feed(page_object.read().decode('utf-8'))
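
    The parser above only keeps links that mention http, so relative hrefs such as /questions are dropped. If you want those too, urllib.parse.urljoin can resolve them against the page URL. A minimal sketch, assuming a hypothetical variant of the class that takes the base URL:

    from html.parser import HTMLParser
    from urllib.parse import urljoin


    class AbsoluteLinkScrape(HTMLParser):
        # Hypothetical variant: resolves every href against a base URL
        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url

        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                for name, value in attrs:
                    if name == 'href':
                        print('- ' + urljoin(self.base_url, value))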
    
  • 2020-12-28 12:00

    You can find all the tags whose href attribute contains "http" in htmlpage. This can be done with BeautifulSoup's find_all method, passing attrs={'href': re.compile("http")}:

    import re
    from bs4 import BeautifulSoup

    # htmlpage is the HTML source of the page, already read into a string
    soup = BeautifulSoup(htmlpage, 'html.parser')
    links = []
    for link in soup.find_all(attrs={'href': re.compile("http")}):
        links.append(link.get('href'))

    print(links)
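
    Here htmlpage is assumed to be the page's HTML source held in a string; one way to obtain it (using requests, as in the first answer):

    import requests

    # Example URL; any page you want to scan works here
    htmlpage = requests.get("http://stackoverflow.com/").text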
    