Extract all links from a web page using Python

死守一世寂寞 2020-12-28 11:03

Following the Introduction to Computer Science track at Udacity, I'm trying to make a Python script to extract links from a page; below is the code I used:
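
(A sketch reconstructed from context, not the asker's verbatim code: the string-scanning extractor taught in that course, with page never assigned before the final call, which is the bug the first answer identifies.)

    def get_next_target(page):
        # Find the next '<a href=' and return the URL between the quotes
        start_link = page.find('<a href=')
        if start_link == -1:
            return None, 0
        start_quote = page.find('"', start_link)
        end_quote = page.find('"', start_quote + 1)
        return page[start_quote + 1:end_quote], end_quote

    def print_all_links(page):
        while True:
            url, endpos = get_next_target(page)
            if url:
                print(url)
                page = page[endpos:]
            else:
                break

    print_all_links(page)  # fails: page was never assigned at this point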

I got the following error:

    NameError: name 'page' is not defined

3 Answers
  • 2020-12-28 11:37

    page is undefined, and that is the cause of the error.

    For web scraping like this, you can simply use BeautifulSoup:

    from bs4 import BeautifulSoup, SoupStrainer
    import requests

    url = "http://stackoverflow.com/"

    page = requests.get(url)
    data = page.text

    # Name the parser explicitly and, via SoupStrainer, parse only the <a> tags
    soup = BeautifulSoup(data, 'html.parser', parse_only=SoupStrainer('a'))

    for link in soup.find_all('a'):
        print(link.get('href'))
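
    If you want to skip anchors that have no href attribute at all, find_all can filter on the attribute's presence, so the loop never prints None:

    # href=True keeps only <a> tags that actually carry the attribute
    for link in soup.find_all('a', href=True):
        print(link['href'])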
    
  • 2020-12-28 11:39

    I'm a bit late here, but here's one way to get the links off a given page:

    from html.parser import HTMLParser
    import urllib.request


    class LinkScrape(HTMLParser):

        def handle_starttag(self, tag, attrs):
            # Called for every opening tag; only anchor tags matter here
            if tag == 'a':
                for attr in attrs:
                    if attr[0] == 'href':
                        link = attr[1]
                        # Crude absolute-URL filter: keep hrefs that mention 'http'
                        if link.find('http') >= 0:
                            print('- ' + link)


    if __name__ == '__main__':
        url = input('Enter URL > ')
        request_object = urllib.request.Request(url)
        page_object = urllib.request.urlopen(request_object)
        link_parser = LinkScrape()
        # Decode the raw bytes and run the parser over the whole document
        link_parser.feed(page_object.read().decode('utf-8'))
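
    The parser above only keeps links that mention http, so relative hrefs such as /questions are dropped. If you want those too, urllib.parse.urljoin can resolve them against the page URL. A minimal sketch, assuming a hypothetical variant of the class that takes the base URL:

    from html.parser import HTMLParser
    from urllib.parse import urljoin


    class AbsoluteLinkScrape(HTMLParser):
        # Hypothetical variant: resolves every href against a base URL
        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url

        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                for name, value in attrs:
                    if name == 'href':
                        print('- ' + urljoin(self.base_url, value))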
    
  • 2020-12-28 12:00

    You can find all the tags whose href attribute contains "http" in htmlpage. This can be done with BeautifulSoup's find_all method, passing attrs={'href': re.compile("http")}:

    import re
    from bs4 import BeautifulSoup

    # htmlpage is the HTML source of the page, already read into a string
    soup = BeautifulSoup(htmlpage, 'html.parser')
    links = []
    for link in soup.find_all(attrs={'href': re.compile("http")}):
        links.append(link.get('href'))

    print(links)
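
    Here htmlpage is assumed to be the page's HTML source held in a string; one way to obtain it (using requests, as in the first answer):

    import requests

    # Example URL; any page you want to scan works here
    htmlpage = requests.get("http://stackoverflow.com/").text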
    