Following the Introduction to Computer Science track at Udacity, I'm trying to write a Python script to extract links from a page. Below is the code I used:
I got the fol
page is undefined, and that is the cause of the error.
For web scraping like this, you can simply use BeautifulSoup:
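from bs4 import BeautifulSoup
import requests

url = "http://stackoverflow.com/"
page = requests.get(url)  # fetch the page; this is what defines page
data = page.text
soup = BeautifulSoup(data, 'html.parser')  # name the parser explicitly to avoid a warning

# Print the href of every <a> tag (None if the tag has no href)
for link in soup.find_all('a'):
    print(link.get('href'))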
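The values returned by link.get('href') in the loop above can be None (an anchor without an href) or relative paths such as /questions. Here is a minimal sketch, using only the standard library, of cleaning up such a list. The helper name absolute_links and the sample values are made up for illustration:

```python
from urllib.parse import urljoin


def absolute_links(base_url, hrefs):
    """Drop missing hrefs and resolve relative ones against base_url."""
    result = []
    for href in hrefs:
        if not href:  # link.get('href') returns None when the attribute is absent
            continue
        result.append(urljoin(base_url, href))
    return result


# Hypothetical sample of what the loop above might collect
sample = ["/questions", None, "https://example.com/a", "tags?tab=name"]
print(absolute_links("http://stackoverflow.com/", sample))
```

In the BeautifulSoup loop you would append each link.get('href') to a list and pass that list to the helper together with url.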
I'm a bit late here, but here's one way to get the links off a given page:
from html.parser import HTMLParser
import urllib.request


class LinkScrape(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Called by the parser for every opening tag; attrs is a list of (name, value) pairs
        if tag == 'a':
            for attr in attrs:
                if attr[0] == 'href':
                    link = attr[1]
                    # Skip empty hrefs and print only links that contain 'http'
                    if link and link.find('http') >= 0:
                        print('- ' + link)


if __name__ == '__main__':
    url = input('Enter URL > ')
    request_object = urllib.request.Request(url)
    page_object = urllib.request.urlopen(request_object)
    link_parser = LinkScrape()
    # Assumes the page is UTF-8 encoded
    link_parser.feed(page_object.read().decode('utf-8'))
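If you would rather collect the links than print them, the same HTMLParser approach can store them in a list. The following is only a sketch under a few assumptions: the class name LinkCollector, the base_url argument, and the sample HTML string are all invented for the example, and relative links are resolved with urljoin instead of being filtered out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags instead of printing them."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            # value is None for an attribute written without a value
            if name == 'href' and value:
                self.links.append(urljoin(self.base_url, value))


# Hypothetical page source, so the example runs without a network connection
html = ('<p><a href="/about">About</a> '
        '<a href="http://example.com/x">X</a> '
        '<a name="top">no href</a></p>')
collector = LinkCollector('http://example.com/')
collector.feed(html)
print(collector.links)
```

Feeding a literal string keeps the parsing logic separate from the download step, which also makes it easier to test.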
You can find all tags in htmlpage that have an href attribute containing http. This can be done by using the find_all method from BeautifulSoup and passing attrs={'href': re.compile("http")}:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(htmlpage, 'html.parser')
links = []
for link in soup.find_all(attrs={'href': re.compile("http")}):
    links.append(link.get('href'))
print(links)
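One thing to keep in mind: according to the Beautiful Soup documentation, a compiled pattern is tested with its search() method, so re.compile("http") matches the text anywhere in the value, not only at the start. The sketch below uses plain re on a made-up list of href values to show the difference from an anchored pattern; the pattern ^https?:// is just one possible stricter choice.

```python
import re

loose = re.compile("http")          # matches "http" anywhere in the value
strict = re.compile(r"^https?://")  # matches only at the start of the value

# Hypothetical href values
hrefs = [
    "http://example.com/",
    "https://example.com/page",
    "/redirect?to=http://example.com/",
    "/about",
]

loose_hits = [h for h in hrefs if loose.search(h)]
strict_hits = [h for h in hrefs if strict.search(h)]
print(loose_hits)   # includes the /redirect link
print(strict_hits)  # only the two absolute URLs
```

If you only want absolute links, passing the anchored pattern to find_all in place of re.compile("http") should behave the same way.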