Question
I have to write a web crawler in Python. I don't know how to parse a page and extract the URLs from the HTML. Where should I go to learn how to write such a program?
In other words, is there a simple Python program that can be used as a template for a generic web crawler? Ideally it should use modules that are relatively simple to use, and it should include plenty of comments describing what each line of code is doing.
Answer 1:
Look at the example code below. The script fetches the HTML of a web page (here, the Python home page) and extracts all the links on that page. Hope this helps.
#!/usr/bin/env python
import requests
from bs4 import BeautifulSoup

url = "http://www.python.org"
response = requests.get(url)

# Parse the HTML, then turn it back into a normalized string we can scan
page = str(BeautifulSoup(response.content, "html.parser"))

def getURL(page):
    """
    :param page: html of web page (here: Python home page)
    :return: the first url in that page, and the position to resume scanning from
    """
    start_link = page.find("a href")
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1: end_quote]
    return url, end_quote

# Repeatedly take the first link off the front of the remaining page text
while True:
    url, n = getURL(page)
    page = page[n:]
    if url:
        print(url)
    else:
        break
Output:
/
#left-hand-navigation
#content-body
/search
/about/
/news/
/doc/
/download/
/getit/
/community/
/psf/
http://docs.python.org/devguide/
/about/help/
http://pypi.python.org/pypi
/download/releases/2.7.3/
http://docs.python.org/2/
/ftp/python/2.7.3/python-2.7.3.msi
/ftp/python/2.7.3/Python-2.7.3.tar.bz2
/download/releases/3.3.0/
http://docs.python.org/3/
/ftp/python/3.3.0/python-3.3.0.msi
/ftp/python/3.3.0/Python-3.3.0.tar.bz2
/community/jobs/
/community/merchandise/
/psf/donations/
http://wiki.python.org/moin/Languages
http://wiki.python.org/moin/Languages
http://www.google.com/calendar/ical/b6v58qvojllt0i6ql654r1vh00%40group.calendar.google.com/public/basic.ics
http://www.google.com/calendar/ical/j7gov1cmnqr9tvg14k621j7t5c%40group.calendar.google.com/public/basic.ics
http://pycon.org/#calendar
http://www.google.com/calendar/ical/3haig2m9msslkpf2tn1h56nn9g%40group.calendar.google.com/public/basic.ics
http://pycon.org/#calendar
http://www.psfmember.org
...
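Note that many of the printed links are relative paths rather than full URLs. As a small addition (not part of the original answer), here is a minimal sketch of resolving them against the base URL with the standard library's urljoin:

from urllib.parse import urljoin

base = "http://www.python.org"

# urljoin leaves absolute URLs untouched and resolves relative
# paths against the base, so both kinds are handled uniformly.
for link in ["/", "/search", "/about/", "http://docs.python.org/3/"]:
    print(urljoin(base, link))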
Answer 2:
You can use BeautifulSoup, as many have also stated. It can parse HTML, XML, etc. To see some of its features, see the documentation.
Example:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'http://www.google.co.in/'
html = urlopen(url).read()

soup = BeautifulSoup(html, "html.parser")
links = soup.find_all('a')

# Not every <a> tag carries an href, so guard against None
for tag in links:
    link = tag.get('href', None)
    if link is not None:
        print(link)
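A small variation (my addition, not from the original answer): BeautifulSoup's CSS selector support can skip anchors without an href in the first place, which removes the need for the None check:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.google.co.in/').read()
soup = BeautifulSoup(html, "html.parser")

# The attribute selector a[href] matches only <a> tags that
# actually carry an href, so each result can be indexed directly.
for tag in soup.select('a[href]'):
    print(tag['href'])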
Answer 3:
import re
import urllib.parse
import urllib.request

tocrawl = set(["http://www.facebook.com/"])
crawled = set()
keywordregex = re.compile(r'<meta\sname=["\']keywords["\']\scontent=["\'](.*?)["\']\s/>')
linkregex = re.compile(r'<a\s*href=[\'"](.*?)[\'"].*?>')

while True:
    try:
        crawling = tocrawl.pop()
        print(crawling)
    except KeyError:
        break  # nothing left to crawl
    url = urllib.parse.urlparse(crawling)
    try:
        response = urllib.request.urlopen(crawling)
    except Exception:
        continue
    msg = response.read().decode('utf-8', errors='replace')
    # Pull out the page title, if there is one
    startPos = msg.find('<title>')
    if startPos != -1:
        endPos = msg.find('</title>', startPos + 7)
        if endPos != -1:
            title = msg[startPos + 7:endPos]
            print(title)
    # Pull out the meta keywords, if present
    keywordlist = keywordregex.findall(msg)
    if len(keywordlist) > 0:
        keywordlist = keywordlist[0].split(", ")
        print(keywordlist)
    links = linkregex.findall(msg)
    crawled.add(crawling)
    # Turn relative links into absolute ones and queue anything new
    for link in links:
        if link.startswith('/'):
            link = 'http://' + url[1] + link
        elif link.startswith('#'):
            link = 'http://' + url[1] + url[2] + link
        elif not link.startswith('http'):
            link = 'http://' + url[1] + '/' + link
        if link not in crawled:
            tocrawl.add(link)
Reference: Python Web Crawler in Less Than 50 Lines (the linked site is slow or no longer works; it does not load for me).
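One more caveat (my observation, not from the referenced article): set.pop() removes an arbitrary element, so the crawl order above is unpredictable. A minimal sketch of a first-in, first-out frontier using collections.deque, with the visited set kept separately:

from collections import deque

tocrawl = deque(["http://www.facebook.com/"])
crawled = set()

while tocrawl:
    crawling = tocrawl.popleft()  # FIFO: breadth-first crawl order
    if crawling in crawled:
        continue
    crawled.add(crawling)
    # ... fetch the page and extract links as above, then:
    # for link in links:
    #     if link not in crawled:
    #         tocrawl.append(link)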
Answer 4:
You can use BeautifulSoup. Follow the documentation and see what matches your requirements. The documentation also contains code snippets showing how to extract URLs.
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")
soup.find_all('a')  # Finds all <a> tags in the document
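Note that find_all('a') returns the tag objects themselves, not the URLs. A short sketch (html_doc here is a made-up snippet) of pulling the href attribute out of each tag:

from bs4 import BeautifulSoup

html_doc = '<a href="/one">one</a> <a name="x">no href</a>'
soup = BeautifulSoup(html_doc, "html.parser")

# .get() returns None for anchors without an href attribute
for a in soup.find_all('a'):
    href = a.get('href')
    if href:
        print(href)  # prints: /one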
Answer 5:
For parsing pages, check out the BeautifulSoup module. It's simple to use and lets you parse pages that contain HTML. You can extract URLs from the HTML simply by looking up the <a> tags, for example with soup.find_all('a').
Don't use regular expressions for parsing HTML.
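A minimal sketch of that suggestion (my example, not part of the original answer):

from bs4 import BeautifulSoup

html = '<p><a href="/docs">docs</a> <a href="http://example.com">ext</a></p>'
soup = BeautifulSoup(html, "html.parser")

# href=True keeps only anchors that actually have an href attribute
urls = [a['href'] for a in soup.find_all('a', href=True)]
print(urls)  # ['/docs', 'http://example.com']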
Source: https://stackoverflow.com/questions/15517483/how-to-extract-urls-from-an-html-page-in-python