Question
I want to fetch the title of a webpage that I open using urllib2. What is the best way to parse the HTML and find what I need (for now only the <title> tag, but I might need more in the future)?
Is there a good parsing lib for this purpose?
Answer 1:
Yes, I would recommend BeautifulSoup.
If you're getting the title, it's simply:
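# Assumes BeautifulSoup 3; with bs4 use: from bs4 import BeautifulSoup
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(html)
myTitle = soup.html.head.title  # the <title> tag itself
or
myTitle = soup('title')  # a list of every <title> tag found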
Taken from the documentation.
It's very tolerant of bad markup and will parse the HTML however messy it is.
Answer 2:
Try Beautiful Soup:
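import urllib2
# Assumes BeautifulSoup 3; with bs4 use: from bs4 import BeautifulSoup
from BeautifulSoup import BeautifulSoup

url = 'http://www.example.com'
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)
title = soup.html.head.title
print title.contents  # a list of the tag's children, i.e. the title text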
Answer 3:
Why are you guys importing a whole extra library for one task? No regular expressions? Wasn't the request for urllib, not bs4 or mech, which are third party? To do it with the standard library, scan the HTML, match the string, then split on the '>' and '<' with re or whatever.
title = None
for line in html.splitlines():
    # Keep the first line that contains the opening tag.
    if '<title>' in line:
        title = str(line)
        break
That's Python 2, I think. The matched line still includes the tags, so you can strip them off.
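The standard-library idea above could look roughly like this with re. This is only a sketch, written for Python 3 rather than the Python 2 used elsewhere in this thread. The names TITLE_RE and extract_title are made up for illustration, and it assumes the page has already been decoded to a str (in Python 3, response.read() returns bytes). A regular expression like this handles a simple <title> element but is not a general HTML parser.

```python
import re

# Hypothetical pattern and helper names, chosen for this example only.
# Matches the first <title>...</title> pair, ignoring case and allowing
# attributes on the tag and newlines inside the title text.
TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html):
    """Return the text of the first <title> element, or None if absent."""
    match = TITLE_RE.search(html)
    if match is None:
        return None
    # Collapse runs of whitespace (including newlines) into single spaces.
    return " ".join(match.group(1).split())


sample = "<html><head><TITLE>\n  Example Domain\n</TITLE></head><body></body></html>"
print(extract_title(sample))
```

If the markup is messy or more than the title is needed later, a real parser such as Beautiful Soup is likely the safer choice.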
Answer 4:
Use Beautiful Soup.
import urllib2
from BeautifulSoup import BeautifulSoup

html = urllib2.urlopen("...").read()
soup = BeautifulSoup(html)
print soup.title.string
Source: https://stackoverflow.com/questions/1660302/python-fetching-title