regex pattern in python for parsing HTML title tags


I am learning to use both the re module and the urllib module in Python and attempting to write a simple web scraper. Here's the code I've writte

4 Answers

伪装坚强ぢ

2020-12-05 20:26
You are using a regular expression, and matching HTML with such expressions gets too complicated, too fast.

Use an HTML parser instead; Python has several to choose from. I recommend BeautifulSoup, a popular third-party library.

BeautifulSoup example:
```
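import urllib2  # Python 2 only; on Python 3 use urllib.request instead

from bs4 import BeautifulSoup

# `url` is the address of the page to scrape
response = urllib2.urlopen(url)
# Hand the charset declared in the HTTP headers to BeautifulSoup
soup = BeautifulSoup(response.read(), from_encoding=response.info().getparam('charset'))
title = soup.find('title').text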
```
Since a title tag itself doesn't contain other tags, you can get away with a regular expression here, but as soon as you try to parse nested tags, you will run into hugely complex issues.
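For comparison, here is a minimal parser-based sketch that needs no third-party package. It uses the standard library's html.parser module on Python 3; the TitleParser class, the extract_title helper, and the sample markup are illustrative names made up for this example, not part of the original question.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False   # True while we are between <title> and </title>
        self.done = False       # True once the first title has been closed
        self.title_parts = []   # Text chunks seen inside the title

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attributes are ignored here.
        if tag == 'title' and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title' and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.title_parts.append(data)


def extract_title(html):
    """Return the text of the first <title>, or '' if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return ''.join(parser.title_parts).strip()


# Hypothetical sample document for illustration.
sample = ('<html><head><title id="main">Example Page</title></head>'
          '<body><p>Hi</p></body></html>')
print(extract_title(sample))
```

The parser tracks state through start and end tag callbacks, so attributes on the title tag need no special handling.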

Your specific problem can be solved by matching additional characters within the title tag, optionally:
```
r'<title[^>]*>([^<]+)</title>'
```
Here [^>]* matches 0 or more characters that are not the closing > bracket. The '0 or more' lets the pattern match both a plain <title> tag and one that carries extra attributes.
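A quick way to check this is to run the pattern against a title tag with and without attributes. The sketch below is written for Python 3, and the two sample documents are made up for illustration.

```python
import re

# Pattern from the answer above: optional attributes, then the title text.
TITLE_RE = re.compile(r'<title[^>]*>([^<]+)</title>')

# Hypothetical sample documents, not taken from the original question.
plain = '<html><head><title>Plain Title</title></head></html>'
with_attrs = ('<html><head><title id="t" lang="en">Title With Attributes'
              '</title></head></html>')

for html in (plain, with_attrs):
    match = TITLE_RE.search(html)
    # search() returns None when nothing matches, so guard before group().
    print(match.group(1) if match else None)
```

Note that the pattern is case-sensitive as written; passing re.IGNORECASE to re.compile would also accept an uppercase TITLE tag.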

耶瑟儿～

2020-12-05 20:38
If you wish to identify all the HTML tags, you can use this:
```
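import re

# Matches only simple opening tags with lowercase names and no attributes, e.g. <title>
batRegex = re.compile(r'(<[a-z]*>)')
# `html` holds the page source as a string
print(batRegex.findall(html))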
```
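To see what this pattern does and does not pick up, here is a small Python 3 sketch run on a made-up string. Closing tags and tags with attributes are skipped, because the slash, spaces, and quotes fall outside [a-z].

```python
import re

tag_re = re.compile(r'(<[a-z]*>)')

# Hypothetical sample markup for illustration.
sample = ('<html><head><title id="t">Hi</title></head>'
          '<body><p>Text</p></body></html>')

# Only bare lowercase opening tags are returned.
found = tag_re.findall(sample)
print(found)  # ['<html>', '<head>', '<body>', '<p>']
```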

轻奢々

2020-12-05 20:39
It is recommended that you use Beautiful Soup or another parser to parse HTML, but if you really want to use a regex, the following pattern will do the job.

The regex code:
```
<title.*?>(.+?)</title>
```
How it works:

Produces:
```
['Google']
['Welcome to Facebook - Log In, Sign Up or Learn More']
['reddit: the front page of the internet']
```
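The code that generated this output is not shown. As a rough sketch of how the pattern can be applied, the Python 3 snippet below runs it with re.findall over two made-up strings that stand in for downloaded pages; the variable names and sample markup are assumptions.

```python
import re

# Pattern from this answer: lazy match of any attributes, then the title text.
TITLE_RE = re.compile(r'<title.*?>(.+?)</title>')

# Hypothetical stand-ins for downloaded page sources.
pages = [
    '<html><head><title>Google</title></head></html>',
    '<html><head><title id="pageTitle">Example Site - Log In</title></head></html>',
]

# findall() returns a list holding the captured group for every match.
results = [TITLE_RE.findall(page) for page in pages]
for result in results:
    print(result)
```

Because . does not match newlines by default, a title that spans several lines is missed unless re.DOTALL is passed to re.compile.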


予麋鹿

2020-12-05 20:44

You could scrape a bunch of titles with a couple of lines of gazpacho:

```
from gazpacho import Soup

urls = ["http://google.com", "https://facebook.com", "http://reddit.com"]

titles = []
for url in urls:
    soup = Soup.get(url)
    title = soup.find("title", mode="first").text
    titles.append(title)
```

This will output:

```
titles
['Google',
 'Facebook - Log In or Sign Up',
 'reddit: the front page of the internet']
```
