Regex within HTML tags

Question

I would like to parse the HD price from the following snippet of HTML. I only have fragments of the HTML code, so I cannot use an HTML parser for this.

<div id="left-stack">        
  <span>View In iTunes</span></a>
 <span class="price">£19.99</span>
 <ul class="list">
    <li>HD Version</li>

Basically, the rule would be: find the price that appears before the words "HD Version" (case insensitive). Here is what I have so far:

re.match(r'^(\d|.){1,6}...HD\sVersion', string)

How would I extract the value "19.99" from the above string?

Answer 1:

BeautifulSoup is very lenient about the HTML it parses, so you can use it on chunks or fragments of HTML too:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

data = u"""
<div id="left-stack">
  <span>View In iTunes</span></a>
 <span class="price">£19.99</span>
 <ul class="list">
    <li>HD Version</li>
"""

soup = BeautifulSoup(data, 'html.parser')  # name the parser explicitly
print(soup.find('span', class_='price').text[1:])  # drop the leading currency symbol

Prints:

19.99
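
The `[1:]` slice assumes the price text always starts with exactly one currency character. As a minimal sketch of a more tolerant alternative, the numeric part can be pulled out of the tag text and converted to a `Decimal`. The helper name `parse_price` is made up for illustration and is not part of BeautifulSoup; it only uses the standard library and expects the string you would get from `soup.find('span', class_='price').text`:

```python
import re
from decimal import Decimal


def parse_price(text):
    """Return the first decimal number found in text as a Decimal, or None."""
    match = re.search(r'\d+(?:\.\d+)?', text)
    return Decimal(match.group()) if match else None


print(parse_price('£19.99'))  # prints 19.99
```

Returning `None` when no number is present lets the caller decide how to handle a missing or malformed price instead of silently slicing the wrong characters.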

Answer 2:

You've asked for a regular expression here, but it's not the right tool for parsing HTML. Use BeautifulSoup for this.

>>> from bs4 import BeautifulSoup
>>> html = '''
<div id="left-stack">        
  <span>View In iTunes</span></a>
 <span class="price">£19.99</span>
 <ul class="list">
    <li>HD Version</li>'''
>>> soup = BeautifulSoup(html, 'html.parser')
>>> val = soup.find('span', {'class': 'price'}).text
>>> print(val[1:])
19.99

Answer 3:

You can still parse this with BeautifulSoup; you don't need the full HTML:

from bs4 import BeautifulSoup
html="""
<div id="left-stack">
  <span>View In iTunes</span></a>
 <span class="price">£19.99</span>
 <ul class="list">
    <li>HD Version</li>
"""

soup = BeautifulSoup(html, 'html.parser')
sp = soup.find(attrs={"class": "price"})
print(sp.text[1:])  # prints 19.99

Answer 4:

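The current BeautifulSoup answers take the first <span class="price"> they find, whether or not it belongs to the HD version. Starting from the "HD Version" item and walking back to its price is more precise: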

from bs4 import BeautifulSoup

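html = """<div id="left-stack">
 <span>View In iTunes</span></a>
 <span class="price">£19.99</span>
 <ul class="list">
    <li>HD Version</li>"""

# Parse the fragment first; calling soup('li') on a plain string would fail.
soup = BeautifulSoup(html, 'html.parser')

for hd_version in (tag for tag in soup('li') if tag.text.strip().lower() == 'hd version'):
    # The <li> sits inside the <ul>; the price <span> is the <ul>'s preceding sibling.
    price = hd_version.parent.find_previous_sibling('span', attrs={'class': 'price'}).text
    print(price[1:])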

In general, using regular expressions to parse a non-regular language like HTML is asking for trouble. Stick with an established parser.
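
For completeness, if a regular expression really is the only option, here is a rough sketch of one that works on the fragment from the question. It assumes the price always sits in a `<span class="price">` written exactly like that, and that "HD Version" follows it before any other price span; real pages may well break both assumptions. Note that `re.match` only matches at the very start of the string, which is one reason the attempt in the question finds nothing, so this uses `re.search`:

```python
import re

html = """<div id="left-stack">
  <span>View In iTunes</span></a>
 <span class="price">£19.99</span>
 <ul class="list">
    <li>HD Version</li>"""

# Capture the number inside a price span, then require "HD Version" to follow
# before any other price span starts (the negative lookahead keeps the match local).
pattern = re.compile(
    r'<span class="price">[^<\d]*(\d+(?:\.\d+)?)</span>'
    r'(?:(?!<span class="price">).)*?'
    r'HD\s+Version',
    re.IGNORECASE | re.DOTALL,
)

match = pattern.search(html)
hd_price = match.group(1) if match else None
print(hd_price)  # prints 19.99
```

`re.DOTALL` lets `.` cross the line breaks between the price and the list item, and `re.IGNORECASE` covers the case-insensitive requirement. Any change in attribute order, quoting, or nesting would defeat the pattern, which is the practical argument for the parser-based answers.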