Issue with HTML tags while scraping data using Beautiful Soup

Question

Common piece of code:

# -*- coding: cp1252 -*-
import csv
import urllib2
import sys
import time
from bs4 import BeautifulSoup
from itertools import islice

page = urllib2.urlopen('http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html').read()
soup = BeautifulSoup(page)
prices = soup.findAll('div', {"class": "price"})

After this I tried the following snippets to get the data:

Code 1:

for price in prices:
    print unicode(price.string).encode('utf8')

Output1: No output; the code runs without any error and prints nothing.

Code 2:

for price in prices:
    textcontent3= u' '.join(price.stripped_strings)
    if textcontent3:
        print textcontent3

Output2: No output again, same situation as in Output1.

Code 3:

for price in prices:
    fonttag = price.find('div')
    if fonttag is not None:
        print unicode(fonttag.string).encode('utf8').strip()

Output3: No output, same as in Output1

After this I tried printing the relevant part of the HTML:

Code 4:

print prices

Output4:

</span></div>, <div class="price">
<span id="price"><br/>
</span></div>, <div class="price">
<span id="price"><br/>
</span></div>]

As can be seen from Output4, the HTML that Beautiful Soup parses for me contains no price values, whereas on the web page the same structure looks like this:

<div class="price"><span id="price">49,90 €</span><br>einmalig</div>

Beautiful Soup is not extracting the price values shown on the HTML page, so I am not able to scrape the price data. Please help me solve this issue, and pardon my ignorance as I am new to programming.

Answer 1:

The page loads the prices from a large JavaScript structure, so they are not in the static HTML that urllib2 downloads. You can extract just that structure:

scripts = soup.find_all('script')
script = next(s.text for s in scripts if s.string and 'window.rates' in s.string)
datastring = script.split('phones=')[1].split(';window.')[0]

This results in a large JavaScript structure, starting with:

{sku844082:{name:"Samsung Galaxy SII",image:"/images/m677391_300468.jpg",deliveryTime:"Vorauss. verf&#252;gbar ab Anfang Januar",sku1444291:{p:"prod954312",e:"19.90"},sku1444286:{p:"prod954312",e:"19.90"},sku1444283:{p:"prod954312",e:"39.90"},sku1444275:{p:"prod954312",e:"59.90"},sku1104261:{p:"prod954312",e:"99.90"}},sku894279:{name:"BlackBerry Torch 9810",image:"/images/m727477_300464.jpg",deliveryTime:"Lieferbar innerhalb 48 Stunden",sku1444275:{p:"prod1004495",e:"179.90"},sku1104261:{p:"prod1004495",e:"259.90"},sku1444291:{p:"prod1004495",e:"29.90"},sku1444286:{p:"prod1004495",e:"29.90"},sku1444283:{p:"prod1004495",e:"49.90"}},sku864221:{name:"BlackBerry Bold 9900",image:"/images/m707491_300465.jpg",deliveryTime:"Lieferbar innerhalb 48 Stunden",sku1444275:{p:"prod974431",e:"129.90"},sku1104261:{p:"prod974431",e:"169.90"},sku1444291:{p:"prod974431",e:"49.90"},sku1444286:{p:"prod974431",e:"49.90"},sku1444283:{p:"prod974431",e:"89.90"}}

Unfortunately, that's not directly loadable with the json module: although it is valid JavaScript, the keys are not quoted, so it is not valid JSON. You'd need to use regular expressions to clean it up further, or grab the e:"someprice" information directly from that string.
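If only the prices are needed, grabbing them straight from the string might look like the sketch below. It assumes, going by the excerpt above, that every nested entry has the exact shape `skuNNN:{p:"...",e:"..."}` with `e` holding the price; the shortened `datastring` here is a stand-in for the real one.

```python
import re

# Shortened stand-in for the structure shown above; the real string is much longer.
datastring = ('{sku864221:{name:"BlackBerry Bold 9900",'
              'sku1444275:{p:"prod974431",e:"129.90"},'
              'sku1104261:{p:"prod974431",e:"169.90"}}}')

# Assumed entry shape: skuNNN:{p:"product code",e:"price"}
entry = re.compile(r'(sku\d+):\{p:"([^"]*)",e:"([^"]*)"\}')

# findall returns one (sku, product, price) tuple per nested entry.
found = entry.findall(datastring)
for sku, product, price in found:
    print('%s %s %s' % (sku, product, price))
```

Note that this flat list loses the link between a price and the phone it belongs to, which is one reason to parse the whole structure instead.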

Luckily the structure can be fixed with a small amount of regular expression magic:

import re
import json

datastring = re.sub(ur'([{,])([a-z]\w*):', ur'\1"\2":', datastring)
data = json.loads(datastring)
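To see what the substitution does, here is the same pattern applied to a shortened stand-in for the real string. Plain `r'...'` raw strings are used instead of `ur'...'` so the sketch also runs on Python 3, where the `ur` prefix is a syntax error.

```python
import json
import re

# Shortened stand-in for the real structure.
sample = ('{sku864221:{name:"BlackBerry Bold 9900",'
          'sku1444275:{p:"prod974431",e:"129.90"}}}')

# Wrap every bare key that follows a '{' or ',' in double quotes.
quoted = re.sub(r'([{,])([a-z]\w*):', r'\1"\2":', sample)
print(quoted)

# With the keys quoted, the json module accepts the string.
data = json.loads(quoted)
print(data['sku864221']['sku1444275']['e'])
```

The pattern only handles keys that start with a lowercase letter, and it would also rewrite text inside a string value if that value happened to contain a comma or brace directly followed by a word and a colon. It works for the data shown here, but it is not a general JavaScript-to-JSON converter.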

This gives you a large dictionary keyed by SKU. Each value is a dictionary with the phone's details plus nested SKU keys, each of which maps to a dict with a p product code and an e price:

>>> from pprint import pprint
>>> pprint(data['sku864221'])
{u'deliveryTime': u'Lieferbar innerhalb 48 Stunden',
 u'image': u'/images/m707491_300465.jpg',
 u'name': u'BlackBerry Bold 9900',
 u'sku1104261': {u'e': u'169.90', u'p': u'prod974431'},
 u'sku1444275': {u'e': u'129.90', u'p': u'prod974431'},
 u'sku1444283': {u'e': u'89.90', u'p': u'prod974431'},
 u'sku1444286': {u'e': u'49.90', u'p': u'prod974431'},
 u'sku1444291': {u'e': u'49.90', u'p': u'prod974431'}}
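From there, pulling out the prices is ordinary dictionary work. A rough sketch, using a hand-written subset of the parsed result; the `phone_prices` helper is just a name made up for this example, and it assumes every nested dict with an `e` key is a price entry:

```python
# Hand-written subset of the parsed dictionary shown above.
data = {
    'sku864221': {
        'name': 'BlackBerry Bold 9900',
        'deliveryTime': 'Lieferbar innerhalb 48 Stunden',
        'sku1444275': {'e': '129.90', 'p': 'prod974431'},
        'sku1104261': {'e': '169.90', 'p': 'prod974431'},
    },
}


def phone_prices(data):
    """Return sorted (phone name, nested SKU, price) tuples."""
    rows = []
    for phone in data.values():
        for key, value in phone.items():
            # Nested SKU entries are dicts with an 'e' price;
            # name, image and deliveryTime are plain strings.
            if isinstance(value, dict) and 'e' in value:
                rows.append((phone['name'], key, float(value['e'])))
    return sorted(rows)


rows = phone_prices(data)
for name, sku, price in rows:
    print('%s %s %.2f' % (name, sku, price))
```

The prices are stored as strings, so they are converted with `float()` here; for real money calculations `decimal.Decimal` would be the safer choice.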

Source: https://stackoverflow.com/questions/14121788/issue-with-html-tags-while-scraping-data-using-beautiful-soup

Tags

python-2.7

html-parsing

screen-scraping

beautifulsoup

html