Question
Common piece of code:
# -*- coding: cp1252 -*-
import csv
import urllib2
import sys
import time
from bs4 import BeautifulSoup
from itertools import islice
page = urllib2.urlopen('http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html').read()
soup = BeautifulSoup(page)
prices = soup.findAll('div', {"class": "price"})
After this, I tried the following snippets to get the data:
Code 1:
for price in prices:
    print unicode(price.string).encode('utf8')
Output1: No output; the code runs without any error and prints nothing.
Code 2:
for price in prices:
    textcontent3 = u' '.join(price.stripped_strings)
    if textcontent3:
        print textcontent3
Output2: No output again, same situation as in Output1.
Code 3:
for price in prices:
    fonttag = price.find('div')
    if fonttag is not None:
        print unicode(fonttag.string).encode('utf8').strip()
Output3: No output, same as in Output1
After this, I tried printing the relevant part of the HTML:
Code 4:
print prices
Output4:
</span></div>, <div class="price">
<span id="price"><br/>
</span></div>, <div class="price">
<span id="price"><br/>
</span></div>]
As can be seen from Output4, the HTML that Beautiful Soup scrapes for me contains no price value, whereas on the web page the same structure looks like this:
<div class="price"><span id="price">49,90 €</span><br>einmalig</div>
Beautiful Soup is not extracting the price values shown on the HTML page, so I am unable to scrape the price data. Please help me solve this issue, and pardon my ignorance, as I am new to programming.
Answer 1:
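The page uses a large JavaScript structure to load the prices. Beautiful Soup only parses the HTML as served and does not run scripts, which is why the price spans come back empty. You can load just that structure: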
scripts = soup.find_all('script')
script = next(s.text for s in scripts if s.string and 'window.rates' in s.string)
datastring = script.split('phones=')[1].split(';window.')[0]
This results in a large JavaScript structure, starting with:
{sku844082:{name:"Samsung Galaxy SII",image:"/images/m677391_300468.jpg",deliveryTime:"Vorauss. verfügbar ab Anfang Januar",sku1444291:{p:"prod954312",e:"19.90"},sku1444286:{p:"prod954312",e:"19.90"},sku1444283:{p:"prod954312",e:"39.90"},sku1444275:{p:"prod954312",e:"59.90"},sku1104261:{p:"prod954312",e:"99.90"}},sku894279:{name:"BlackBerry Torch 9810",image:"/images/m727477_300464.jpg",deliveryTime:"Lieferbar innerhalb 48 Stunden",sku1444275:{p:"prod1004495",e:"179.90"},sku1104261:{p:"prod1004495",e:"259.90"},sku1444291:{p:"prod1004495",e:"29.90"},sku1444286:{p:"prod1004495",e:"29.90"},sku1444283:{p:"prod1004495",e:"49.90"}},sku864221:{name:"BlackBerry Bold 9900",image:"/images/m707491_300465.jpg",deliveryTime:"Lieferbar innerhalb 48 Stunden",sku1444275:{p:"prod974431",e:"129.90"},sku1104261:{p:"prod974431",e:"169.90"},sku1444291:{p:"prod974431",e:"49.90"},sku1444286:{p:"prod974431",e:"49.90"},sku1444283:{p:"prod974431",e:"89.90"}}
Unfortunately, that's not directly loadable with the json module; although it is valid JavaScript, without quotes around the keys it is not valid JSON. You'd need to use regular expressions to clean that up further, or grab the e:"someprice" information directly from that string.
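As a rough sketch of the second option, a regular expression can pull the price values straight out of the string without parsing it. In the sample above the prices sit under the e keys (p holds product codes). The sample string below is a shortened copy of that structure, and extract_prices is a hypothetical helper name, not part of any library. The snippet is written to run on Python 2 or 3.

```python
import re

# Shortened sample of the structure shown above (assumption: the real
# datastring follows the same e:"<price>" pattern throughout).
sample = ('{sku864221:{name:"BlackBerry Bold 9900",'
          'sku1444275:{p:"prod974431",e:"129.90"},'
          'sku1104261:{p:"prod974431",e:"169.90"}}}')

def extract_prices(datastring):
    """Return every e:"..." value in document order, as strings."""
    # [{,] anchors the match to a key position, so an 'e:' that merely
    # ends a longer key name (such as name:) is not picked up.
    return re.findall(r'[{,]e:"([^"]*)"', datastring)

prices = extract_prices(sample)
print(prices)
```

This is quick, but it throws away which phone and tariff each price belongs to, so it only suits cases where the bare list of prices is enough.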
Luckily the structure can be fixed with a small amount of regular expression magic:
import re
import json
datastring = re.sub(ur'([{,])([a-z]\w*):', ur'\1"\2":', datastring)
data = json.loads(datastring)
This gives you a large dictionary keyed by SKU. Each value is a dictionary with the phone's details plus nested SKU entries holding p product codes and e prices:
>>> from pprint import pprint
>>> pprint(data['sku864221'])
{u'deliveryTime': u'Lieferbar innerhalb 48 Stunden',
 u'image': u'/images/m707491_300465.jpg',
 u'name': u'BlackBerry Bold 9900',
 u'sku1104261': {u'e': u'169.90', u'p': u'prod974431'},
 u'sku1444275': {u'e': u'129.90', u'p': u'prod974431'},
 u'sku1444283': {u'e': u'89.90', u'p': u'prod974431'},
 u'sku1444286': {u'e': u'49.90', u'p': u'prod974431'},
 u'sku1444291': {u'e': u'49.90', u'p': u'prod974431'}}
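Once the data is a dictionary, the prices can be flattened into rows, for example to write them out with the csv module. The following is a minimal sketch, assuming that every nested entry whose value is a dict with an e key is a tariff price, as in the sample above. The shortened sample string and the flatten_prices helper are illustrative and do not come from the page. The snippet should run on Python 2 or 3.

```python
import json
import re

# Shortened sample in the same shape as the page's structure (assumption).
sample = ('{sku864221:{name:"BlackBerry Bold 9900",'
          'deliveryTime:"Lieferbar innerhalb 48 Stunden",'
          'sku1444275:{p:"prod974431",e:"129.90"},'
          'sku1104261:{p:"prod974431",e:"169.90"}}}')

# Same key-quoting substitution as above, so json can parse the string.
data = json.loads(re.sub(r'([{,])([a-z]\w*):', r'\1"\2":', sample))

def flatten_prices(data):
    """Return sorted (phone name, tariff SKU, price) tuples."""
    rows = []
    for phone in data.values():
        for key, value in phone.items():
            # Only the nested SKU entries are dicts; name, image and
            # deliveryTime are plain strings and are skipped.
            if isinstance(value, dict) and 'e' in value:
                rows.append((phone['name'], key, float(value['e'])))
    return sorted(rows)

rows = flatten_prices(data)
for row in rows:
    print(row)
```

Note that the key-quoting regular expression also rewrites any text inside a string value that happens to look like ,word: or {word:, so it is worth checking the parsed result against the page if the data changes.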
Source: https://stackoverflow.com/questions/14121788/issue-with-html-tags-while-scraping-data-using-beautiful-soup