How to scrape product details on an Amazon webpage using BeautifulSoup [closed]

…衆ロ難τιáo~ submitted on 2019-12-23 04:37:10

Question


For the webpage http://www.amazon.com/Harry-Potter-Prisoner-Azkaban-Rowling/dp/0439136369/ref=pd_sim_b_2?ie=UTF8&refRID=1MFBRAECGPMVZC5MJCWG, how can I scrape the product details and output them as a dict in Python? In the case above, the dict output I want is:

Age Range: 9 - 12 years
Grade Level: 4 - 7
...
...

I'm new to BeautifulSoup and couldn't find a good example of how to do this. I would like an example to follow.


Answer 1:


The idea is to iterate over all Product Details items using the CSS selector table#productDetailsTable div.content ul li, then use the bold text of each item as the key and its next sibling as the value:

from pprint import pprint
from bs4 import BeautifulSoup
import requests

url = 'http://www.amazon.com/dp/0439136369'
response = requests.get(url, headers={'User-agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36'})

soup = BeautifulSoup(response.content, 'html.parser')  # name the parser explicitly to avoid bs4's "no parser specified" warning
tags = {}
for li in soup.select('table#productDetailsTable div.content ul li'):
    try:
        title = li.b
        key = title.text.strip().rstrip(':')
        value = title.next_sibling.strip()

        tags[key] = value
    except AttributeError:
        break

pprint(tags)

Prints:

{
    u'Age Range': u'9 - 12 years',
    u'Amazon Best Sellers Rank': u'#1,440 in Books (',
    u'Average Customer Review': u'',
    u'Grade Level': u'4 - 7',
    u'ISBN-10': u'0439136369',
    u'ISBN-13': u'978-0439136365',
    u'Language': u'English',
    u'Lexile Measure': u'880L',
    u'Mass Market Paperback': u'448 pages',
    u'Product Dimensions': u'1.2 x 5.2 x 7.8 inches',
    u'Publisher': u'Scholastic Paperbacks (September 11, 2001)',
    u'Series': u'Harry Potter (Book 3)',
    u'Shipping Weight': u'11.2 ounces ('
}

Note that we break out of the loop as soon as we hit an AttributeError. This happens when there is no bold text inside the li element.
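The key/value extraction described above can be tried offline. The following is a minimal sketch, not the answer's exact code: SAMPLE_HTML is hand-written, hypothetical markup that only imitates the structure the selector expects, and parse_details is an illustrative helper name. It also differs from the answer in one respect: it skips an li without bold text (continue) instead of stopping at it (break).

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written HTML that imitates the structure the selector expects.
SAMPLE_HTML = """
<table id="productDetailsTable"><tr><td><div class="content"><ul>
  <li><b>Age Range:</b> 9 - 12 years</li>
  <li><b>Grade Level:</b> 4 - 7</li>
  <li>No bold label here</li>
  <li><b>Language:</b> English</li>
</ul></div></td></tr></table>
"""


def parse_details(html):
    """Return a dict of label -> value for every <li> that has a <b> label."""
    soup = BeautifulSoup(html, "html.parser")
    details = {}
    for li in soup.select("table#productDetailsTable div.content ul li"):
        label = li.b
        if label is None:
            continue  # skip items without a bold label instead of stopping
        key = label.get_text().strip().rstrip(":")
        sibling = label.next_sibling
        # next_sibling may be None or another tag; only text nodes are stripped
        value = sibling.strip() if isinstance(sibling, str) else ""
        details[key] = value
    return details


details = parse_details(SAMPLE_HTML)
print(details)
```

Skipping rather than breaking is a design choice: it keeps any labelled items that come after an unlabelled one, at the cost of silently ignoring the unlabelled item.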




Answer 2:


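from bs4 import BeautifulSoup
import urllib2  # Python 2 only; in Python 3 the equivalent module is urllib.request

headers = {'User-agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36'}
url = 'http://www.amazon.com/dp/0439136369'
# Send the dict as request headers. Passing it url-encoded as the second
# positional argument (data) would turn the request into a POST and would
# not set the User-agent header at all.
req = urllib2.Request(url, headers=headers)
soup = BeautifulSoup(urllib2.urlopen(req).read(), 'html.parser')
for x in soup.find_all('table', id='productDetailsTable'):
    for tag in x.find_all('li'):
        print(tag.get_text())  # print the text of each detail item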

The code above extracts the text from the table. I haven't formatted it for printing or put it into a dict, since you said you only need a little help. Here is what the code does: I had to change the User-Agent because Amazon was not allowing Python's default user agent. Using find_all, I find the table with id='productDetailsTable', then loop over it to find all li tags, since all the information is stored in those tags.
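To get from the extracted text to the dict the question asks for, one option is to split each line on its first colon. This is a minimal sketch under assumptions: lines_to_dict is a hypothetical helper, and the sample strings stand in for the tag.get_text() results rather than real page output.

```python
def lines_to_dict(lines):
    """Split each 'Label: value' line on its first colon and collect the pairs."""
    details = {}
    for line in lines:
        # Collapse the runs of whitespace and newlines that get_text() can leave behind.
        text = " ".join(line.split())
        key, sep, value = text.partition(":")
        if not sep or not key:
            continue  # no colon (or empty label): not a 'Label: value' line
        details[key.strip()] = value.strip()
    return details


# Hypothetical strings standing in for the tag.get_text() results.
sample = ["Age Range: 9 - 12 years", "  Grade Level:\n 4 - 7 ", "no label here"]
print(lines_to_dict(sample))
```

Because partition splits only on the first colon, any colon inside the value is kept as part of the value.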



Source: https://stackoverflow.com/questions/26682768/how-to-scrape-product-details-on-amazon-webpage-using-beautifulsoup
