Scrape a table looping in specific dates using Beautiful Soup

Submitted by 半腔热情 on 2019-12-23 02:53:11

Question


I have been driving myself up the wall trying to scrape the historical coffee prices I need from the table found here using BeautifulSoup: http://www.investing.com/commodities/us-coffee-c-historical-data

I am trying to pull a market week's worth of prices, from 04-04-16 to 04-08-2016.

My ultimate goal is to scrape the entire table for those dates, pulling all columns from Date to Change %.

My first step was to create a dictionary of the dates I want, using the date format used in the element:

dates={1 : "Apr 04, 2016",
  2 : "Apr 05, 2016",
  3 : "Apr 06, 2016",
  4 : "Apr 07, 2016",
  5 : "Apr 08, 2016"}
dates

Next I want to scrape the table, but I can't get it to loop through the dates as needed, so I have tried to pull the individual elements:

import requests
from bs4 import BeautifulSoup

url = "http://www.investing.com/commodities/us-coffee-c-historical-data"
page  = requests.get(url).text
soup_coffee = BeautifulSoup(page)

coffee_table = soup_coffee.find("table", class_="genTbl closedTbl historicalTbl")
coffee_titles = coffee_table.find_all("th", class_="noWrap")

for coffee_title in coffee_titles:
  price = coffee_title.find("td", class_="greenfont")
  print(price)

except the values that are returned are:

None
None
None
None
None
None
None

Firstly, why am I getting a "None" value? I have a feeling it has to do with the coffee_titles part of my code, which is not recognizing the column titles correctly.

Secondly, is there an efficient way for me to scrape the entire table using my date range in the dates dictionary?

Any suggestions would be greatly appreciated.


Answer 1:


Your code fails because you are looking for td tags inside the header (th) tags. If you print coffee_titles, it is pretty clear why you see None:

[<th class="first left noWrap">Date</th>, <th class="noWrap">Price</th>, <th class="noWrap">Open</th>, <th class="noWrap">High</th>, <th class="noWrap">Low</th>, <th class="noWrap">Vol.</th>, <th class="noWrap">Change %</th>]

There are no td tags.

To get all the table data, you can pull the dates from the table and use them as keys:

import requests
from bs4 import BeautifulSoup
from collections import OrderedDict

r = requests.get("http://www.investing.com/commodities/us-coffee-c-historical-data")
od = OrderedDict()
soup = BeautifulSoup(r.content, "lxml")

# select the table
table = soup.select_one("table.genTbl.closedTbl.historicalTbl")

# all col names
cols = [th.text for th in table.select("th")[1:]]
# get all rows bar the first i.e the headers
for row in table.select("tr + tr"):
    # get all the data including the date
    data = [td.text for td in row.select("td")]
    # use date as the key and store list of values
    od[data[0]] = dict(zip(cols,  data[1:]))


from  pprint import pprint as pp

pp(dict(od))

Output:

{u'Jun 01, 2016': {u'Change %': u'0.29%',
                   u'High': u'123.10',
                   u'Low': u'120.85',
                   u'Open': u'121.50',
                   u'Price': u'121.90',
                   u'Vol.': u'18.55K'},
                   u'High': u'123.10',
                   u'Low': u'120.85',
                   u'Open': u'121.50',
                   u'Price': u'121.90',
                   u'Vol.': u'18.55K'},
 u'Jun 02, 2016': {u'Change %': u'0.90%',
                   u'High': u'124.40',
                   u'Low': u'122.15',
                   u'Open': u'122.50',
                   u'Price': u'123.00',
                   u'Vol.': u'22.11K'},
 u'Jun 03, 2016': {u'Change %': u'3.33%',
                   u'High': u'127.40',
                   u'Low': u'122.50',
                   u'Open': u'122.60',
                   u'Price': u'127.10',
                   u'Vol.': u'28.47K'},
 u'Jun 06, 2016': {u'Change %': u'3.62%',
                   u'High': u'132.05',
                   u'Low': u'127.10',
                   u'Open': u'127.30',
                   u'Price': u'131.70',
                   u'Vol.': u'30.65K'},
 u'May 09, 2016': {u'Change %': u'2.49%',
                   u'High': u'126.60',
                   u'Low': u'123.28',
                   u'Open': u'125.65',
                   u'Price': u'126.53',
                   u'Vol.': u'-'},
 u'May 10, 2016': {u'Change %': u'0.29%',
                   u'High': u'125.90',
                   u'Low': u'125.90',
                   u'Open': u'125.90',
                   u'Price': u'126.90',
                   u'Vol.': u'0.01K'},
 u'May 11, 2016': {u'Change %': u'2.26%',
                   u'High': u'129.77',
                   u'Low': u'126.88',
                   u'Open': u'128.60',
                   u'Price': u'129.77',
                   u'Vol.': u'-'},
 u'May 12, 2016': {u'Change %': u'-1.21%',
                   u'High': u'128.75',
                   u'Low': u'127.30',
                   u'Open': u'128.75',
                   u'Price': u'128.20',
                   u'Vol.': u'0.01K'},
 u'May 13, 2016': {u'Change %': u'0.47%',
                   u'High': u'127.85',
                   u'Low': u'127.80',
                   u'Open': u'127.85',
                   u'Price': u'128.80',
                   u'Vol.': u'0.01K'},
 u'May 16, 2016': {u'Change %': u'3.03%',
                   u'High': u'131.95',
                   u'Low': u'128.75',
                   u'Open': u'128.75',
                   u'Price': u'132.70',
                   u'Vol.': u'0.01K'},
 u'May 17, 2016': {u'Change %': u'-0.64%',
                   u'High': u'132.60',
                   u'Low': u'132.60',
                   u'Open': u'132.60',
                   u'Price': u'131.85',
                   u'Vol.': u'-'},
 u'May 18, 2016': {u'Change %': u'-1.93%',
                   u'High': u'129.65',
                   u'Low': u'128.15',
                   u'Open': u'128.85',
                   u'Price': u'129.30',
                   u'Vol.': u'0.02K'},
 u'May 19, 2016': {u'Change %': u'-4.14%',
                   u'High': u'129.00',
                   u'Low': u'123.70',
                   u'Open': u'128.95',
                   u'Price': u'123.95',
                   u'Vol.': u'29.69K'},
 u'May 20, 2016': {u'Change %': u'0.61%',
                   u'High': u'125.95',
                   u'Low': u'124.25',
                   u'Open': u'124.75',
                   u'Price': u'124.70',
                   u'Vol.': u'15.54K'},
 u'May 23, 2016': {u'Change %': u'-2.04%',
                   u'High': u'124.70',
                   u'Low': u'122.00',
                   u'Open': u'124.50',
                   u'Price': u'122.15',
                   u'Vol.': u'15.89K'},
 u'May 24, 2016': {u'Change %': u'-0.29%',
                   u'High': u'123.30',
                   u'Low': u'121.55',
                   u'Open': u'122.45',
                   u'Price': u'121.80',
                   u'Vol.': u'15.06K'},
 u'May 25, 2016': {u'Change %': u'-0.33%',
                   u'High': u'122.95',
                   u'Low': u'121.20',
                   u'Open': u'122.45',
                   u'Price': u'121.40',
                   u'Vol.': u'18.11K'},
 u'May 26, 2016': {u'Change %': u'0.08%',
                   u'High': u'122.15',
                   u'Low': u'121.20',
                   u'Open': u'121.90',
                   u'Price': u'121.50',
                   u'Vol.': u'19.27K'},
 u'May 27, 2016': {u'Change %': u'-0.16%',
                   u'High': u'122.35',
                   u'Low': u'120.80',
                   u'Open': u'122.10',
                   u'Price': u'121.30',
                   u'Vol.': u'13.52K'},
 u'May 31, 2016': {u'Change %': u'0.21%',
                   u'High': u'123.90',
                   u'Low': u'121.35',
                   u'Open': u'121.55',
                   u'Price': u'121.55',
                   u'Vol.': u'23.62K'}}
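If you only need the market week from your dates dictionary, you can filter the scraped OrderedDict after the fact. A minimal sketch with hypothetical sample rows standing in for the scraped data:

```python
from collections import OrderedDict

# hypothetical sample mirroring the scraped structure: date -> column dict
od = OrderedDict([
    ("Apr 04, 2016", {"Price": "122.80"}),
    ("May 09, 2016", {"Price": "126.53"}),
    ("Apr 05, 2016", {"Price": "120.90"}),
])

# the dates you care about, in the same "Mon DD, YYYY" format as the table
wanted = {"Apr 04, 2016", "Apr 05, 2016"}

# keep only the rows whose date key is in the wanted set
filtered = OrderedDict((d, v) for d, v in od.items() if d in wanted)
print(list(filtered))  # ['Apr 04, 2016', 'Apr 05, 2016']
```

Note this only works if the dates you want are within the rows the default page returns; for arbitrary ranges, use the POST request shown below instead.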

Now, to get specific dates, we need to mimic an AJAX call with a POST to http://www.investing.com/instruments/HistoricalDataAjax:

import requests
from bs4 import BeautifulSoup
from collections import OrderedDict

# data to post
data = {"action": "historical_data",
        "curr_id": "8832",
        "st_date": "04/04/2016",
        "end_date": "04/08/2016",
        "interval_sec": "Daily"}

# add a user agent and specify that we are making an ajax request
head = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
        "X-Requested-With": "XMLHttpRequest"}

with requests.Session() as s:
    r = s.post("http://www.investing.com/instruments/HistoricalDataAjax", data=data, headers=head)
    od = OrderedDict()
    soup = BeautifulSoup(r.content, "lxml")

    table = soup.select_one("table.genTbl.closedTbl.historicalTbl")
    cols = [th.text for th in table.select("th")][1:]
    for row in table.select("tr + tr"):
        data = [td.text for td in row.select("td")]
        od[data[0]] = dict(zip(cols, data[1:]))

from pprint import pprint as pp

pp(dict(od))

Now we only get the date range from st_date to end_date:

{u'Apr 04, 2016': {u'Change %': u'-3.50%',
                   u'High': u'126.55',
                   u'Low': u'122.30',
                   u'Open': u'125.80',
                   u'Price': u'122.80',
                   u'Vol.': u'25.18K'},
 u'Apr 05, 2016': {u'Change %': u'-1.55%',
                   u'High': u'122.85',
                   u'Low': u'120.55',
                   u'Open': u'122.85',
                   u'Price': u'120.90',
                   u'Vol.': u'25.77K'},
 u'Apr 06, 2016': {u'Change %': u'0.50%',
                   u'High': u'122.15',
                   u'Low': u'120.00',
                   u'Open': u'121.45',
                   u'Price': u'121.50',
                   u'Vol.': u'17.94K'},
 u'Apr 07, 2016': {u'Change %': u'-1.40%',
                   u'High': u'122.60',
                   u'Low': u'119.60',
                   u'Open': u'122.35',
                   u'Price': u'119.80',
                   u'Vol.': u'32.69K'}}

You can see the POST request in Chrome developer tools under the XHR tab:
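Since the st_date/end_date values use the MM/DD/YYYY format, you can also build the payload for any week with datetime rather than hard-coding the strings. A sketch reusing the field names from the request above:

```python
from datetime import date

# the market week from the question: Mon Apr 4 to Fri Apr 8, 2016
start, end = date(2016, 4, 4), date(2016, 4, 8)

# same payload as above, with the dates formatted programmatically
data = {"action": "historical_data",
        "curr_id": "8832",
        "st_date": start.strftime("%m/%d/%Y"),
        "end_date": end.strftime("%m/%d/%Y"),
        "interval_sec": "Daily"}
print(data["st_date"], data["end_date"])  # 04/04/2016 04/08/2016
```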



Source: https://stackoverflow.com/questions/37668133/scrape-a-table-looping-in-specific-dates-using-beautiful-soup
