Scraping XML/JavaScript table with R [closed]

Posted by 半城伤御伤魂 on 2019-12-06 09:52:12

Question


I want to scrape a table like the one at http://www.oddsportal.com//hockey/usa/nhl/carolina-hurricanes-ottawa-senators-80YZhBGC/, specifically the bookmakers and the odds. The problem is that I don't know what kind of table it is or how to scrape it.

These threads might help (Scraping javascript with R, or What type of HTML table is this and what type of webscraping techniques can you use?), but I'd appreciate it if someone could point me in the right direction or, better yet, give instructions here.

So what kind of table is that odds table? Is it possible to scrape it with R, and if so, how?

Edit: I should have been clearer. I have been scraping data with R for some time and probably don't need help with the basics. On further inspection, the table is indeed generated by JavaScript; that is the problem I need help with.


Answer 1:


You can use Selenium and RSelenium to get the relevant data:

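library(RSelenium)
library(XML)  # provides readHTMLTable()

appURL <- "http://www.oddsportal.com//hockey/usa/nhl/carolina-hurricanes-ottawa-senators-80YZhBGC"
# Start a local Selenium server and open a browser session against it.
# Note: startServer() is defunct in later RSelenium releases, which provide rsDriver().
RSelenium::startServer()
remDr <- remoteDriver()
remDr$open()
# Load the page so the browser runs the JavaScript that builds the odds table.
remDr$navigate(appURL)
# `tbls` is not defined in this script, so it is presumably a global set by the page.
tblSource <- remDr$executeScript("return tbls[0].outerHTML;")[[1]]
# Parse the rendered table HTML into a list of data frames.
readHTMLTable(tblSource)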
> readHTMLTable(tblSource)
$`NULL`
     Bookmakers    1    X    2 Payout
1   bet-at-home 2.25 3.80 2.60  91.6%
2        bet365 2.29 3.79 2.64  92.7%
3       Betsson 2.35 3.75 2.65  93.5%
4          bwin 2.30 3.75 2.70  93.3%
5   MarathonBet 2.35 3.80 2.78  95.4%
6      Titanbet 2.30 3.95 2.50  91.9%
7       TonyBet 2.35 3.70 2.70  93.8%
8        Unibet 2.35 3.85 2.60  93.5%
9  William Hill 2.30 3.90 2.50  91.6%
10       Winner 2.30 3.95 2.50  91.9%
11       youwin 2.40 3.75 2.55  93.0%
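
A table read this way comes back as text, so the odds and payout columns usually need converting before any analysis. The following is a minimal sketch in base R, assuming every column arrives as character; it rebuilds three rows from the output above by hand so that it runs without a browser. The data frame name `odds` and the extra `PayoutCheck` column are illustrative, not part of the original answer.

```r
# Three rows copied from the output above, stored as text the way a
# scraped table typically arrives (the padding around bet365 is deliberate).
odds <- data.frame(
  Bookmakers = c("bet-at-home", " bet365  ", "MarathonBet"),
  `1` = c("2.25", "2.29", "2.35"),
  X = c("3.80", "3.79", "3.80"),
  `2` = c("2.60", "2.64", "2.78"),
  Payout = c("91.6%", "92.7%", "95.4%"),
  check.names = FALSE,       # keep "1" and "2" as column names
  stringsAsFactors = FALSE
)

# Trim padding around the bookmaker names.
odds$Bookmakers <- trimws(odds$Bookmakers)

# Convert the three odds columns to numeric.
odds_cols <- c("1", "X", "2")
odds[odds_cols] <- lapply(odds[odds_cols], as.numeric)

# Turn "91.6%" into the fraction 0.916.
odds$Payout <- as.numeric(sub("%", "", odds$Payout, fixed = TRUE)) / 100

# Sanity check: payout recomputed from the odds as 1 / (1/o1 + 1/oX + 1/o2).
odds$PayoutCheck <- 1 / rowSums(1 / odds[odds_cols])

print(odds)
```

For these rows the recomputed payout agrees with the scraped Payout column to within rounding, which is a quick way to confirm that the columns were parsed in the right order.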



Answer 2:


The "bookies" data comes from a request for a JavaScript callback resource:

GET /x/bookies-140619144601-1403252087.js HTTP/1.1
Host: rb.oddsportal.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:30.0) Gecko/20100101 Firefox/30.0
Accept: */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Referer: http://www.oddsportal.com//hockey/usa/nhl/carolina-hurricanes-ottawa-senators-80YZhBGC/
Connection: keep-alive

It returns a callback resource that has the bookie info, but no odds. There are other AJAX callback requests for the odds data, but you'll have to dig to find them.

Burp Proxy is a great way to see the URI calls, but DOM inspection (as @Spacedman suggested) should always be your first line of investigation.
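
Once such a callback resource has been downloaded, the usual next step is to strip the JavaScript function-call wrapper so the JSON inside can be parsed. The sketch below shows that step only. The answer does not show the body of the real bookies-*.js file, so the payload, the callback name `bookiesCallback`, and the `name` field are all hypothetical; `strip_callback` is a helper written for this example, and parsing relies on the jsonlite package if it is installed.

```r
# Hypothetical payload: the real response body is not shown in the answer,
# so the callback name and fields here are made up for illustration.
payload <- 'bookiesCallback({"16": {"name": "bet365"}, "2": {"name": "bwin"}});'

# Return the text between the first "(" and the last ")".
strip_callback <- function(js) {
  start <- as.integer(regexpr("(", js, fixed = TRUE))
  ends <- as.integer(gregexpr(")", js, fixed = TRUE)[[1]])
  if (start < 0 || ends[1] < 0) stop("no callback wrapper found")
  substr(js, start + 1, max(ends) - 1)
}

json_text <- strip_callback(payload)

# Parse the JSON only if jsonlite is available.
if (requireNamespace("jsonlite", quietly = TRUE)) {
  bookies <- jsonlite::fromJSON(json_text)
  # Collect the (hypothetical) name field of each entry.
  print(vapply(bookies, function(b) b$name, character(1)))
}
```

The same helper would apply to the other callback requests that carry the odds, provided they use the same wrapper style, which has to be checked against the actual responses.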



Source: https://stackoverflow.com/questions/24327980/scraping-xml-javascript-table-with-r
