问题
I want to scrape historical results of South African LOTTO draws (especially Total Pool Size, Total Sales, etc.) from the South African National Lottery website. By default one sees links to results for the last ten draws, or one can select a date range to pull up a larger set of links to draws (which will still display only ten per page).
Hovering in the browser over a link e.g. 'LOTTO DRAW 2012' we see javascript:void();
so it is clear that the draw results will be rendered using Javascript. Reading advice on an R Web Scraping Cheat Sheet, I realized that I needed to open Google Chrome Developer tools, then open Network tab, and then click the link to the draw 'LOTTO DRAW 2012'. When I did so, I could see that this url is being called with an initiator
When I right-click on the initiator and select 'Copy Response', I can see the data I need inside a 'drawDetails' object in what appears to be JSON code.
{"code":200,"message":"OK","data":{"drawDetails":{"drawNumber":"2012","drawDate":"2020\/04\/11","nextDrawDate":"2020\/04\/15","ball1":"48","ball2":"6","ball3":"43","ball4":"41","ball5":"25","ball6":"45","bonusBall":"38","div1Winners":"1","div1Payout":"10546013.8","div2Winners":"0","div2Payout":"0","div3Winners":"28","div3Payout":"7676.4","div4Winners":"62","div4Payout":"2751.4","div5Winners":"1389","div5Payout":"206.3","div6Winners":"1872","div6Payout":"133","div7Winners":"28003","div7Payout":"50","div8Winners":"20651","div8Payout":"20","rolloverAmount":"0","rolloverNumber":"0","totalPrizePool":"13280236.5","totalSales":"11610950","estimatedJackpot":"2000000","guaranteedJackpot":"0","drawMachine":"RNG2","ballSet":"RNG","status":"published","winners":52006,"millionairs":1,"gpwinners":"52006","wcwinners":"0","ncwinners":"0","ecwinners":"0","mpwinners":"0","lpwinners":"0","fswinners":"0","kznwinners":"0","nwwinners":"0"},"totalWinnerRecord":{"lottoMillionairs":28716702,"lottoWinners":337285646,"ithubaMillionairs":135763,"ithubaWinners":305615802}},"videoData":[{"id":"1049","listid":"1","parentid":"1","videosource":"youtube","videoid":"chHfFxVi9QI","imageurl":"","title":"LOTTO, LOTTO PLUS 1 AND LOTTO PLUS 2 DRAW 2012 (11 APRIL 2020)","description":"","custom_imageurl":"","custom_title":"","custom_description":"","specialparams":"","lastupdate":"0000-00-00 00:00:00","allowupdates":"1","status":"0","isvideo":"1","link":"https:\/\/www.youtube.com\/watch?v=chHfFxVi9QI","ordering":"10001","publisheddate":"2020-04-11 20:06:17","duration":"182","rating_average":"0","rating_max":"0","rating_min":"0","rating_numRaters":"0","statistics_favoriteCount":"0","statistics_viewCount":"329","keywords":"","startsecond":"0","endsecond":"0","likes":"6","dislikes":"0","commentcount":"0","channel_username":"","channel_title":"","channel_subscribers":"9880","channel_subscribed":"0","channel_location":"","channel_commentcount":"0","channel_viewcount":"0","channel_videocount":"1061","channel_description":"","channel_totaluploadviews":"0","alias":"lotto-lotto-plus-1-and-lotto-plus-2-draw-2012-11-april-2020","rawdata":"","datalink":"https:\/\/www.googleapis.com\/youtube\/v3\/videos?id=chHfFxVi9QI&part=id,snippet,contentDetails,statistics&key=AIzaSyC1Xvk2GUdb_N3UiFtjsgZ-uMviJ_8MFZI"}]}
It is a POST type request, and so I tried to follow this answer, but cannot find onclick
values indicating the data submitted with the form. Moreover, the request URL for 'LOTTO DRAW 2012' is identical to that for 'LOTTO DRAW 2011', so there is no unique identifier for the particular draw being passed with the URL itself. Thus it is not clear to me how the unique request for the results of a particular draw is made.
Hence, the smaller question is, given a particular LOTTO draw number or draw date, how does one find out the unique identifier that is used to make the POST request for the data pertaining to that draw specifically?
The larger question is, if one is able to obtain such unique identifiers for all the historical draws, how can one generate the JSON drawDetails object for all the historical draws in turn, or otherwise complete the scraping operation?
回答1:
You are right - the contents on the page are updated by javascript via an ajax request. The server returns a json string in response to an http POST request. With POST requests, the server's response is determined not only by the url you request, but by the body of the message you send to the server. In this case, your body is a simple form with 3 fields: gameName
, which is always LOTTO
, isAjax
which is always true
, and drawNumber
, which is the field you want to vary.
If you are using httr
, you specify these fields as a named list in the body
parameter of the POST
function.
Once you have the response for each draw, you will want to parse the json into an R-friendly format such as a list or data frame using a library such as jsonlite
. From looking at the structure of this particular json, it makes most sense to extract the component $data$drawDetails
and make that a one-row dataframe. This will allow you to bind several draws together into a single data frame.
Here is a function that does all that for you:
lotto_details <- function(draw_numbers)
{
do.call("rbind", lapply(draw_numbers, function(x)
{
res <- httr::POST(paste0("https://www.nationallottery.co.za/index.php",
"?task=results.redirectPageURL&",
"Itemid=265&option=com_weaver&",
"controller=lotto-history"),
body = list(gameName = "LOTTO", drawNumber = x, isAjax = "true"))
as.data.frame(jsonlite::fromJSON(httr::content(res, "text"))$data$drawDetails)
}))
}
Which you use like this:
lotto_details(2009:2012)
#> drawNumber drawDate nextDrawDate ball1 ball2 ball3 ball4 ball5 ball6
#> 1 2009 2020/04/01 2020/04/04 51 15 7 32 42 45
#> 2 2010 2020/04/04 2020/04/08 43 4 21 24 10 3
#> 3 2011 2020/04/08 2020/04/11 42 43 8 18 2 29
#> 4 2012 2020/04/11 2020/04/15 48 6 43 41 25 45
#> bonusBall div1Winners div1Payout div2Winners div2Payout div3Winners
#> 1 1 0 0 0 0 21
#> 2 22 0 0 0 0 31
#> 3 34 0 0 0 0 21
#> 4 38 1 10546013.8 0 0 28
#> div3Payout div4Winners div4Payout div5Winners div5Payout div6Winners
#> 1 8455.3 60 2348.7 1252 189 1786
#> 2 6004.3 71 2080.6 1808 137.3 2352
#> 3 8584.5 60 2384.6 1405 171.1 2079
#> 4 7676.4 62 2751.4 1389 206.3 1872
#> div6Payout div7Winners div7Payout div8Winners div8Payout rolloverAmount
#> 1 115.2 24664 50 19711 20 3809758.17
#> 2 91.7 35790 50 25981 20 5966533.86
#> 3 100.5 27674 50 21895 20 8055430.87
#> 4 133 28003 50 20651 20 0
#> rolloverNumber totalPrizePool totalSales estimatedJackpot
#> 1 2 6198036.67 9879655 6000000
#> 2 3 9073426.56 11696905 8000000
#> 3 4 10649716.37 10406895 10000000
#> 4 0 13280236.5 11610950 2000000
#> guaranteedJackpot drawMachine ballSet status winners millionairs
#> 1 0 RNG2 RNG published 47494 0
#> 2 0 RNG2 RNG published 66033 0
#> 3 0 RNG2 RNG published 53134 0
#> 4 0 RNG2 RNG published 52006 1
#> gpwinners wcwinners ncwinners ecwinners mpwinners lpwinners fswinners
#> 1 47494 0 0 0 0 0 0
#> 2 66033 0 0 0 0 0 0
#> 3 53134 0 0 0 0 0 0
#> 4 52006 0 0 0 0 0 0
#> kznwinners nwwinners
#> 1 0 0
#> 2 0 0
#> 3 0 0
#> 4 0 0
Created on 2020-04-13 by the reprex package (v0.3.0)
回答2:
The question already has a satisfactory answer (see above) that I've accepted. I simultaneously arrived at a nearly identical solution; I add it here only because it explicitly covers the full range of available draw numbers and will automatically detect the most recent draw number so that the code can be run 'as is' in the future, provided the National Lottery website design remains the same.
theurl <- "https://www.nationallottery.co.za/index.php?task=results.redirectPageURL&Itemid=265&option=com_weaver&controller=lotto-history"
x <- rvest::html_text(xml2::read_html(theurl))
preceding_string <- "LOTTO, LOTTO PLUS 1 AND LOTTO PLUS 2 DRAW "
drawnums <- as.integer(vapply(gregexpr(preceding_string, x)[[1]] + nchar(preceding_string),
function(k) substr(x, start = k, stop = k + 3), NA_character_))
drawnumrange <- 1506:max(drawnums)
response <- lapply(drawnumrange, function(d) httr::POST(url = theurl,
body = list(gameName = "LOTTO", drawNumber = as.character(d), isAjax =
"true"), encode = "form"))
jsondat <- lapply(response, function(r) jsonlite::parse_json(r)$data$drawDetails)
lottotable <- as.data.frame(do.call(rbind, jsondat))
numericcols <- c(1, 4:32, 36:37)
lottotable[numericcols] <- sapply(lottotable[numericcols], as.numeric)
xlsx::write.xlsx2(lottotable[1:37], "lottotable.xlsx", row.names = FALSE)
来源:https://stackoverflow.com/questions/61187053/scraping-javascript-rendered-content-in-r-from-a-webpage-without-unique-url