Question
I am trying to scrape this webpage: https://www.mustardbet.com/sports/events/302698
Since the webpage seems to be rendered dynamically, I am following this tutorial: https://www.datacamp.com/community/tutorials/scraping-javascript-generated-data-with-r#gs.dZEqev8
As the tutorial suggests, I save a file named "scrape_mustard.js" with the following code:
// scrape_mustard.js
var webPage = require('webpage');
var page = webPage.create();
var fs = require('fs');
var path = 'mustard.html';
page.open('https://www.mustardbet.com/sports/events/302698', function (status) {
    var content = page.content;   // rendered HTML of the page
    fs.write(path, content, 'w'); // save it to mustard.html
    phantom.exit();
});
Then, from R, I run
system("./phantomjs scrape_mustard.js")
but I get the error:
ReferenceError: Can't find variable: Set
https://www.mustardbet.com/assets/js/index.dfd873fb.js:1
https://www.mustardbet.com/assets/js/index.dfd873fb.js:1 in t
https://www.mustardbet.com/assets/js/index.dfd873fb.js:1
https://www.mustardbet.com/assets/js/index.dfd873fb.js:1 in t
https://www.mustardbet.com/assets/js/index.dfd873fb.js:1
https://www.mustardbet.com/assets/js/index.dfd873fb.js:1 in t
https://www.mustardbet.com/assets/js/index.dfd873fb.js:1
https://www.mustardbet.com/assets/js/index.dfd873fb.js:1 in t
https://www.mustardbet.com/assets/js/index.dfd873fb.js:1
Now, when I paste "https://www.mustardbet.com/assets/js/index.dfd873fb.js" into my browser I can see that it's javascript, and that I probably need to either (1) save that as a file, or (2) include it in scrape_mustard.js.
But if (1), I don't know how to then reference that new file, and if (2), I don't know how to define all that javascript properly so that it can be used.
I'm a complete newbie to javascript, but maybe this problem is not too difficult?
Thanks for your help!
Answer 1:
I was able to scrape the page using the Node.js module puppeteer.
Download Node.js from nodejs.org; it comes with npm, which makes installing modules much easier. You need to install puppeteer with npm.
In RStudio, make sure you are in your working directory when installing puppeteer. Once Node.js is installed, run:
system("npm i puppeteer")
scrape_mustard.js:
// load modules
const fs = require("fs");
const puppeteer = require("puppeteer");

// page url
const url = "https://www.mustardbet.com/sports/events/302698";

const scrape = async () => {
    const browser = await puppeteer.launch({headless: false}); // open browser
    const page = await browser.newPage(); // open new page
    await page.goto(url, {waitUntil: "networkidle2", timeout: 0}); // go to page
    await page.waitFor(5000); // give it time to load all the javascript-rendered content
    const html = await page.content(); // copy page contents
    await browser.close(); // close chromium
    return html; // return the html string
};

scrape().then((value) => {
    fs.writeFileSync("./stackoverflow/page.html", value); // write the html returned by scrape()
});
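If you would rather wait for the content itself instead of a fixed 5-second pause, puppeteer's waitForSelector can be used. This is only a sketch of a possible drop-in replacement for scrape() above (it reuses the same url and puppeteer require, and assumes the .odds-major class used in the rvest calls below is what you are after):

// variant of scrape(): wait until at least one odds element is in the DOM
const scrapeWhenReady = async () => {
    const browser = await puppeteer.launch({headless: false});
    const page = await browser.newPage();
    await page.goto(url, {waitUntil: "networkidle2", timeout: 0});
    await page.waitForSelector(".odds-major", {timeout: 30000}); // wait up to 30s for the odds to render
    const html = await page.content();
    await browser.close();
    return html;
};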
To run scrape_mustard.js in R:
library(magrittr)

# run the Node script, then parse the saved html with xml2/rvest
system("node ./stackoverflow/scrape_mustard.js")
html <- xml2::read_html("./stackoverflow/page.html")

oddsMajor <- html %>%
  rvest::html_nodes(".odds-major")  # the odds values
betNames <- html %>%
  rvest::html_nodes("h3")           # the player names
Console output:
> oddsMajor
{xml_nodeset (60)}
[1] <span class="odds-major">2</span>
[2] <span class="odds-major">14</span>
[3] <span class="odds-major">15</span>
[4] <span class="odds-major">16</span>
[5] <span class="odds-major">17</span>
[6] <span class="odds-major">23</span>
[7] <span class="odds-major">25</span>
[8] <span class="odds-major">32</span>
[9] <span class="odds-major">33</span>
[10] <span class="odds-major">39</span>
[11] <span class="odds-major">47</span>
[12] <span class="odds-major">54</span>
[13] <span class="odds-major">55</span>
[14] <span class="odds-major">58</span>
[15] <span class="odds-major">58</span>
[16] <span class="odds-major">64</span>
[17] <span class="odds-major">73</span>
[18] <span class="odds-major">73</span>
[19] <span class="odds-major">92</span>
[20] <span class="odds-major">98</span>
...
> betNames
{xml_nodeset (60)}
[1] <h3>Charles Howell III</h3>\n
[2] <h3>Brian Harman</h3>\n
[3] <h3>Austin Cook</h3>\n
[4] <h3>J.J. Spaun</h3>\n
[5] <h3>Webb Simpson</h3>\n
[6] <h3>Cameron Champ</h3>\n
[7] <h3>Peter Uihlein</h3>\n
[8] <h3>Seung-Jae Im</h3>\n
[9] <h3>Nick Watney</h3>\n
[10] <h3>Graeme McDowell</h3>\n
[11] <h3>Zach Johnson</h3>\n
[12] <h3>Lucas Glover</h3>\n
[13] <h3>Corey Conners</h3>\n
[14] <h3>Luke List</h3>\n
[15] <h3>David Hearn</h3>\n
[16] <h3>Adam Schenk</h3>\n
[17] <h3>Kevin Kisner</h3>\n
[18] <h3>Brian Gay</h3>\n
[19] <h3>Patton Kizzire</h3>\n
[20] <h3>Brice Garnett</h3>\n
...
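If you would rather pull the values out in Node instead of parsing the saved HTML in R, the same selectors can be evaluated inside puppeteer with page.$$eval. A rough sketch, to be placed inside scrape() after the waitFor call, using the .odds-major and h3 selectors from the rvest code above:

// extract text content directly in the browser context
const odds = await page.$$eval(".odds-major", els => els.map(el => el.textContent)); // odds values
const names = await page.$$eval("h3", els => els.map(el => el.textContent));         // player names
console.log(odds, names);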
I am sure it can be done with PhantomJS, but I've found puppeteer easier for scraping JavaScript-rendered webpages. The ReferenceError you saw means the JavaScript engine bundled with your PhantomJS build doesn't provide the ES2015 Set constructor that the site's index.dfd873fb.js expects, so the page's own script crashes before it can render anything; it is not about saving or including that file yourself. Also keep in mind that PhantomJS is no longer being developed.
Source: https://stackoverflow.com/questions/53339598/scraping-javascript-rendered-webpage-that-references-external-javascript-scripts