Web Scraping interactive map (javascript) with R and PhantomJS

Submitted by 你离开我真会死。 on 2021-01-28 08:10:29

Question


I am trying to scrape data from an interactive map (I want crime data for a county). I am using R (rvest) and also trying to use PhantomJS. I'm new to web scraping, so I don't yet understand how all the pieces fit together.

I believe the problem is that after I run PhantomJS and load the saved HTML with R's rvest package, I end up with more scripts and no clear data in the HTML. My code is below.

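# Write the PhantomJS script. Its first line holds a placeholder URL
# that js_scrape() overwrites before each run.
writeLines("var url = 'http://www.google.com';
var page = require('webpage').create();
var fs = require('fs');

page.open(url, function (status) {
    just_wait();
});

// Give the page 2.5 seconds to run its scripts, then save the rendered HTML.
function just_wait() {
    setTimeout(function() {
        fs.write('cool.html', page.content, 'w');
        phantom.exit();
    }, 2500);
}
", con = "scrape.js")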

A function that takes the URL I want to scrape:

js_scrape <- function(url = "https://gis.adacounty.id.gov/apps/crimemapper/",
                      js_path = "scrape.js",
                      phantompath = "/Users/alihoop/Documents/phantomjs/bin/phantomjs") {

  # Replace the URL on the first line of scrape.js with the one passed in
  lines <- readLines(js_path)
  lines[1] <- paste0("var url ='", url, "';")
  writeLines(lines, js_path)

  # Run PhantomJS on the updated script
  command <- paste(phantompath, js_path, sep = " ")
  system(command)
}

Running js_scrape() saves an HTML file named "cool.html":

js_scrape()

What I don't understand is what to do after the R code below:

library(rvest)

map_data <- read_html('cool.html') %>%
            html_nodes('script')

The HTML I get back from PhantomJS contains only more scripts. I am looking for help on how to proceed when (as far as I can tell) the JavaScript is nested inside other JavaScript.

Thank you!


Answer 1:


This site uses JavaScript to query the server. One solution is to reproduce the REST request and read the returned JSON directly. This avoids the need for PhantomJS.

Open your browser's developer tools and look through the XHR requests. You will find one or more requests named "query" with a link similar to: "https://gisapi.adacounty.id.gov/arcgis/rest/services/CrimeMapper/CrimeMapperWAB/FeatureServer/11/query?f=json&where=1%3D1&returnGeometry=true&spatialRel=esriSpatialRelIntersects&outFields=*&outSR=102100&resultOffset=0&resultRecordCount=1000"

Read this JSON response directly and convert it to a list with the jsonlite package:
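To see what such a link actually asks for, it can help to split its query string into named parameters. The following is only a minimal base R sketch; the variable names (query_url, params) are illustrative and not part of the original answer.

```r
# The query URL as seen in the browser's developer tools
query_url <- paste0(
  "https://gisapi.adacounty.id.gov/arcgis/rest/services/CrimeMapper/",
  "CrimeMapperWAB/FeatureServer/11/query?f=json&where=1%3D1",
  "&returnGeometry=true&spatialRel=esriSpatialRelIntersects",
  "&outFields=*&outSR=102100&resultOffset=0&resultRecordCount=1000"
)

# Keep only the part after the "?"
query_string <- sub("^[^?]*\\?", "", query_url)

# Split into "name=value" pairs, then into name and value
pairs <- strsplit(strsplit(query_string, "&", fixed = TRUE)[[1]], "=", fixed = TRUE)

# Decode percent-encoded values (for example, %3D is "=")
params <- vapply(pairs, function(p) utils::URLdecode(p[2]), character(1))
names(params) <- vapply(pairs, function(p) p[1], character(1))

print(params)
```

Decoded this way, the where parameter reads 1=1 (match every record), while resultOffset and resultRecordCount control which slice of records is returned.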

library(jsonlite)
output<-jsonlite::fromJSON("https://gisapi.adacounty.id.gov/arcgis/rest/services/CrimeMapper/CrimeMapperWAB/FeatureServer/11/query?f=json&where=1%3D1&returnGeometry=true&spatialRel=esriSpatialRelIntersects&outFields=*&outSR=102100&resultOffset=0&resultRecordCount=1000")
output$features
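With responses of this shape, output$features typically comes back as a data frame whose columns are themselves data frames (one for attributes, one for geometry). A minimal sketch of flattening it into ordinary columns with jsonlite::flatten() is shown below. The sample JSON and its field names (OBJECTID, Offense, x, y) are made up for illustration, so the real response may look different.

```r
library(jsonlite)

# Made-up sample that mimics the assumed structure of the server response
sample_json <- '{
  "features": [
    {"attributes": {"OBJECTID": 1, "Offense": "A"}, "geometry": {"x": 1.5, "y": 2.5}},
    {"attributes": {"OBJECTID": 2, "Offense": "B"}, "geometry": {"x": 3.5, "y": 4.5}}
  ]
}'

sample_output <- jsonlite::fromJSON(sample_json)

# features has two nested data frame columns: attributes and geometry
str(sample_output$features, max.level = 1)

# flatten() turns the nested columns into regular ones,
# named like attributes.Offense and geometry.x
flat <- jsonlite::flatten(sample_output$features)
print(flat)
```

The same call, jsonlite::flatten(output$features), should apply to the real response if it has this nested layout.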

The first number in the link (11 in this case, as in "FeatureServer/11/query?f=json") determines which crime type is queried. I found that it can take a value from 0 to 11: for example, 0 for arson, 4 for drugs, and 11 for vandalism.
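Since only the layer number changes between crime types, the URL can be assembled by a small helper. This is a rough sketch: build_query_url() and fetch_layer() are hypothetical names introduced here, and the other query parameters are simply copied from the link above.

```r
# Base of the feature service, copied from the query link above
base_url <- paste0(
  "https://gisapi.adacounty.id.gov/arcgis/rest/services/",
  "CrimeMapper/CrimeMapperWAB/FeatureServer"
)

# Hypothetical helper: build the query URL for one layer (0 to 11).
# offset and count map to resultOffset and resultRecordCount.
build_query_url <- function(layer, offset = 0, count = 1000) {
  paste0(
    base_url, "/", as.integer(layer), "/query",
    "?f=json&where=1%3D1&returnGeometry=true",
    "&spatialRel=esriSpatialRelIntersects&outFields=*&outSR=102100",
    "&resultOffset=", as.integer(offset),
    "&resultRecordCount=", as.integer(count)
  )
}

# Hypothetical helper: download one layer and return its features.
# Needs network access and the jsonlite package.
fetch_layer <- function(layer) {
  jsonlite::fromJSON(build_query_url(layer))$features
}

# Example, not run here: fetch every layer from 0 to 11
# all_layers <- lapply(0:11, fetch_layer)

print(build_query_url(4))
```

If a layer holds more than 1000 records, raising offset in steps of count should page through the rest, assuming the server honors resultOffset as the link suggests.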



Source: https://stackoverflow.com/questions/60694209/web-scraping-interactive-map-javascript-with-r-and-phantomjs
