Using rvest to grab data returns No matches

Posted by 梦想的初衷 on 2019-11-28 12:45:08

Question


I'm trying to grab some election results from Politico's website using rvest.

http://www.politico.com/2016-election/results/map/president/wisconsin/

I couldn't pull all the data on the page at once, so I went for a county-level approach. Each county has a unique CSS selector (e.g., Adams County's is '#countyAdams .results-table'). So I grabbed all the county names from elsewhere and set up a quick loop (yes, I know loops are bad practice in R, but I anticipated this method taking me about 3 minutes).

Grab the URL

library(rvest)

wiscoSixteen <- read_html("http://www.politico.com/2016-election/results/map/president/wisconsin")

Create an empty placeholder for the data.frame (and no, I didn't pre-define the columns)

stateDf <- NULL

Get the list of counties (this isn't complete, but we don't need all 72 counties to reach the point where the routine breaks)

wiscoCounties <- c("Adams", "Ashland", "Barron", "Bayfield", "Brown", "Buffalo", "Burnett", "Calumet", "Chippewa", "Clark", "Columbia", "Crawford", "Dane", "Dodge", "Door", "Douglas", "Dunn", "Eau Claire", "Florence", "Fond du Lac", "Forest", "Grant", "Green", "Green Lake", "Iowa", "Iron", "Jackson", "Jefferson", "Juneau")

My 'for' loop:

for (i in seq_along(wiscoCounties)) {

    # Build the selector for the i'th county, e.g. "#countyAdams .results-table"
    countySelector <- paste0("#county", wiscoCounties[i], " .results-table")
    wiscoResult <- wiscoSixteen %>% html_node(countySelector) %>% html_table()

    # Add a column for the county name so I can ID it later
    wiscoResult[,4] <- wiscoCounties[i]

    # Then rbind onto the running result
    stateDf <- rbind(stateDf, wiscoResult)
}

When it gets through the 10th county it stops and returns 'Error: No matches'.

Can't find anything unique about 'Columbia', the 11th county. At a loss for what's happening. I'm sure it's something stupid as that's usually the case. Any help is appreciated.


Answer 1:


So, why not just use the XHR requests that end up populating those tables (I'm kinda surprised you're getting any data at all from them since they get generated from a separate data request):

library(httr)
library(stringi)
library(purrr)
library(dplyr)

res <- GET("http://s3.amazonaws.com/origin-east-elections.politico.com/mapdata/2016/WI_20161108.xml")
dat <- readLines(textConnection(content(res, as="text")))

stri_split_fixed(dat[2], "|")[[1]] %>%
  stri_replace_last_fixed(";", "") %>% 
  stri_split_fixed(";", 3) %>% 
  map_df(~setNames(as.list(.), c("rep_id", "first", "last"))) -> candidates

dat[stri_detect_regex(dat, "^WI;P;G")] %>% 
  stri_replace_first_regex("^WI;P;G;", "") %>% 
  map_df(function(x) {

    county_results <- stri_split_fixed(x, "||", 2)[[1]]

    stri_replace_last_fixed(county_results[1], ";;", "") %>% 
      stri_split_fixed(";") %>% 
      map_df(~setNames(as.list(.), c("fips", "name", "x1", "reporting", "x2", "x3", "x4"))) -> county_prefix

    stri_split_fixed(county_results[2], "|")[[1]] %>% 
      stri_split_fixed(";") %>% 
      map_df(~setNames(as.list(.), c("rep_id", "party", "count", "pct", "x5", "x6", "x7", "x8", "candidate_idx"))) %>% 
      left_join(candidates, by="rep_id") -> df

    df$fips <- county_prefix$fips
    df$name <- county_prefix$name
    df$reporting <- county_prefix$reporting

    select(df, -starts_with("x"))

  }) -> results

It seems to be complete data:

glimpse(results)
## Observations: 511
## Variables: 10
## $ rep_id        <chr> "WI270631108", "WI270621108", "WI270691108", "WI270711108", "WI270701108", "WI270731108", "WI270721108",...
## $ party         <chr> "Dem", "GOP", "Lib", "CST", "ADP", "WW", "Grn", "Dem", "GOP", "Lib", "CST", "ADP", "WW", "Grn", "Dem", "...
## $ count         <chr> "1382210", "1409467", "106442", "12179", "1561", "1781", "30980", "3780", "5983", "207", "44", "4", "9",...
## $ pct           <chr> "46.9", "47.9", "3.6", "0.4", "0.1", "0.1", "1.1", "37.4", "59.2", "2.0", "0.4", "0.0", "0.1", "0.8", "5...
## $ candidate_idx <chr> "1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "4", "5", "6", "7",...
## $ first         <chr> "Clinton", "Trump", "Johnson", "Castle", "De La Fuente", "Moorehead", "Stein", "Clinton", "Trump", "John...
## $ last          <chr> "Hillary", "Donald", "Gary", "Darrell", "Rocky", "Monica", "Jill", "Hillary", "Donald", "Gary", "Darrell...
## $ fips          <chr> "0", "0", "0", "0", "0", "0", "0", "55001", "55001", "55001", "55001", "55001", "55001", "55001", "55003...
## $ name          <chr> "Wisconsin", "Wisconsin", "Wisconsin", "Wisconsin", "Wisconsin", "Wisconsin", "Wisconsin", "Adams", "Ada...
## $ reporting     <chr> "100.0", "100.0", "100.0", "100.0", "100.0", "100.0", "100.0", "100.0", "100.0", "100.0", "100.0", "100....
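Every column in that output comes back as character, so the vote counts and percentages need converting before any arithmetic. A minimal sketch, using a hypothetical `results_sample` data frame built from the first few values shown above rather than the real `results` object:

```r
# Stand-in for a few rows of `results`; values are copied from the
# glimpse() output above. `results_sample` is a hypothetical name.
results_sample <- data.frame(
  party = c("Dem", "GOP", "Lib"),
  count = c("1382210", "1409467", "106442"),
  pct   = c("46.9", "47.9", "3.6"),
  name  = "Wisconsin",
  stringsAsFactors = FALSE
)

# Convert the character columns to numeric types.
results_sample$count <- as.integer(results_sample$count)
results_sample$pct   <- as.numeric(results_sample$pct)

str(results_sample)
```

The same two assignments should work on the full `results` data frame, assuming every `count` and `pct` value is a plain number.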

Despite the ".xml" extension on the URL, it's not XML data. I also don't know what some of the columns actually are, but you can dig into that. Also, there's a whole other section of data:

WI;S;G;0;Wisconsin;X;100.0;X;;50885;;||WI269201108;Dem;1380496;46.8;;X;;;1|WI267231108;GOP;1479262;50.2;X;X;X;;2|WI270541108;Lib;87291;3.0;;X;;;3
WI;S;G;55001;Adams;X;100.0;X;;50885;;||WI269201108;Dem;4093;41.2;;X;;;1|WI267231108;GOP;5346;53.9;X;X;X;;2|WI270541108;Lib;486;4.9;;X;;;3
WI;S;G;55003;Ashland;X;100.0;X;;50885;;||WI269201108;Dem;4349;55.1;;X;;;1|WI267231108;GOP;3337;42.2;X;X;X;;2|WI270541108;Lib;214;2.7;;X;;;3
WI;S;G;55005;Barron;X;100.0;X;;50885;;||WI269201108;Dem;8691;38.8;;X;;;1|WI267231108;GOP;12863;57.4;X;X;X;;2|WI270541108;Lib;853;3.8;;X;;;3
WI;S;G;55007;Bayfield;X;100.0;X;;50885;;||WI269201108;Dem;5161;54.6;;X;;;1|WI267231108;GOP;4022;42.6;X;X;X;;2|WI270541108;Lib;263;2.8;;X;;;3
WI;S;G;55009;Brown;X;100.0;X;;50885;;||WI269201108;Dem;51004;40.0;;X;;;1|WI267231108;GOP;71750;56.3;X;X;X;;2|WI270541108;Lib;4615;3.6;;X;;;3
WI;S;G;55011;Buffalo;X;100.0;X;;50885;;||WI269201108;Dem;2746;39.9;;X;;;1|WI267231108;GOP;3850;56.0;X;X;X;;2|WI270541108;Lib;285;4.1;;X;;;3
WI;S;G;55013;Burnett;X;100.0;X;;50885;;||WI269201108;Dem;3143;37.4;;X;;;1|WI267231108;GOP;4998;59.5;X;X;X;;2|WI270541108;Lib;258;3.1;;X;;;3

which obviously means something for that page (presumably the Senate race, given the S where the presidential rows have a P; I'm so weary from the election that I'm kinda done with the data), and you can process it in similar fashion to what is above.
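As a rough illustration of processing those rows, here is a base-R sketch that parses two of the sample lines above without any network access. The function name `parse_line` and the result name `senate` are hypothetical, the latter resting on the assumption that S marks the Senate race, and the field positions follow the same layout the presidential code uses (fips, name, reporting, then one record per candidate).

```r
# Two lines copied from the sample above; in practice they would be
# selected from `dat` with a pattern such as "^WI;S;G".
lines <- c(
  "WI;S;G;55001;Adams;X;100.0;X;;50885;;||WI269201108;Dem;4093;41.2;;X;;;1|WI267231108;GOP;5346;53.9;X;X;X;;2|WI270541108;Lib;486;4.9;;X;;;3",
  "WI;S;G;55003;Ashland;X;100.0;X;;50885;;||WI269201108;Dem;4349;55.1;;X;;;1|WI267231108;GOP;3337;42.2;X;X;X;;2|WI270541108;Lib;214;2.7;;X;;;3"
)

# Hypothetical helper: turn one raw line into one row per candidate.
parse_line <- function(x) {
  # "||" separates the county prefix from the candidate records.
  halves <- strsplit(x, "||", fixed = TRUE)[[1]]
  prefix <- strsplit(halves[1], ";", fixed = TRUE)[[1]]
  # "|" separates candidates; ";" separates fields within a candidate.
  recs <- strsplit(strsplit(halves[2], "|", fixed = TRUE)[[1]], ";", fixed = TRUE)
  field <- function(i) vapply(recs, `[`, character(1), i)
  data.frame(
    fips      = prefix[4],
    name      = prefix[5],
    reporting = prefix[7],
    rep_id    = field(1),
    party     = field(2),
    count     = as.integer(field(3)),
    pct       = as.numeric(field(4)),
    stringsAsFactors = FALSE
  )
}

senate <- do.call(rbind, lapply(lines, parse_line))
print(senate)
```

The remaining fields are left out because their meaning is not known, just as the presidential code drops its `x`-prefixed columns.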



Source: https://stackoverflow.com/questions/40638511/using-rvest-to-grab-data-returns-no-matches
