scraping HTML data.table using rvest

北慕城南 提交于 2020-01-06 07:15:24

问题


I'm trying to scrape the "Fish Sampled" table data from Minnesota DNR using R rvest package. I used the chrome extension SelectorGadget to find the xpath for the table. I'm unable to get any table data from the webpage into R. Any help is appreciated

library(rvest)

urllakes<- read_html("http://www.dnr.state.mn.us/lakefind/showreport.html?
downum=27011700")

lakesnodes <- html_nodes(urllakes,xpath = '//*[(@id = "lake-survey")]')

html_table(lakesnodes,fill=TRUE) #Error: html_name(x) == "table" is not TRUE
html_text(lakesnodes) # [1] "" but no data is returned 

回答1:


Start a new tab. Open Developer Tools. Then, go to http://www.dnr.state.mn.us/lakefind/showreport.html?downum=27011700.

Go to the Network tab. Look for this:

That's your target. With the following, you can pass in a MN DNR URL or just the id at the end of the URL and get data back.

library(httr)
library(jsonlite)

read_lake_survey <- function(orig_url_or_id) {

  orig_url_or_id <- orig_url_or_id[1]

  if (grepl("^htt", orig_url_or_id)) {
    tmp <- httr::parse_url(orig_url_or_id)
    if (!is.null(tmp$query$downum)) {
      orig_url_or_id <- tmp$query$downum
    } else {
      stop("Invalid URL specified", call.=FALSE)
    }
  }

  httr::GET(
    url = "http://maps2.dnr.state.mn.us/cgi-bin/lakefinder/detail.cgi",
    query = list(
      type = "lake_survey",
      callback = "",
      id = orig_url_or_id,
      `_` = as.numeric(Sys.time())
    )
  ) -> res

  httr::stop_for_status(res)

  out <- httr::content(res, as="text", encoding="UTF-8")
  out <- jsonlite::fromJSON(out, flatten=TRUE)
  out

}

Like so:

orig_url <- "http://www.dnr.state.mn.us/lakefind/showreport.html?downum=27011700"

str(read_lake_survey(orig_url), 2)
## List of 4
##  $ timestamp: int 1506900750
##  $ status   : chr "SUCCESS"
##  $ result   :List of 13
##   ..$ averageWaterClarity: chr "7.0"
##   ..$ sampledPlants      : list()
##   ..$ officeCode         : chr "F314"
##   ..$ littoralAcres      : int 76
##   ..$ shoreLengthMiles   : num 2.45
##   ..$ areaAcres          : num 152
##   ..$ surveys            :'data.frame':  6 obs. of  52 variables:
##   ..$ accesses           :'data.frame':  1 obs. of  5 variables:
##   ..$ lakeName           : chr "Weaver"
##   ..$ DOWNumber          : chr "27011700"
##   ..$ waterClarity       : chr [1, 1:2] "07/14/2008" "7"
##   ..$ meanDepthFeet      : num 20.7
##   ..$ maxDepthFeet       : int 57
##  $ message  : chr "Normal execution."

str(read_lake_survey("27011700"), 2)
## List of 4
##  $ timestamp: int 1506900750
##  $ status   : chr "SUCCESS"
##  $ result   :List of 13
##   ..$ averageWaterClarity: chr "7.0"
##   ..$ sampledPlants      : list()
##   ..$ officeCode         : chr "F314"
##   ..$ littoralAcres      : int 76
##   ..$ shoreLengthMiles   : num 2.45
##   ..$ areaAcres          : num 152
##   ..$ surveys            :'data.frame':  6 obs. of  52 variables:
##   ..$ accesses           :'data.frame':  1 obs. of  5 variables:
##   ..$ lakeName           : chr "Weaver"
##   ..$ DOWNumber          : chr "27011700"
##   ..$ waterClarity       : chr [1, 1:2] "07/14/2008" "7"
##   ..$ meanDepthFeet      : num 20.7
##   ..$ maxDepthFeet       : int 57
##  $ message  : chr "Normal execution."

str(read_lake_survey("http://example.com"))
##  Error: Invalid URL specified 
##    3. stop("Invalid URL specified", call. = FALSE) 
##    2. read_lake_survey("http://example.com") 
##    1. str(read_lake_survey("http://example.com")) 

You can poke at it to prove it's all there.

library(tidyverse)

# get the data into a variable
dat <- read_lake_survey(orig_url)

# focus on the surveys
surveys <- dat$result$surveys

There are "n" data frames for the surveys that match the popup on the page.

There are also many other list elements with "n" entries that are associated with the surveys in the same popup. I don't do this type of analysis so i don't know what makes sense to put with the data frames or not.

This is likely enough to get you going a bit further. It's just adding other elements to the surveys.

map2(surveys$fishCatchSummaries, surveys$surveyDate, ~{ .x$survey_date <- .y ; .x }) %>% 
  map2(surveys$surveyType, ~{ .x$survey_type <- .y ; .x }) %>% 
  map2(surveys$surveySubType, ~{ .x$survey_subtype <- .y ; .x }) %>% 
  map2_df(surveys$surveyID, ~{ .$survey_id <- .y ; .x }) %>% 
  as_tibble() %>% 
  type_convert() %>% 
  glimpse()
## Observations: 120
## Variables: 12
## $ quartileCount  <chr> "0.5-7.5", "0.7-4.2", "N/A", "0.4-2.2", "0.9-5.7", "1.5-7.3"...
## $ CPUE           <dbl> 25.0, 3.6, 4.0, 0.5, 5.0, 17.5, 6.5, 1.0, 0.8, 0.2, 190.0, 0...
## $ totalCatch     <int> 50, 18, 20, 1, 25, 35, 13, 2, 4, 1, 950, 1, 2, 4, 3, 13, 27,...
## $ species        <chr> "YEB", "PMK", "HSF", "WTS", "YEB", "NOP", "BLG", "BLC", "BLC...
## $ totalWeight    <dbl> 41.75, 2.30, 4.50, 3.50, 24.25, 146.25, 3.25, 0.60, 1.45, 2....
## $ quartileWeight <chr> "0.5-0.8", "0.1-0.2", "N/A", "1.5-2.4", "0.5-0.8", "2.0-3.5"...
## $ averageWeight  <dbl> 0.83, 0.13, 0.23, 3.50, 0.97, 4.18, 0.25, 0.30, 0.36, 2.50, ...
## $ gearCount      <int> 2, 5, 5, 2, 5, 2, 2, 2, 5, 5, 5, 2, 2, 2, 5, 2, 5, 5, 5, 2, ...
## $ gear           <chr> "Standard gill nets", "Standard trap nets", "Standard trap n...
## $ survey_date    <date> 1980-06-23, 1980-06-23, 1980-06-23, 1980-06-23, 1980-06-23,...
## $ survey_type    <chr> "Standard Survey", "Standard Survey", "Standard Survey", "St...
## $ survey_subtype <chr> "Population Assessment", "Population Assessment", "Populatio...

If you're not familiar with piping, it's just a way to avoid temporary variables.

tmp <- map2(surveys$fishCatchSummaries, surveys$surveyDate, ~{ .x$survey_date <- .y ; .x })
tmp <- map2(tmp, surveys$surveyType, ~{ .x$survey_type <- .y ; .x })
tmp <- map2(tmp, surveys$surveySubType, ~{ .x$survey_subtype <- .y ; .x })
tmp <- map2_df(tmp, surveys$surveyID, ~{ .$survey_id <- .y ; .x })
tmp <- as_tibble(tmp)
final_data <- type_convert(tmp)

glimpse(final_data)
## Observations: 120
## Variables: 12
## $ quartileCount  <chr> "0.5-7.5", "0.7-4.2", "N/A", "0.4-2.2", "0.9-5.7", "1.5-7.3"...
## $ CPUE           <dbl> 25.0, 3.6, 4.0, 0.5, 5.0, 17.5, 6.5, 1.0, 0.8, 0.2, 190.0, 0...
## $ totalCatch     <int> 50, 18, 20, 1, 25, 35, 13, 2, 4, 1, 950, 1, 2, 4, 3, 13, 27,...
## $ species        <chr> "YEB", "PMK", "HSF", "WTS", "YEB", "NOP", "BLG", "BLC", "BLC...
## $ totalWeight    <dbl> 41.75, 2.30, 4.50, 3.50, 24.25, 146.25, 3.25, 0.60, 1.45, 2....
## $ quartileWeight <chr> "0.5-0.8", "0.1-0.2", "N/A", "1.5-2.4", "0.5-0.8", "2.0-3.5"...
## $ averageWeight  <dbl> 0.83, 0.13, 0.23, 3.50, 0.97, 4.18, 0.25, 0.30, 0.36, 2.50, ...
## $ gearCount      <int> 2, 5, 5, 2, 5, 2, 2, 2, 5, 5, 5, 2, 2, 2, 5, 2, 5, 5, 5, 2, ...
## $ gear           <chr> "Standard gill nets", "Standard trap nets", "Standard trap n...
## $ survey_date    <date> 1980-06-23, 1980-06-23, 1980-06-23, 1980-06-23, 1980-06-23,...
## $ survey_type    <chr> "Standard Survey", "Standard Survey", "Standard Survey", "St...
## $ survey_subtype <chr> "Population Assessment", "Population Assessment", "Populatio...

final_data
## # A tibble: 120 x 12
##    quartileCount  CPUE totalCatch species totalWeight quartileWeight averageWeight gearCount               gear survey_date     survey_type        survey_subtype
##            <chr> <dbl>      <int>   <chr>       <dbl>          <chr>         <dbl>     <int>              <chr>      <date>           <chr>                 <chr>
##  1       0.5-7.5  25.0         50     YEB       41.75        0.5-0.8          0.83         2 Standard gill nets  1980-06-23 Standard Survey Population Assessment
##  2       0.7-4.2   3.6         18     PMK        2.30        0.1-0.2          0.13         5 Standard trap nets  1980-06-23 Standard Survey Population Assessment
##  3           N/A   4.0         20     HSF        4.50            N/A          0.23         5 Standard trap nets  1980-06-23 Standard Survey Population Assessment
##  4       0.4-2.2   0.5          1     WTS        3.50        1.5-2.4          3.50         2 Standard gill nets  1980-06-23 Standard Survey Population Assessment
##  5       0.9-5.7   5.0         25     YEB       24.25        0.5-0.8          0.97         5 Standard trap nets  1980-06-23 Standard Survey Population Assessment
##  6       1.5-7.3  17.5         35     NOP      146.25        2.0-3.5          4.18         2 Standard gill nets  1980-06-23 Standard Survey Population Assessment
##  7           N/A   6.5         13     BLG        3.25            N/A          0.25         2 Standard gill nets  1980-06-23 Standard Survey Population Assessment
##  8      2.5-16.5   1.0          2     BLC        0.60        0.1-0.3          0.30         2 Standard gill nets  1980-06-23 Standard Survey Population Assessment
##  9      1.8-21.2   0.8          4     BLC        1.45        0.2-0.3          0.36         5 Standard trap nets  1980-06-23 Standard Survey Population Assessment
## 10           N/A   0.2          1     NOP        2.50            N/A          2.50         5 Standard trap nets  1980-06-23 Standard Survey Population Assessment
## # ... with 110 more rows


来源:https://stackoverflow.com/questions/46517463/scraping-html-data-table-using-rvest

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!