Web scraping the data behind every URL from a list of URLs

Submitted by 房东的猫 on 2019-12-23 02:25:20

Question


I am trying to gather a dataset from a site called ICObench. I've managed to extract the names of each ICO across the 91 pages using rvest and purrr, but I'm confused about how to extract the data behind each name in the list. All the names are clickable links. This is the code so far:

url_base <- "https://icobench.com/icos?page=%d&filterBonus=&filterBounty=&filterTeam=&filterExpert=&filterSort=&filterCategory=all&filterRating=any&filterStatus=ended&filterCountry=any&filterRegistration=0&filterExcludeArea=none&filterPlatform=any&filterCurrency=any&filterTrading=any&s=&filterStartAfter=&filterEndBefore="

library(rvest)
library(purrr)

ICOdataset <- map_df(1:91, function(i) {
  page <- read_html(sprintf(url_base, i))
  data.frame(
    ICOname = html_text(html_nodes(page, ".name")),
    stringsAsFactors = FALSE
  )
})

Is there any way to match the specific info behind each name to the existing list, so that R automatically extracts it for all ICOs? For example, from https://icobench.com/ico/domraider I would like the funding amount, token, country, etc.

Any help would be greatly appreciated!


Answer 1:


First, load library(tidyverse); library(rvest). A warning: this code is not the most efficient -- you could avoid growing the list() structure by using lapply or a more purrr-style approach, but I'll leave that as an exercise.


The crux of the answer is starting a session via rvest::html_session() and then using rvest::follow_link() and/or rvest::jump_to(), but there are a few other data-cleansing challenges, so I thought I would give a more complete answer. Since you already have all the "links" you want to follow in your ICOdataset variable, we can leverage that and build a function that gets the data for any particular ICO page.

For example, assuming we've already followed ../ico/domraider, we can write a function, get_data_for_ico(), to extract its relevant info:

get_data_for_ico <- function(ico_page) {
  # Funding amount from the ".raised" node
  raised <-
    ico_page %>%
    html_node(".raised") %>%
    html_text()

  # Alternating label/value text from the data rows
  data <-
    ico_page %>%
    html_nodes(".data_row .col_2") %>%
    html_text(trim = TRUE)

  # Even positions hold the values, odd positions hold the labels
  data_df <- data.frame(raised, t(data[c(FALSE, TRUE)]), stringsAsFactors = FALSE)
  names(data_df) <- c("raised", t(data[c(TRUE, FALSE)]))
  return(data_df)
}

Note that the table of data from the second selector (.data_row .col_2) is not ideal, but it will work and is par for the course when it comes to scraping. The data[c(FALSE, TRUE)] and data[c(TRUE, FALSE)] expressions pull every even- or every odd-positioned element, respectively. Why? Because you'll notice that the table of data is not consistent by ICO, so we'll need a varying-length data.frame that dynamically assigns its names.
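The trick relies on R recycling a short logical index vector along the whole data vector. Here is a quick sketch on a toy vector (the labels and values are made up, not taken from the site):

```r
# A flat vector of alternating label/value pairs, as the scrape returns them
data <- c("Token", "DRT", "Country", "France", "Raised", "$45,000,000")

# c(FALSE, TRUE) recycles to F T F T F T, keeping positions 2, 4, 6, ...
values <- data[c(FALSE, TRUE)]   # "DRT" "France" "$45,000,000"

# c(TRUE, FALSE) recycles to T F T F T F, keeping positions 1, 3, 5, ...
labels <- data[c(TRUE, FALSE)]   # "Token" "Country" "Raised"
```

Because the index is recycled rather than fixed-length, the same two expressions work no matter how many label/value pairs a given ICO page has.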

Now we can start a session and loop through the ICOs using jump_to(), running our function and storing the results in a list.

library(stringr)  # str_to_lower(), str_replace_all()

results <- list()
s <- html_session(sprintf(url_base, 1))

for (ico in seq_along(ICOdataset$ICOname)) {
  # Make each name URL-friendly: lower-case, spaces/periods become hyphens
  clean_ico <-
    ICOdataset$ICOname[ico] %>%
    str_to_lower() %>%
    str_replace_all("\\s|\\.", "-")
  link_name <- paste0("ico/", clean_ico)
  message(link_name)

  results[[clean_ico]] <-
    s %>%
    jump_to(link_name) %>%
    get_data_for_ico()
}

Note that you need to clean up the names from your original scrape so that they are URL-friendly (i.e., replace spaces and periods with hyphens).
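As a quick illustration of the clean-up step, using hypothetical ICO names (not necessarily real entries from the scrape):

```r
library(stringr)

# Hypothetical display names as they might appear on the listing page
ico_names <- c("Genesis Vision", "Status.im")

# Lower-case, then turn any whitespace or period into a hyphen,
# matching the slug format of the /ico/ URLs
clean <- str_replace_all(str_to_lower(ico_names), "\\s|\\.", "-")
clean  # "genesis-vision" "status-im"
```

Any name containing other punctuation (slashes, apostrophes, etc.) would need extra handling; the regex above only covers the spaces and periods mentioned in the answer.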

Now that we have our results in a list, we can convert to a pretty tibble like so:

results_df <-
  bind_rows(results, .id = "ICO") %>%
  as_data_frame()

# # A tibble: 60 x 12
#    ICO     raised  Token `Price in ICO` Country `preICO start` `preICO end`
#    <chr>   <chr>   <chr> <chr>          <chr>   <chr>          <chr>       
#  1 domrai~ ~$45,0~ DRT   0.12 USD       France  16th Aug 2017  11th Sep 20~
#  2 genesi~ ~$2,83~ GVT   1.00 USD       Russia  15th Sep 2017  5th Oct 2017
#  3 latoken ~$20,0~ LAT   0.30 USD       Singap~ NA             NA          
#  4 vibera~ ~$10,7~ VIB   0.10 USD       Sloven~ NA             NA          
#  5 wepower ~$40,0~ WPR   0.13 USD       Gibral~ 22nd Sep 2017  23rd Oct 20~
#  6 xinfin  NA      XDCE  1 ETH = 133,0~ Singap~ 1st Jun 2017   31st Jul 20~
#  7 aeron   ~$5,68~ ARN   0.50 USD       Belize  1st Sep 2017   19th Sep 20~
#  8 ambros~ ~$30,0~ AMB   0.29 USD       Switze~ NA             NA          
#  9 appcoi~ ~$15,3~ APPC  0.10 USD       Singap~ 6th Nov 2017   20th Nov 20~
# 10 bankex  ~$70,6~ BKX   1 ETH = 500 B~ USA     NA             NA          
# # ... with 50 more rows, and 5 more variables: `ICO start` <chr>, `ICO
# #   end` <chr>, `Whitelist/KYC` <chr>, `Restricted areas` <chr>, `Price in
# #   preICO` <chr>


Source: https://stackoverflow.com/questions/49390374/web-scraping-the-data-behind-every-url-from-a-list-of-urls
