Question
I am trying to gather a dataset from a site called ICObench. I've managed to extract the names of each ICO across the 91 pages using rvest and purrr, but I'm confused as to how I can extract the data behind each name in the list. All the names are clickable links. This is the code so far:
url_base <- "https://icobench.com/icos?page=%d&filterBonus=&filterBounty=&filterTeam=&filterExpert=&filterSort=&filterCategory=all&filterRating=any&filterStatus=ended&filterCountry=any&filterRegistration=0&filterExcludeArea=none&filterPlatform=any&filterCurrency=any&filterTrading=any&s=&filterStartAfter=&filterEndBefore="
library(rvest)
library(purrr)

ICOdataset <- map_df(1:91, function(i) {
  page <- read_html(sprintf(url_base, i))
  data.frame(ICOname = html_text(html_nodes(page, ".name")))
})
Is there any way that I can match the specific info behind each name to the existing list so that R automatically extracts it for all ICOs? For example: https://icobench.com/ico/domraider - I would like the funding amount, token, country, etc.
Any help would be greatly appreciated!
Answer 1:
First load library(tidyverse); library(rvest).

A warning up front: this code is not the most efficient (you could avoid growing the list() structure by using lapply or a more purrr-style approach, but I'll leave that as an exercise).

The crux of the answer is starting a session via rvest::html_session() and then using rvest::follow_link() and/or rvest::jump_to(). There are a few other data-cleansing challenges, though, so I thought I would give a more complete answer. Since you already have all the "links" you want to follow in your ICOdataset variable, we can leverage that and build a function that gets the data for any particular ICO page.

For example, assuming we've already followed ../ico/domraider, we can write a function, get_data_for_ico(), to extract its relevant info:
get_data_for_ico <- function(ico_page) {
  # Total amount raised, shown in its own element
  raised <-
    ico_page %>%
    html_node(".raised") %>%
    html_text()
  # The detail rows come back as alternating label/value text
  data <-
    ico_page %>%
    html_nodes(".data_row .col_2") %>%
    html_text(trim = TRUE)
  # Even-positioned elements are the values, odd-positioned the column names
  data_df <- data.frame(raised, t(data[c(FALSE, TRUE)]))
  names(data_df) <- c("raised", t(data[c(TRUE, FALSE)]))
  return(data_df)
}
Note that the table of data from the second selector (.data_row .col_2) is not ideal, but it will work, and that is par for the course when it comes to scraping. The data[c(FALSE, TRUE)] and data[c(TRUE, FALSE)] subsets pull every even or every odd element, respectively (R recycles the length-2 logical vector along data). Why? Because you'll notice that the table of data is not consistent across ICOs, so we need a data.frame of varying length that dynamically assigns its names.
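To make the recycling trick concrete, here is a small illustration with made-up label/value text (the strings are just examples, not real scraped output):

```r
# A logical vector shorter than the data is recycled, so a length-2
# pattern selects every other element:
data <- c("Token", "DRT", "Price in ICO", "0.12 USD", "Country", "France")

labels <- data[c(TRUE, FALSE)]   # odd positions: the field names
values <- data[c(FALSE, TRUE)]   # even positions: the field values

labels  # "Token" "Price in ICO" "Country"
values  # "DRT" "0.12 USD" "France"
```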
Now we can start a session and loop through the ICOs, using jump_to() and running our function while storing the results in a list:
results <- list()
s <- html_session(sprintf(url_base, 1))

for (ico in seq_along(ICOdataset$ICOname)) {
  # Make the scraped name URL-friendly (lowercase; spaces/periods -> hyphens)
  clean_ico <-
    ICOdataset$ICOname[ico] %>%
    str_to_lower() %>%
    str_replace_all("\\s|\\.", "-")
  link_name <- paste0("ico/", clean_ico)
  message(link_name)
  results[[clean_ico]] <-
    s %>%
    jump_to(link_name) %>%
    get_data_for_ico()
}
Note that you need to clean up the names from your original scrape so that they are URL-friendly (i.e., replace spaces and periods with hyphens).
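For instance, the cleaning step maps display names to slugs like this (the names below are purely illustrative, not necessarily real ICObench entries):

```r
library(stringr)

# Same pipeline as in the loop: lowercase, then replace each
# whitespace character or period with a hyphen
clean <- function(x) str_replace_all(str_to_lower(x), "\\s|\\.", "-")

clean("Genesis Vision")  # "genesis-vision"
clean("Crypto.com")      # "crypto-com"
```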
Now that we have our results in a list, we can convert them to a pretty tibble like so:
results_df <-
  bind_rows(results, .id = "ICO") %>%
  as_tibble()
# # A tibble: 60 x 12
# ICO raised Token `Price in ICO` Country `preICO start` `preICO end`
# <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 domrai~ ~$45,0~ DRT 0.12 USD France 16th Aug 2017 11th Sep 20~
# 2 genesi~ ~$2,83~ GVT 1.00 USD Russia 15th Sep 2017 5th Oct 2017
# 3 latoken ~$20,0~ LAT 0.30 USD Singap~ NA NA
# 4 vibera~ ~$10,7~ VIB 0.10 USD Sloven~ NA NA
# 5 wepower ~$40,0~ WPR 0.13 USD Gibral~ 22nd Sep 2017 23rd Oct 20~
# 6 xinfin NA XDCE 1 ETH = 133,0~ Singap~ 1st Jun 2017 31st Jul 20~
# 7 aeron ~$5,68~ ARN 0.50 USD Belize 1st Sep 2017 19th Sep 20~
# 8 ambros~ ~$30,0~ AMB 0.29 USD Switze~ NA NA
# 9 appcoi~ ~$15,3~ APPC 0.10 USD Singap~ 6th Nov 2017 20th Nov 20~
# 10 bankex ~$70,6~ BKX 1 ETH = 500 B~ USA NA NA
# # ... with 50 more rows, and 5 more variables: `ICO start` <chr>, `ICO
# # end` <chr>, `Whitelist/KYC` <chr>, `Restricted areas` <chr>, `Price in
# # preICO` <chr>
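As for the "exercise" mentioned above, the accumulating for loop can be rewritten in purrr style with map_dfr(), which row-binds the per-ICO data frames and fills the ID column from the list names. A sketch only, assuming s, get_data_for_ico(), and ICOdataset from above (unlike the loop, the ICO column here keeps the original, uncleaned names):

```r
library(purrr)
library(stringr)

results_df <-
  ICOdataset$ICOname %>%
  set_names() %>%                 # names become the .id column values
  map_dfr(function(nm) {
    # Same URL-friendly cleaning as in the loop version
    clean_ico <- str_replace_all(str_to_lower(nm), "\\s|\\.", "-")
    s %>%
      jump_to(paste0("ico/", clean_ico)) %>%
      get_data_for_ico()
  }, .id = "ICO")
```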
Source: https://stackoverflow.com/questions/49390374/web-scraping-the-data-behind-every-url-from-a-list-of-urls