rvest

Scraping a table from a section in Wikipedia

爷,独闯天下 submitted on 2019-12-04 17:44:07
I'm trying to come up with a robust way to scrape the final standings of the NFL teams in each season; wonderfully, there is a Wikipedia page with links to all this info. Unfortunately, there is a lot of inconsistency (perhaps to be expected, given the evolution of league structure) in how/where the final standings table is stored. The saving grace should be that the relevant table is always in a section with the word "Standings". Is there some way I can grep a section name and only extract the table node(s) there? Here are some sample pages to demonstrate the structure: 1922 season - Only one
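
One approach that fits the question (a hedged sketch, not a tested answer from the thread): anchor on the heading text with XPath and take the first table that follows it. The 1922 URL and the h2 heading level are assumptions; older seasons may use h3.

    library(rvest)

    # Assumed example page; heading level (h2 vs h3) varies by season
    page <- read_html("https://en.wikipedia.org/wiki/1922_NFL_season")

    # First table following a heading whose text mentions "Standings"
    standings <- page %>%
      html_node(xpath = "//h2[contains(., 'Standings')]/following::table[1]") %>%
      html_table(fill = TRUE)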

Comatose web crawler in R (w/ rvest)

霸气de小男生 submitted on 2019-12-04 16:58:42
I recently discovered the rvest package in R and decided to try out some web scraping. I wrote a small web crawler as a function so I could pipe it downstream for cleaning, etc. With a small URL list (e.g. 1-100) the function works fine; with a larger list, however, it hangs at some point. It seems like one of the commands is waiting for a response but never gets one, and no error is raised.

    urlscrape <- function(url_list) {
      library(rvest)
      library(dplyr)
      assets <- NA
      price <- NA
      description <- NA
      city <- NA
      length(url_list) -> n
      pb <- txtProgressBar(min = 0, max = n, style =
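
One common cause of this symptom is a request that never returns. A hedged sketch of guarding each fetch with a timeout and tryCatch, so a single unresponsive URL cannot stall the crawl (the selector and the 10-second limit are placeholders, not the asker's code):

    library(httr)
    library(rvest)

    scrape_one <- function(url) {
      # Give up after 10 seconds instead of waiting forever
      resp <- tryCatch(GET(url, timeout(10)), error = function(e) NULL)
      if (is.null(resp) || http_error(resp)) return(NA_character_)
      read_html(resp) %>%
        html_node(".price") %>%   # placeholder selector
        html_text()
    }

    prices <- vapply(url_list, scrape_one, character(1))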

rvest: extract tables with URLs instead of text

流过昼夜 submitted on 2019-12-04 15:42:36
The tables I would like to scrape contain URLs. If I run my code, I get only the column with the descriptions (link text) of the URLs. How can I get a table that actually has a column (in my case the second column) with the URLs themselves instead of their descriptions, or with the full HTML of each anchor? I need it to extract two index codes from the URLs in the second column of the table. The links that I would like to scrape look like:

https://aplikacje.nfz.gov.pl/umowy/Agreements/GetAgreements?ROK=2017&ServiceType=03&ProviderId=20795&OW=15&OrthopedicSupply=False&Code=150000001

and I need the ProviderId and Code numbers
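
A minimal sketch of the usual route: take the anchor nodes in the second column with html_attr("href") rather than html_text(), then pull the two query parameters with a regex. The table/column selector is an assumption about the page's markup:

    library(rvest)

    page  <- read_html(listing_url)   # listing_url: the page holding the table
    links <- page %>%
      html_nodes("table tr td:nth-child(2) a") %>%
      html_attr("href")

    provider_id <- sub(".*ProviderId=([0-9]+).*", "\\1", links)
    code        <- sub(".*Code=([0-9]+).*",       "\\1", links)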

Coercing rvest to recognize tables (html_tag(x) == "table" is not TRUE)

喜欢而已 submitted on 2019-12-04 14:06:02
Question: I can't seem to ever get html_table() to work. This is a perfect example (trying to scrape the "6 Games:" table):

    library(rvest)
    hockey <- html("http://www.hockey-reference.com/boxscores/2015/3/6/")
    hockey %>%
      html_nodes("#stats .tooltip , #stats td , #stats a") %>%
      html_table()

But I am getting html_tag(x) == "table" is not TRUE. It's so obviously a table. How can I coerce rvest to recognize the node as a table?

Answer 1: Try either:

    hockey %>% html_table(fill = TRUE)

to parse all the tables on
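
A hedged sketch of the other usual fix: hand html_table() an actual <table> node rather than td/a fragments. The id="stats" guess comes from the selectors in the question:

    library(rvest)

    hockey <- read_html("http://www.hockey-reference.com/boxscores/2015/3/6/")

    games <- hockey %>%
      html_node("table#stats") %>%   # a <table> node, so html_table() accepts it
      html_table(fill = TRUE)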

Handling error response to empty webpage from read_html

两盒软妹~` submitted on 2019-12-04 13:31:15
Trying to scrape a web page title, but running into a problem with a website called "tweg.com":

    library(httr)
    library(rvest)
    page.url <- "tweg.com"
    page.get <- GET(page.url)  # from httr
    pg <- read_html(page.get)  # from rvest
    page.title <- html_nodes(pg, "title") %>% html_text()  # from rvest

read_html stops with an error message: "Error: Failed to parse text". Looking into page.get$content, I find that it is empty (raw(0)). Certainly, I can write a simple check to take this into account and avoid parsing with read_html. However, I feel that a more elegant solution would be to get something back from
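
A minimal sketch of the simple check the asker mentions, wrapped into a helper (the NA fallback is one choice among many):

    library(httr)
    library(rvest)

    safe_title <- function(url) {
      resp <- GET(url)
      # Empty body (raw(0)) or an HTTP error: skip parsing entirely
      if (http_error(resp) || length(content(resp, "raw")) == 0) {
        return(NA_character_)
      }
      read_html(resp) %>% html_node("title") %>% html_text()
    }

    safe_title("http://tweg.com")   # scheme added for a well-formed request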

Creating a table by web-scraping using a loop

拜拜、爱过 submitted on 2019-12-04 10:51:54
I'm attempting to web-scrape tax-rates.org to get the average tax percentage for each county in Texas. I have a list of 255 counties in a CSV file, which I import as "TX_counties"; it's a single-column table. I have to create the URL for each county as a string, so I set d1 to the first cell using [i,1], concatenate it into a URL string, perform the scrape, and then add +1 to [i], which moves it to the second cell for the next county name; the process continues from there. The problem is I can't figure out how to store the scrape results in a "growing list" which I then want to make into a table and
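
A hedged sketch of that pattern: preallocate a list, fill one element per county, then bind everything at the end. The URL template and CSS selector are guesses for illustration, not tax-rates.org's real markup:

    library(rvest)

    TX_counties <- read.csv("tx_counties.csv", stringsAsFactors = FALSE)
    results <- vector("list", nrow(TX_counties))

    for (i in seq_len(nrow(TX_counties))) {
      county <- TX_counties[i, 1]
      url <- paste0("http://www.tax-rates.org/texas/", county, "_property_tax")  # assumed pattern
      rate <- read_html(url) %>%
        html_node(".tax_rate") %>%   # placeholder selector
        html_text()
      results[[i]] <- data.frame(county = county, avg_rate = rate)
    }

    tax_table <- do.call(rbind, results)   # one table; no growing inside the loop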

R rvest encoding errors with UTF-8

自古美人都是妖i submitted on 2019-12-04 09:23:40
I'm trying to get this table from Wikipedia. The source of the file claims it's UTF-8:

    <!DOCTYPE html>
    <html lang="en" dir="ltr" class="client-nojs">
    <head>
    <meta charset="UTF-8"/>
    <title>List of cities in Colombia - Wikipedia, the free encyclopedia</title>
    ...

However, when I try to get the table with rvest, it shows weird characters where there should be accented (standard Spanish) ones like á, é, etc. This is what I attempted:

    theurl <- "https://en.wikipedia.org/wiki/List_of_cities_in_Colombia"
    file <- read_html(theurl, encoding = "UTF-8")
    tables <- html_nodes(file, "table")
    pop <-
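
One workaround that often resolves this symptom (an assumption about the cause, not the thread's accepted answer): the bytes are frequently valid UTF-8 that R has mislabeled, typically on Windows, so re-declaring the encoding on the character columns restores the accents. A sketch, with the table index assumed:

    library(rvest)

    theurl <- "https://en.wikipedia.org/wiki/List_of_cities_in_Colombia"
    file   <- read_html(theurl, encoding = "UTF-8")
    pop    <- html_table(html_nodes(file, "table")[[2]], fill = TRUE)  # index assumed

    # Re-declare the strings as UTF-8 so á, é, ... print correctly
    pop[] <- lapply(pop, function(col) {
      if (is.character(col)) Encoding(col) <- "UTF-8"
      col
    })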

R - Using rvest to scrape a password protected website without logging in at each loop iteration

若如初见. submitted on 2019-12-04 08:45:05
Question: I'm trying to scrape data from a password-protected website in R using the rvest package. My code currently logs in to the website at each iteration of a loop that will run about 15,000 times. This seems very inefficient, but I have not figured out a way around it, because jumping to a different URL without first logging in every time returns me to the website's login page. A simplification of my code is as follows:

    library(rvest)
    url <- "<password-protected website URL>"
    session <
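
A hedged sketch of the approach the question is reaching for: log in once, keep the session (which carries the cookies), and jump between pages inside the loop. The login URL and form field names are assumptions:

    library(rvest)

    login_url <- "https://example.com/login"   # placeholder
    session   <- html_session(login_url)
    form      <- html_form(session)[[1]]
    form      <- set_values(form, username = "me", password = "secret")  # assumed field names
    session   <- submit_form(session, form)

    # The session keeps its cookies, so later requests stay logged in:
    for (u in urls) {
      page <- jump_to(session, u)
      # ... extract from page ...
    }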

Specifying column class in html_table(rvest)

柔情痞子 submitted on 2019-12-04 06:14:45
I am using the html_table() function from rvest to read a two-column concordance table from the website below. Both columns contain instances of leading zeros, which I want to preserve; as such, I want the columns to be of class character. I use the following code:

    library(rvest)
    library(data.table)
    df <- list()
    for (j in 1:25) {
      url <- paste('http://unstats.un.org/unsd/cr/registry/regso.asp?Ci=70&Lg=1&Co=&T=0&p=', j, '&prn=yes', sep='')
      webpage <- read_html(url)
      table <- html_nodes(webpage, 'table')
      df[[j]] <- html_table(table, header=TRUE)[[1]]
      df[[j]] <- df[[j]][, c(1:2)]
    }
    ISIC4.NACE2 <-
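
In current rvest (>= 1.0), html_table() accepts convert = FALSE, which leaves every column as character and so keeps the leading zeros; older versions type-convert unconditionally. A minimal sketch under the assumption that upgrading is an option:

    library(rvest)

    webpage <- read_html(url)   # url built as in the loop above
    tbl <- html_table(html_node(webpage, "table"),
                      header = TRUE, convert = FALSE)
    tbl <- tbl[, 1:2]           # both columns remain character: zeros preserved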

Loop to scrape data from Wikipedia in R

纵饮孤独 submitted on 2019-12-04 04:20:20
Question: I am trying to extract data about celebrity/notable deaths for analysis. Wikipedia has a very regular structure to its HTML paths for notable dates of death; the pages look like:

https://en.wikipedia.org/wiki/Deaths_in_"MONTH"_"YEAR"

For example, this link leads to the notable deaths in March 2014:

https://en.wikipedia.org/wiki/Deaths_in_March_2014

I have located the CSS selector of the lists I need, "#mw-content-text h3+ ul li", and extracted it for a specific link successfully. Now
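
A hedged sketch of the loop the question is building toward: generate one URL per month/year and reuse the selector from the question (the year range is an arbitrary choice for illustration):

    library(rvest)

    years  <- 2010:2014              # illustrative range
    deaths <- list()

    for (y in years) {
      for (m in month.name) {        # "January", "February", ...
        url <- paste0("https://en.wikipedia.org/wiki/Deaths_in_", m, "_", y)
        deaths[[paste(m, y)]] <- read_html(url) %>%
          html_nodes("#mw-content-text h3+ ul li") %>%
          html_text()
      }
    }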