Creating a table by web-scraping using a loop

Posted by 拜拜、爱过 on 2019-12-04 10:51:54

This uses rvest, provides a progress bar and takes advantage of the fact that the URLs are already there for you on the page:

library(rvest)
library(pbapply)

pg <- read_html("http://www.tax-rates.org/texas/property-tax")

# get all the county tax table links
ctys <- html_nodes(pg, "table.propertyTaxTable > tr > td > a[href*='county_property']")

# match your lowercased names
county_name <- tolower(gsub(" County", "", html_text(ctys)))

# spider each page and return the rate %
county_rate <- pbsapply(html_attr(ctys, "href"), function(URL) {
  cty_pg <- read_html(URL)
  html_text(html_nodes(cty_pg, xpath="//div[@class='box']/div/div[1]/i[1]"))
}, USE.NAMES=FALSE)

tax_table <- data.frame(county_name, county_rate, stringsAsFactors=FALSE)

tax_table
##   county_name              county_rate
## 1    anderson Avg. 1.24% of home value
## 2     andrews Avg. 0.88% of home value
## 3    angelina Avg. 1.35% of home value
## 4     aransas Avg. 1.29% of home value

write.csv(tax_table, "2015_TX_PropertyTaxes.csv")

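NOTE 1: The sample output shows only 4 counties because I limited the scrape to 4 pages, so as not to eat up the bandwidth of a site that offers free data. The code as shown fetches every county link.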

NOTE 2: There are only 254 county tax links available on that site, so you seem to have an extra one if you have 255.
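The county_rate column comes back as text such as "Avg. 1.24% of home value". If you want a numeric percentage, something like the following might work. It is only a sketch: it assumes every scraped string contains exactly one number followed by a % sign, and the name rate_pct is made up for illustration.

```r
# Sample strings copied from the output above
county_rate <- c("Avg. 1.24% of home value", "Avg. 0.88% of home value")

# Pull out the number that sits directly before the "%" sign.
# Assumption: each string contains one such number; strings without a
# match would be dropped by regmatches() and misalign the result.
rate_match <- regexpr("[0-9]+\\.?[0-9]*(?=%)", county_rate, perl = TRUE)
rate_pct <- as.numeric(regmatches(county_rate, rate_match))

rate_pct
```

On a full scrape it would be worth checking that length(rate_pct) equals the number of counties before adding it to the table.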

library(RCurl)
library(XML)
tx_c <- c("anderson", "andrews")

res <- sapply(1:2, function(x){
    d1 <- as.character(tx_c[x])
    uri.seed <- paste(c('http://www.tax-rates.org/texas/',d1,'_county_property_tax'), collapse='')
    html <- htmlTreeParse(file = uri.seed, isURL=TRUE, useInternalNodes = TRUE)
    avg_taxrate <- sapply(getNodeSet(html, "//div[@class='box']/div/div[1]/i[1]"), xmlValue)
    return(c(d1, avg_taxrate))
})

res.df <- data.frame(t(res), stringsAsFactors = FALSE)
names(res.df) <- c("county", "property")
res.df
#    county                 property
# 1 anderson Avg. 1.24% of home value
# 2  andrews Avg. 0.88% of home value

You should first initialise a list to store the data scraped in each iteration. Make sure to initialise it before you enter the loop.

Then, within each iteration, append the result to the list before starting the next iteration. See my answer here:

Web Scraping in R with loop from data.frame
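A rough sketch of that pattern, using the URL scheme from the answer above. The function scrape_county is a hypothetical stand-in for the real scraping step: it makes no network request and only builds the URL, so that the loop structure can be run on its own.

```r
# Hypothetical placeholder for the real scraping step (no network access);
# it returns a one-row data frame for the county it is given.
scrape_county <- function(county) {
  data.frame(
    county = county,
    url = paste0("http://www.tax-rates.org/texas/", county, "_county_property_tax"),
    stringsAsFactors = FALSE
  )
}

counties <- c("anderson", "andrews")

# Initialise the list BEFORE entering the loop
results <- vector("list", length(counties))

for (i in seq_along(counties)) {
  # Store this iteration's result before moving on to the next one
  results[[i]] <- scrape_county(counties[i])
}

# Combine the per-county rows into a single data frame
tax_urls <- do.call(rbind, results)
tax_urls
```

Pre-allocating the list with vector("list", n) and filling it by index avoids growing a data frame row by row inside the loop; in real use you would replace the body of scrape_county with the read_html / html_nodes calls.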
