Creating a table by web-scraping using a loop

Asked by 好久不见 · submitted 2019-12-06 05:54:39

Question


I'm attempting to web-scrape tax-rates.org to get the average property tax percentage for each county in Texas. I have a list of 255 counties in a CSV file, which I import as "TX_counties"; it's a single-column table. I have to build the URL for each county as a string, so I set d1 to the first cell using [i,1], concatenate it into a URL string, perform the scrape, then add 1 to i so it moves to the second cell for the next county name, and the process continues.

The problem is I can't figure out how to store the scrape results in a "growing list" that I can then turn into a table and save to a .csv file at the end. As written, I only ever get one county at a time, because the result is overwritten on each pass.

Any thoughts? (I'm fairly new to R and scraping in general.)

library(XML)         # htmlTreeParse, getNodeSet, xmlValue
library(data.table)

i <- 1  # redundant: the for loop assigns i itself

for (i in 1:255) {

  d1 <- as.character(TX_counties[i,1])

  uri.seed <- paste(c('http://www.tax-rates.org/texas/',d1,'_county_property_tax'), collapse='')

  html <- htmlTreeParse(file = uri.seed, isURL=TRUE, useInternalNodes = TRUE)

  avg_taxrate <- sapply(getNodeSet(html, "//div[@class='box']/div/div[1]/i[1]"), xmlValue)

  # t1 is overwritten on every pass; only the last county survives
  t1 <- data.table(d1,avg_taxrate)

  i <- i+1  # redundant: the for loop already advances i

}

write.csv(t1,"2015_TX_PropertyTaxes.csv")

Answer 1:


This uses rvest, provides a progress bar and takes advantage of the fact that the URLs are already there for you on the page:

library(rvest)
library(pbapply)

pg <- read_html("http://www.tax-rates.org/texas/property-tax")

# get all the county tax table links
ctys <- html_nodes(pg, "table.propertyTaxTable > tr > td > a[href*='county_property']")

# match your lowercased names
county_name <- tolower(gsub(" County", "", html_text(ctys)))

# spider each page and return the rate %
county_rate <- pbsapply(html_attr(ctys, "href"), function(URL) {
  cty_pg <- read_html(URL)
  html_text(html_nodes(cty_pg, xpath="//div[@class='box']/div/div[1]/i[1]"))
}, USE.NAMES=FALSE)

tax_table <- data.frame(county_name, county_rate, stringsAsFactors=FALSE)

tax_table
##   county_name              county_rate
## 1    anderson Avg. 1.24% of home value
## 2     andrews Avg. 0.88% of home value
## 3    angelina Avg. 1.35% of home value
## 4     aransas Avg. 1.29% of home value

write.csv(tax_table, "2015_TX_PropertyTaxes.csv")

NOTE 1: I limited the scrape to 4 counties so as not to kill the bandwidth of a site that offers free data (one way to impose that limit is sketched after NOTE 2).

NOTE 2: There are only 254 county tax links available on that site (Texas has exactly 254 counties), so you seem to have an extra entry if your list has 255.
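For reference, a minimal way to impose that 4-county limit (an assumption; the answer as posted doesn't show the subsetting line) is to truncate the link set before spidering:

ctys <- ctys[1:4]  # hypothetical: keep only the first four county links, matching the 4-row output above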




Answer 2:


library(RCurl)
library(XML)

# two sample counties; swap in the full TX_counties vector for all 255
tx_c <- c("anderson", "andrews")

# scrape each county page, returning c(county, rate)
res <- sapply(seq_along(tx_c), function(x){
    d1 <- as.character(tx_c[x])
    uri.seed <- paste(c('http://www.tax-rates.org/texas/',d1,'_county_property_tax'), collapse='')
    html <- htmlTreeParse(file = uri.seed, isURL=TRUE, useInternalNodes = TRUE)
    avg_taxrate <- sapply(getNodeSet(html, "//div[@class='box']/div/div[1]/i[1]"), xmlValue)
    return(c(d1, avg_taxrate))
})

res.df <- data.frame(t(res), stringsAsFactors = FALSE)
names(res.df) <- c("county", "property")
res.df
#    county                 property
# 1 anderson Avg. 1.24% of home value
# 2  andrews Avg. 0.88% of home value
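
To save the result to a CSV file, as the question asked, one would add a final write step (not part of this answer as posted):

write.csv(res.df, "2015_TX_PropertyTaxes.csv", row.names = FALSE)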



Answer 3:


You should first initialise a list to store the data scraped on each pass of the loop. Make sure to initialise it before you enter the loop.

Then, with each iteration, append to that list before starting the next one; a sketch follows below. See my answer here for the same pattern:

Web Scraping in R with loop from data.frame
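
A minimal sketch of that pattern, reusing the XML-based scraping from the question (the results list, the column names, and the rbindlist() call are illustrative additions, not from the original answer; it assumes TX_counties is loaded as in the question):

library(XML)
library(data.table)

results <- vector("list", 255)  # initialise the list before the loop

for (i in 1:255) {
  d1 <- as.character(TX_counties[i, 1])
  uri.seed <- paste0('http://www.tax-rates.org/texas/', d1, '_county_property_tax')
  html <- htmlTreeParse(file = uri.seed, isURL = TRUE, useInternalNodes = TRUE)
  avg_taxrate <- sapply(getNodeSet(html, "//div[@class='box']/div/div[1]/i[1]"), xmlValue)
  results[[i]] <- data.table(county = d1, avg_taxrate = avg_taxrate)  # store this iteration's result
}

tax_table <- rbindlist(results)  # combine the list into one table
write.csv(tax_table, "2015_TX_PropertyTaxes.csv", row.names = FALSE)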



Source: https://stackoverflow.com/questions/33771265/creating-a-table-by-web-scraping-using-a-loop
