Creating a table by web-scraping using a loop

Asked by 好久不见 · submitted 2019-12-06 05:54:39

Question


I'm attempting to web-scrape tax-rates.org to get the average property tax percentage for each county in Texas. I have a list of 255 counties in a CSV file, which I import as "TX_counties"; it's a single-column table. I have to build the URL for each county as a string, so I set d1 to the first cell using [i,1], concatenate it into a URL string, perform the scrape, then add 1 to i so it moves to the second cell for the next county name, and the process continues.

The problem is I can't figure out how to store the scrape results in a "growing list" that I can then turn into a table and save to a .csv file at the end. As written, I only ever get one county at a time, because the result is overwritten on each pass.

Any thoughts? (I'm fairly new to R and scraping in general.)

library(XML)         # htmlTreeParse, getNodeSet, xmlValue
library(data.table)

i <- 1  # redundant: the for loop assigns i itself

for (i in 1:255) {

  d1 <- as.character(TX_counties[i,1])

  uri.seed <- paste(c('http://www.tax-rates.org/texas/',d1,'_county_property_tax'), collapse='')

  html <- htmlTreeParse(file = uri.seed, isURL=TRUE, useInternalNodes = TRUE)

  avg_taxrate <- sapply(getNodeSet(html, "//div[@class='box']/div/div[1]/i[1]"), xmlValue)

  # t1 is overwritten on every pass; only the last county survives
  t1 <- data.table(d1,avg_taxrate)

  i <- i+1  # redundant: the for loop already advances i

}

write.csv(t1,"2015_TX_PropertyTaxes.csv")

Answer 1:


This uses rvest, provides a progress bar and takes advantage of the fact that the URLs are already there for you on the page:

library(rvest)
library(pbapply)

pg <- read_html("http://www.tax-rates.org/texas/property-tax")

# get all the county tax table links
ctys <- html_nodes(pg, "table.propertyTaxTable > tr > td > a[href*='county_property']")

# match your lowercased names
county_name <- tolower(gsub(" County", "", html_text(ctys)))

# spider each page and return the rate %
county_rate <- pbsapply(html_attr(ctys, "href"), function(URL) {
  cty_pg <- read_html(URL)
  html_text(html_nodes(cty_pg, xpath="//div[@class='box']/div/div[1]/i[1]"))
}, USE.NAMES=FALSE)

tax_table <- data.frame(county_name, county_rate, stringsAsFactors=FALSE)

tax_table
##   county_name              county_rate
## 1    anderson Avg. 1.24% of home value
## 2     andrews Avg. 0.88% of home value
## 3    angelina Avg. 1.35% of home value
## 4     aransas Avg. 1.29% of home value

write.csv(tax_table, "2015_TX_PropertyTaxes.csv")

NOTE 1: I limited the scrape to 4 counties so as not to kill the bandwidth of a site that offers free data (one way to impose that limit is sketched after NOTE 2).

NOTE 2: There are only 254 county tax links available on that site (Texas has exactly 254 counties), so you seem to have an extra entry if your list has 255.
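For reference, a minimal way to impose that 4-county limit (an assumption; the answer as posted doesn't show the subsetting line) is to truncate the link set before spidering:

ctys <- ctys[1:4]  # hypothetical: keep only the first four county links, matching the 4-row output above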




Answer 2:


library(RCurl)
library(XML)

# two sample counties; swap in the full TX_counties vector for all 255
tx_c <- c("anderson", "andrews")

# scrape each county page, returning c(county, rate)
res <- sapply(seq_along(tx_c), function(x){
    d1 <- as.character(tx_c[x])
    uri.seed <- paste(c('http://www.tax-rates.org/texas/',d1,'_county_property_tax'), collapse='')
    html <- htmlTreeParse(file = uri.seed, isURL=TRUE, useInternalNodes = TRUE)
    avg_taxrate <- sapply(getNodeSet(html, "//div[@class='box']/div/div[1]/i[1]"), xmlValue)
    return(c(d1, avg_taxrate))
})

res.df <- data.frame(t(res), stringsAsFactors = FALSE)
names(res.df) <- c("county", "property")
res.df
#    county                 property
# 1 anderson Avg. 1.24% of home value
# 2  andrews Avg. 0.88% of home value
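
To save the result to a CSV file, as the question asked, one would add a final write step (not part of this answer as posted):

write.csv(res.df, "2015_TX_PropertyTaxes.csv", row.names = FALSE)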



Answer 3:


You should first initialise a list to store the data scraped on each pass of the loop. Make sure to initialise it before you enter the loop.

Then, with each iteration, append to that list before starting the next one; a sketch follows below. See my answer here for the same pattern:

Web Scraping in R with loop from data.frame
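
A minimal sketch of that pattern, reusing the XML-based scraping from the question (the results list, the column names, and the rbindlist() call are illustrative additions, not from the original answer; it assumes TX_counties is loaded as in the question):

library(XML)
library(data.table)

results <- vector("list", 255)  # initialise the list before the loop

for (i in 1:255) {
  d1 <- as.character(TX_counties[i, 1])
  uri.seed <- paste0('http://www.tax-rates.org/texas/', d1, '_county_property_tax')
  html <- htmlTreeParse(file = uri.seed, isURL = TRUE, useInternalNodes = TRUE)
  avg_taxrate <- sapply(getNodeSet(html, "//div[@class='box']/div/div[1]/i[1]"), xmlValue)
  results[[i]] <- data.table(county = d1, avg_taxrate = avg_taxrate)  # store this iteration's result
}

tax_table <- rbindlist(results)  # combine the list into one table
write.csv(tax_table, "2015_TX_PropertyTaxes.csv", row.names = FALSE)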



Source: https://stackoverflow.com/questions/33771265/creating-a-table-by-web-scraping-using-a-loop
