Web scraping of image

Question

I am a beginner.

I wrote a small script to scrape web pages with rvest. I found a very convenient pipeline, %>% html_node() %>% html_text() %>% as.numeric(), but I could not adapt it correctly to scrape the URL of an image.

My code for scraping the image URL:

UrlPage <- html("http://eyeonhousing.org/2012/11/gdp-growth-in-the-third-quarter-improved-but-still-slow/")

img <- UrlPage %>% html_node(".wp-image-5984") %>% html_attrs()

Result:

class  "aligncenter size-full wp-image-5984"
title  "blog gdp 2012_10_1"
alt    ""
src    "http://eyeonhousing.files.wordpress.com/2012/11/blog-gdp-2012_10_1.jpg"
height "337"
width  "450"

Question: how do I get only the link, without the other attributes?

Please help me find a solution. Thank you!

Answer 1:

You need to specify which attribute you want to extract as a parameter for html_attr. Also, you may want to make your CSS selector, the parameter for html_node, more specific. Here is my code:

library(rvest)

UrlPage <- html("http://eyeonhousing.org/2012/11/gdp-growth-in-the-third-quarter-improved-but-still-slow/")
ImgNode <- UrlPage %>% html_node("img.wp-image-5984")
link <- html_attr(ImgNode, "src")

The link variable now contains the URL.
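To make the difference between html_attrs and html_attr concrete, here is a minimal offline sketch. It assumes an HTML fragment (hypothetical, modeled on the attributes printed in the question) parsed from a string with read_html instead of the live page, so it may not match the real page exactly.

```r
library(rvest)

# Hypothetical fragment modeled on the attributes printed in the question;
# parsing a string keeps the example offline.
page <- read_html('<div><img class="aligncenter size-full wp-image-5984"
  alt="" src="http://eyeonhousing.files.wordpress.com/2012/11/blog-gdp-2012_10_1.jpg"
  height="337" width="450"></div>')

img_node <- page %>% html_node("img.wp-image-5984")

# html_attrs() returns every attribute of the node as a named character vector
all_attrs <- html_attrs(img_node)

# html_attr() returns only the attribute you name
link <- html_attr(img_node, "src")

# The same value can also be picked out of the full vector by name
same_value <- identical(link, all_attrs[["src"]])
```

The img. prefix in the selector restricts the match to img elements, so another element that happened to carry the same class would not be picked up.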

You can find a decent reference for css selectors here: http://www.w3schools.com/cssref/css_selectors.asp

Also the rvest documentation has some good examples on how to use its functions: http://cran.r-project.org/web/packages/rvest/rvest.pdf

回答2:

klib is right. I just updated html (deprecated) to read_html and added a download command.

library(rvest)

# read_html() replaces the deprecated html()
myurl <- read_html("http://eyeonhousing.org/2012/11/gdp-growth-in-the-third-quarter-improved-but-still-slow/")
mynode <- myurl %>% html_node("img.wp-image-5984")
link <- html_attr(mynode, "src")
# mode = "wb" writes the image as binary, which matters on Windows
download.file(url = link, destfile = "test.jpg", mode = "wb")
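If a page has several images, the same idea extends with html_nodes, which returns every match. The sketch below is only an illustration under assumptions: the HTML fragment, the example.com URLs, the size-full selector, and the download_images helper are all hypothetical, and the download call is left commented out so the example runs offline.

```r
library(rvest)

# Hypothetical fragment with two images; the example.com URLs are placeholders.
page <- read_html('<div>
  <img class="size-full" src="http://example.com/files/a.jpg">
  <img class="size-full" src="http://example.com/files/b.jpg">
</div>')

# html_nodes() matches every node; html_attr() then returns one value per node
links <- page %>% html_nodes("img.size-full") %>% html_attr("src")

# Reuse the file name at the end of each URL as the local file name
destfiles <- basename(links)

# Hypothetical helper: download each image in binary mode
download_images <- function(urls, files) {
  for (i in seq_along(urls)) {
    download.file(url = urls[i], destfile = files[i], mode = "wb")
  }
}

# download_images(links, destfiles)  # uncomment to download (needs network access)
```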

Source: https://stackoverflow.com/questions/30693476/web-scraping-of-image

Tags

rvest