Web scraping the make/model/year of VIN numbers in RStudio

若如初见. Submitted on 2019-12-21 18:03:30

Question


I am working on a project where I need to find the manufacturer, model, and model year for a list of 300 different VINs. Looking up each VIN individually and typing the manufacturer, model, and year into Excel by hand is inefficient and tedious.

I tried using the rvest package with SelectorGadget to write a few lines of R that scrape this site for the information, but I was not successful: http://www.vindecoder.net/?vin=1G2HX54K724118697&submit=Decode

Here is my code:

library("rvest")
Vnum <- "1G2HX54K724118697"
site <- paste("http://www.vindecoder.net/?vin=", Vnum, "&submit=Decode", sep = "")
htmlpage <- html(site)  # html() is the older rvest name for read_html()
VINhtml <- html_nodes(htmlpage, ".odd:nth-child(6) , .even:nth-child(5) , .even:nth-child(7)")
VIN <- html_text(VINhtml)
paste(VIN, collapse = " ")

When I print VINhtml, I get an empty node set instead of the matching nodes: list() attr(,"class") [1] "XMLNodeSet"

I do not know what I am doing wrong. I think it is not working because it is a dynamic webpage but I could be wrong. Does anyone have any suggestions on the best way to approach this problem?

I am also open to using other websites or alternative approaches to figuring this out. I just want to find the model, manufacturer, and model year of these VINs. Can anyone please help me in finding an efficient way of doing this?

Here are some sample VINs: YV4SZ592561226129 YV4SZ592371288470 YV4SZ592371257784 YV4CZ982871331598 YV4CZ982581428985 YV4CZ982481423003 YV4CZ982381423543 YV4CZ982171380593 YV4CZ982081460887 YV4CZ852361288222 YV4CZ852281454409 YV4CZ852281454409 YV4CZ852281454409 YV4CZ592861304665 YV4CZ592861267682 YV4CZ592561266859


Answer 1:


Here is a solution using RSelenium and rvest.

To run RSelenium, you first have to download the Selenium standalone server (I used version 2.45). Let's say the downloaded file is in the My Documents directory. Then run the following two steps in cmd before running RSelenium in your IDE:
a) cd My Documents   (the folder that holds the Selenium server jar)
b) java -jar selenium-server-standalone-2.45.0.jar

library(RSelenium)
library(rvest)
startServer()
remDr <- remoteDriver(browserName = "firefox")
remDr$open()
Vnum <- c("YV4SZ592371288470", "1G2HX54K724118697", "YV4SZ592371288470")

kk <- lapply(Vnum, function(j) {
  remDr$navigate(paste("http://www.vindecoder.net/?vin=", j, "&submit=Decode", sep = ""))
  Sys.sleep(30) # this is critical: it gives the decoded results time to appear on the page
  # getPageSource() comes from RSelenium; once the source is parsed, rvest functions
  # can be used on it until the session is closed
  test.html <- html(remDr$getPageSource()[[1]])
  test.html %>%
    html_nodes(".odd:nth-child(6) , .even:nth-child(5) , .even:nth-child(7)") %>%
    html_text()
})
kk
[[1]]
[1] "Model: XC70"                          "Type: Multipurpose Passenger Vehicle" "Make: Volvo"                         

[[2]]
[1] "Model: Bonneville"            "Make (Manufacturer): Pontiac" "Model year: 2002"            

[[3]]
[1] "Model: XC70"                          "Type: Multipurpose Passenger Vehicle" "Make: Volvo"   

remDr$close()

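P.S. As the output shows, the same CSS path does not work for every VIN: the Volvo results return the vehicle type instead of the model year. You will have to work out the right selectors in advance (I just used the path you provided in the question), and you can wrap the scraping step in some sort of tryCatch.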
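
One possible way to cope with the differing layouts is to pick fields by their label rather than by position, and to wrap each lookup in tryCatch so that one failed VIN does not stop the whole run. The sketch below uses base R only and works on strings shaped like the output above. The helper names (extract_field, parse_vin_fields, safe_decode) and the decode_fun argument are hypothetical, introduced here for illustration; they are not part of rvest or RSelenium.

```r
# Hypothetical helper: return the value of the first "Label: value" string
# whose label matches label_pattern, or NA if no string matches.
extract_field <- function(fields, label_pattern) {
  hit <- grep(label_pattern, fields, ignore.case = TRUE, value = TRUE)
  if (length(hit) == 0) return(NA_character_)
  # Drop everything up to and including the first colon, then trim spaces
  trimws(sub("^[^:]*:[[:space:]]*", "", hit[1]))
}

# Hypothetical helper: turn one scraped character vector into a one-row data frame.
parse_vin_fields <- function(fields) {
  data.frame(
    make  = extract_field(fields, "^make"),        # "Make:" or "Make (Manufacturer):"
    model = extract_field(fields, "^model:"),      # "Model:" but not "Model year:"
    year  = extract_field(fields, "^model year"),
    stringsAsFactors = FALSE
  )
}

# Hypothetical wrapper: decode_fun is whatever function scrapes one VIN and
# returns the character vector; any error yields a row of NAs instead of stopping.
safe_decode <- function(vin, decode_fun) {
  tryCatch(
    parse_vin_fields(decode_fun(vin)),
    error = function(e) parse_vin_fields(character(0))
  )
}

# Example input copied from the scraped output shown above
kk <- list(
  c("Model: XC70", "Type: Multipurpose Passenger Vehicle", "Make: Volvo"),
  c("Model: Bonneville", "Make (Manufacturer): Pontiac", "Model year: 2002")
)
result <- do.call(rbind, lapply(kk, parse_vin_fields))
print(result)

# A decode function that always fails still produces a row, filled with NA
failed <- safe_decode("YV4SZ592371288470", function(v) stop("page did not load"))
```

With this approach a missing field (such as the model year for the Volvo rows) shows up as NA in the result, which makes it easy to spot which VINs still need a different selector.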



Source: https://stackoverflow.com/questions/30780170/web-scraping-the-make-model-year-of-vin-numbers-in-rstudio
