Could not read webpage with read_html using the rvest package in R

Submitted by 醉酒当歌 on 2020-01-24 09:11:54

Question


I'm trying to scrape the location of product reviewers from Amazon. For example, take this profile page:

https://www.amazon.com/gp/profile/amzn1.account.AH55KF4JK5IKKJ77MPOLHOR4YAQQ/ref=cm_cr_dp_d_gw_tr?ie=UTF8

I need to get "HAINESVILLE, ILLINOIS, United States".

I use the rvest package for web scraping.

Here is what I did:

library(rvest)       
url='https://www.amazon.com/gp/profile/amzn1.account.AH55KF4JK5IKKJ77MPOLHOR4YAQQ/ref=cm_cr_dp_d_gw_tr?ie=UTF8'
page = read_html(url)

I got the following error:

Error in open.connection(x, "rb") : HTTP error 403.

However, opening the connection explicitly works:

con <- url(url, "rb")
page = read_html(con)

Even so, I could not extract any text from the page I read. For example, I want to extract the reviewer's location:

page %>%
    html_nodes("#customer-profile-name-header .a-size-base a-color-base")%>%
    html_text()

I got nothing:

character(0)

Can anyone help me figure out what I did wrong? Thanks a lot in advance.


Answer 1:


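This should work. Judging by the page source, the location is stored as a JSON string inside a script element rather than in the visible markup, so the code pulls it out with a regular expression: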

library(dplyr)
library(rvest)
library(stringr)

# get url
url='https://www.amazon.com/gp/profile/amzn1.account.AH55KF4JK5IKKJ77MPOLHOR4YAQQ/ref=cm_cr_dp_d_gw_tr?ie=UTF8'

# open page
con <- url(url, "rb")
page = read_html(con)

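# get the desired information (located with View Page Source):
# the location sits in a <script> element that mentions 'occupation'
res <- page %>%
  html_nodes(xpath = ".//script[contains(., 'occupation')]") %>%
  html_text() %>%
  str_match("location\":\"(.*?)\",\"personalDescription")

# column 2 holds the first capture group, i.e. the location
res[,2]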
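
To see what the regular expression does without hitting the network, here is a minimal offline sketch. The `script_text` string is a made-up stand-in for the script contents (the real text is longer and may be laid out differently); only the pattern is taken from the code above.

```r
library(stringr)

# Hypothetical stand-in for the text of the matching <script> node;
# the real page's script content may differ.
script_text <- '{"occupation":"","location":"HAINESVILLE, ILLINOIS, United States","personalDescription":""}'

# str_match() returns a character matrix:
# column 1 is the full match, column 2 is the first capture group.
res <- str_match(script_text, "location\":\"(.*?)\",\"personalDescription")

# The lazy group (.*?) stops at the first '","personalDescription'.
location <- res[, 2]
print(location)
```

Because the pattern depends on "personalDescription" following "location" directly, it would return NA if the site reordered those fields.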

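A separate issue in the question's selector: in CSS, ".a-size-base a-color-base" means "an element named a-color-base nested inside an element with class a-size-base", so it can never match a single element carrying both classes. The sketch below shows the difference on a small hypothetical HTML fragment; the markup is invented for illustration and is not copied from the real page, where the location may not be present in the static HTML at all.

```r
library(rvest)

# Hypothetical markup for illustration only.
html <- read_html('<html><body>
  <div id="customer-profile-name-header">
    <span class="a-size-base a-color-base">HAINESVILLE, ILLINOIS, United States</span>
  </div>
</body></html>')

# Selector from the question: the space plus missing dot turns
# "a-color-base" into a descendant tag name, so nothing matches.
no_match <- html %>%
  html_nodes("#customer-profile-name-header .a-size-base a-color-base") %>%
  html_text()

# Chained classes (no space, each with a leading dot) match one
# element that has both classes.
with_fix <- html %>%
  html_nodes("#customer-profile-name-header .a-size-base.a-color-base") %>%
  html_text()

print(length(no_match))
print(with_fix)
```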

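As for the 403 itself: it means the server refused the request. One common cause, which is an assumption here and not something confirmed for this site, is that the server rejects the default headers sent by read_html(url). A minimal sketch that sets an explicit user-agent with httr follows; `fetch_html` and the user-agent string are hypothetical, and this has not been tested against the page in question.

```r
# Hypothetical helper: fetch a page with an explicit user-agent and
# parse it. Uses namespace-qualified calls, so httr and xml2 must be
# installed but need not be attached.
fetch_html <- function(page_url, ua = "Mozilla/5.0") {
  resp <- httr::GET(page_url, httr::user_agent(ua))
  # Turn HTTP errors such as 403 into an R error with a clear message.
  httr::stop_for_status(resp)
  xml2::read_html(httr::content(resp, as = "text", encoding = "UTF-8"))
}

# Usage (requires network access):
# page <- fetch_html(url)
```
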
Source: https://stackoverflow.com/questions/56064293/could-not-read-webpage-with-read-html-using-rvest-package-from-r
