Empty nodes when scraping links with rvest in R

Submitted by 强颜欢笑 on 2020-01-11 13:47:10

Question


My goal is to get links to all Kaggle challenges along with their titles. I am using the rvest library for this, but I do not seem to get far. The nodes come back empty once I go a few divs deep.

I am starting with the first challenge and should then be able to transfer the approach to every other entry. The XPath of the first entry is:

/html/body/div[1]/div[2]/div/div/div[2]/div/div/div[2]/div[2]/div/div/div[2]/div/div/div[1]/a

My idea was to get the link via html_attr(x, "href") once I am in the right tag.

My idea is:

library(rvest)

url <- "https://www.kaggle.com/competitions"
kaggle_html <- read_html(url)
kaggle_text <- html_text(kaggle_html)
kaggle_node <- html_nodes(kaggle_html, xpath = "/html/body/div[1]/div[2]/div/div/div[2]/div/div/div[2]/div[2]/div/div/div[2]/div/div/div[1]/a")
html_attr(kaggle_node, "href")

I can't get past a certain div. The following snippet shows the last node I can access:

node <- html_nodes(kaggle_html, xpath = "/html/body/div[1]/div[2]/div")
html_attrs(node)

Once I go one step further with html_nodes(kaggle_html, xpath = "/html/body/div[1]/div[2]/div/div"), the result is empty.

I think the issue is that Kaggle uses a lazily loaded list that expands further as I scroll down.

(I am aware that I can use %>%. I am saving every intermediate step so that I can access and inspect each one, to learn more easily how this properly works.)


Answer 1:


I solved the issue. I believe I cannot access the full HTML of the site from R because the table is loaded by a script that expands the table (and thus the HTML) as the user scrolls.

I resolved it by expanding the table manually in the browser, downloading the whole HTML page, and loading the local file.
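A minimal sketch of that workaround: read_html() also accepts a local file path, so after saving the fully expanded page you can parse it offline. The file name and the "/c/" competition-URL pattern below are assumptions, not something taken from the original answer:

```r
library(rvest)

# Parse the manually saved page (expand the list in the browser first,
# then save the complete HTML). File name is an assumption.
kaggle_html <- read_html("kaggle_competitions.html")

# Grabbing every anchor and filtering on the href pattern is more robust
# than a long absolute XPath, which breaks when the markup changes.
nodes <- html_nodes(kaggle_html, "a")
hrefs <- html_attr(nodes, "href")
keep  <- grepl("^/c/", hrefs)  # assumed Kaggle competition-link pattern

# One row per challenge: title and relative link.
data.frame(title = html_text(nodes)[keep], link = hrefs[keep])
```

For pages like this, an alternative to saving the HTML manually is to query the site's underlying JSON API or drive a headless browser (e.g. via RSelenium), since the static HTML simply never contains the scrolled-in rows.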



Source: https://stackoverflow.com/questions/49336132/empty-nodes-when-scraping-links-with-rvest-in-r
