Empty nodes when scraping links with rvest in R

Submitted by 强颜欢笑 on 2020-01-11 13:47:10

Question


My goal is to get links to all Kaggle challenges along with their titles. I am using the rvest library for this, but I do not seem to get far. The nodes come back empty once I go a few divs deep.

I am starting with the first challenge and should then be able to transfer the approach to every other entry. The XPath of the first entry is:

/html/body/div[1]/div[2]/div/div/div[2]/div/div/div[2]/div[2]/div/div/div[2]/div/div/div[1]/a

My idea was to get the link via html_attr(x, "href") once I am in the right tag.

My idea is:

library(rvest)

url <- "https://www.kaggle.com/competitions"
kaggle_html <- read_html(url)
kaggle_text <- html_text(kaggle_html)
kaggle_node <- html_nodes(kaggle_html, xpath = "/html/body/div[1]/div[2]/div/div/div[2]/div/div/div[2]/div[2]/div/div/div[2]/div/div/div[1]/a")
html_attr(kaggle_node, "href")

I can't get past a certain div. The following snippet shows the last node I can access:

node <- html_nodes(kaggle_html, xpath = "/html/body/div[1]/div[2]/div")
html_attrs(node)

Once I go one step further with html_nodes(kaggle_html, xpath = "/html/body/div[1]/div[2]/div/div"), the result is empty.

I think the issue is that Kaggle uses a lazily loaded list that expands further as I scroll down.

(I am aware that I can use %>%. I am saving every intermediate step so that I can access and inspect each one, to learn more easily how this properly works.)


Answer 1:


I solved the issue. I believe I cannot access the full HTML of the site from R because the table is loaded by a script that expands the table (and thus the HTML) as the user scrolls.

I resolved it by expanding the table manually in the browser, downloading the whole HTML page, and loading the local file.
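A minimal sketch of that workaround: read_html() also accepts a local file path, so after saving the fully expanded page you can parse it offline. The file name and the "/c/" competition-URL pattern below are assumptions, not something taken from the original answer:

```r
library(rvest)

# Parse the manually saved page (expand the list in the browser first,
# then save the complete HTML). File name is an assumption.
kaggle_html <- read_html("kaggle_competitions.html")

# Grabbing every anchor and filtering on the href pattern is more robust
# than a long absolute XPath, which breaks when the markup changes.
nodes <- html_nodes(kaggle_html, "a")
hrefs <- html_attr(nodes, "href")
keep  <- grepl("^/c/", hrefs)  # assumed Kaggle competition-link pattern

# One row per challenge: title and relative link.
data.frame(title = html_text(nodes)[keep], link = hrefs[keep])
```

For pages like this, an alternative to saving the HTML manually is to query the site's underlying JSON API or drive a headless browser (e.g. via RSelenium), since the static HTML simply never contains the scrolled-in rows.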



Source: https://stackoverflow.com/questions/49336132/empty-nodes-when-scraping-links-with-rvest-in-r
