Scraping data from LinkedIn using RSelenium (and rvest)


Question


I am trying to scrape some data from the LinkedIn profiles of well-known people, and I have a few problems. I would like to do the following:

  1. On Hadley Wickham's page ( https://www.linkedin.com/in/hadleywickham/ ) I would like to use RSelenium to log in and click "Show 1 more education" and "Show 1 more experience" (note that Hadley does not have a "Show 1 more experience" option, but does have "Show 1 more education"). Clicking "Show more experience/education" lets me scrape the full education and experience sections from the page. Alternatively, Ted Cruz's profile has a "Show 5 more experiences" option, which I would likewise like to expand and scrape.

Code:

library(RSelenium)
library(rvest)
library(stringr)
library(xml2)

userID = "myEmailLogin" # The linkedIn email to login
passID = "myPassword"   # and LinkedIn password

rD <- rsDriver(port = 4444L, browser = "firefox")
remDr <- rD$client  # rsDriver() returns an already-open client, so no separate remoteDriver()/open() is needed
remDr$navigate("https://www.linkedin.com/login")

user <- remDr$findElement(using = 'id',"username")
user$sendKeysToElement(list(userID,key="tab"))

pass <- remDr$findElement(using = 'id',"password")
pass$sendKeysToElement(list(passID,key="enter"))

Sys.sleep(5) # give the page time to fully load
# Navigate to individual profiles
# remDr$navigate("https://www.linkedin.com/in/thejlo/") # Jennifer Lopez
# remDr$navigate("https://www.linkedin.com/in/cruzted/") # Ted Cruz
remDr$navigate("https://www.linkedin.com/in/hadleywickham/") # Hadley Wickham 

Sys.sleep(5) # give the page time to fully load
html <- remDr$getPageSource()[[1]]


signals <- read_html(html)

personFullNameLocationXPath <- '/html/body/div[9]/div[3]/div/div/div/div/div[2]/main/div[1]/section/div[2]/div[2]/div[1]/ul[1]/li[1]'
personName <- signals %>%
  html_nodes(xpath = personFullNameLocationXPath) %>% 
  html_text()

personTagLineXPath <- '/html/body/div[9]/div[3]/div/div/div/div/div[2]/main/div[1]/section/div[2]/div[2]/div[1]/h2'
personTagLine <- signals %>% 
  html_nodes(xpath = personTagLineXPath) %>% 
  html_text()

# NB: "ember" IDs (e.g. "ember49") are generated dynamically by the page
# and can change between visits, so this selector may not be stable
personLocationXPath <- '//*[@id="ember49"]/div[2]/div[2]/div[1]/ul[2]/li[1]'
personLocation <- signals %>% 
  html_nodes(xpath = personLocationXPath) %>% 
  html_text()

personLocation <- personLocation %>% 
  gsub("[\r\n]", "", .) %>% 
  str_trim(.)

# Here is where I have problems

personExperienceTotalXPath <- '//*[@id="experience-section"]/ul'
personExperienceTotal <- signals %>% 
  html_nodes(xpath = personExperienceTotalXPath) %>% 
  html_text()

The very last step, personExperienceTotal, is where I go wrong: I cannot seem to scrape the experience section. When I use my own LinkedIn URL (or some random person's), it seems to work...
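
My suspicion (an assumption on my part, not something I have confirmed) is that LinkedIn lazy-loads the lower sections of a profile as you scroll, so #experience-section may not be in the DOM yet when getPageSource() runs. A minimal sketch of scrolling to the bottom a few times before grabbing the source:

# Sketch: scroll down in steps so lazily loaded sections are added to
# the DOM (assumes lazy loading is the cause -- unconfirmed)
for (i in 1:5) {
  remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
  Sys.sleep(2)  # give each batch of content time to load
}
html <- remDr$getPageSource()[[1]]
signals <- read_html(html)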

My question is: how can I click to expand the experience/education sections and then scrape them?
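
For reference, here is the kind of thing I have been sketching for the click itself. It is untested, and the XPath matching buttons whose visible text contains "more" is my guess at LinkedIn's markup (the "Show more" buttons do not appear to have stable IDs), so treat the selector as an assumption:

# Sketch: click every "Show ... more ..." button before scraping.
# Assumes the expand buttons are <button> elements whose visible text
# contains "more" (e.g. "Show 1 more education") -- not verified.
more_buttons <- remDr$findElements(using = "xpath",
                                   "//section//button[contains(., 'more')]")

for (btn in more_buttons) {
  tryCatch({
    btn$clickElement()
    Sys.sleep(2)  # let the expanded content render
  }, error = function(e) NULL)  # skip buttons that cannot be clicked
}

# Re-read the page source now that the sections are (hopefully) expanded
html <- remDr$getPageSource()[[1]]
signals <- read_html(html)

personExperienceTotal <- signals %>% 
  html_nodes(xpath = '//*[@id="experience-section"]/ul') %>% 
  html_text()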

Source: https://stackoverflow.com/questions/63784161/scraping-data-from-linkedin-using-rselenium-and-rvest
