I am learning data scraping and, on top of that, I am quite a beginner with R (for work I use Stata; I use R only for very specific tasks). In order to learn scraping, I am
One of the users, Parfait, helped me sort out the issues, so a very big thank you goes to this user. I post the script below. I apologize if it is not precisely commented.
Here is the code.
#Loading packages
library('rvest') #to scrape
library('xml2') #provides read_html(); a missing element comes back as NA with html_node (not with html_nodes)
library('plyr') #to bind together different data sets (rbind.fill)
#check the current working directory, then set your own
getwd()
setwd("~/YOUR OWN FOLDER HERE")
#DEFINE SCRAPING FUNCTION
getProfile <- function(URL) {
##NAME
#Using CSS selectors to get the name
nam_html <- html_node(URL,'.contact-name')
#Converting the name data to text
nam <- html_text(nam_html)
#Let's have a look at the name
head(nam)
#Data-Preprocessing: removing '\n' (for the other pieces of information, I will keep \n, to help
# me separate each item within the same type of
# information)
nam<-gsub("\n","",nam)
head(nam)
#Converting the name from text to factor
nam<-as.factor(nam)
#Let's have a look at the name
head(nam)
#If I need to remove blank spaces, do this:
#Data-Preprocessing: removing excess spaces
#variable<-gsub(" ","",variable)
##MODALITIES
#Using CSS selectors to get the modality
mod_html <- html_node(URL,'.attributes-modality .copy-small')
#Converting the modality data to text
mod <- html_text(mod_html)
#Let's have a look at the modality
head(mod)
#Converting the modality from text to factor
mod<-as.factor(mod)
#Let's have a look at the modality
head(mod)
##Combining all the lists to form a data frame
onet_df<-data.frame(Name = nam,
Modality = mod)
return(onet_df)
}
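The reason for using html_node rather than html_nodes can be checked on a small inline page, without touching the website. This is only a sketch: the HTML string below is made up for illustration, and only the two CSS selectors are taken from the function above.

```r
library(rvest)
library(xml2)

# Made-up profile page: it has a name but no modality section
page <- read_html('<html><body><div class="contact-name">Jane Doe</div></body></html>')

# html_node() always gives exactly one result; a missing element becomes NA
nam <- html_text(html_node(page, '.contact-name'))
mod <- html_text(html_node(page, '.attributes-modality .copy-small'))

# html_nodes() gives zero results when nothing matches
mods <- html_text(html_nodes(page, '.attributes-modality .copy-small'))

# With html_node() the data frame still has one row, with NA for the modality
profile_df <- data.frame(Name = nam, Modality = mod)
```

With the zero-length result from html_nodes, the same data.frame() call would fail because the columns have different lengths, which is why the function sticks to html_node.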
Then, I apply this function in a loop to a few therapists. For illustrative purposes, I take four adjacent therapist IDs, without knowing a priori whether each of these IDs has actually been assigned (I do this because I want to see what happens if the loop stumbles on a non-existent link).
j <- 1
MHP_codes <- c(163805:163808) #therapist identifier
df_list <- vector(mode = "list", length(MHP_codes))
for(code1 in MHP_codes) {
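URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code1)
#Reading the HTML code from the website and scraping the profile.
#read_html() is inside tryCatch() as well, so a link that cannot be read gives NULL instead of stopping the loop
df_list[[j]] <- tryCatch(getProfile(read_html(URL)),
error = function(e) NULL)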
j <- j + 1
}
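#Binding the profiles into one data frame (rbind.fill skips the NULL entries)
final_df <- rbind.fill(df_list)
#Saving the result in the working directory
save(final_df,file="final_df.Rda")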
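The error handling can also be tried offline. In the sketch below, fetch_profile is a hypothetical stand-in for reading and scraping one page: it simply fails for odd IDs, to mimic a link that does not exist. It uses base R only, so the binding step is done with do.call(rbind, ...) instead of rbind.fill.

```r
# Hypothetical stand-in for read_html() + getProfile():
# it fails for odd IDs, the way a non-existent page would
fetch_profile <- function(code) {
  if (code %% 2 == 1) stop("page not found")
  data.frame(Code = code, Name = paste("Therapist", code))
}

codes <- 163805:163808

# One list element per ID; a failed ID becomes NULL instead of stopping the run
results <- lapply(codes, function(code) {
  tryCatch(fetch_profile(code), error = function(e) NULL)
})

# IDs that could not be scraped
failed_codes <- codes[vapply(results, is.null, logical(1))]

# Drop the NULL entries and bind the remaining data frames
final <- do.call(rbind, Filter(Negate(is.null), results))
```

Keeping failed_codes around makes it easy to see afterwards which IDs were skipped, which the loop above does not record.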