Web scraping and looping through pages with R

余生分开走 2021-01-13 12:06

I am learning data scraping and, on top of that, I am quite a beginner with R (for work I use Stata; I use R only for very specific tasks). In order to learn scraping, I am

2 answers
  •  渐次进展
    2021-01-13 12:50

    One of the users, Parfait, helped me sort out the issues, so a very big thank you goes to this user. Below I post the script. I apologize if it is not precisely commented.

    Here is the code.

    #Loading packages
    library('rvest') #to scrape
    library('xml2')  #to handle missing values (html_node gives NA when nothing matches; html_nodes drops the element)
    library('plyr')  #to bind together different data sets (rbind.fill)
    
    #check the current working directory, then set it
    getwd()
    setwd("~/YOUR OWN FOLDER HERE")
    
    #DEFINE SCRAPING FUNCTION
    #'page' is a document already parsed with read_html(), not a URL string
    getProfile <- function(page) {
    
      ##NAME
      #Using a CSS selector to pick the name node
      nam_html <- html_node(page, '.contact-name')
      #Converting the name node to text
      nam <- html_text(nam_html)
      #Data pre-processing: removing '\n' (for the other fields I keep '\n', to help
      #me separate each item within the same type of information)
      nam <- gsub("\n", "", nam)
      #Converting the name from text to factor
      nam <- as.factor(nam)
      #If I need to remove blank spaces as well, do this:
      #variable <- gsub(" ", "", variable)
    
      ##MODALITIES
      #Using a CSS selector to pick the modality node
      mod_html <- html_node(page, '.attributes-modality .copy-small')
      #Converting the modality node to text
      mod <- html_text(mod_html)
      #Converting the modality from text to factor
      mod <- as.factor(mod)
    
      ##Combining the fields into a one-row data frame
      onet_df <- data.frame(Name = nam,
                            Modality = mod)
    
      return(onet_df)
    }
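
As a side note on the `html_node` versus `html_nodes` remark in the package comments, here is a minimal sketch of the difference. The HTML below is a made-up two-card page (the `.card` class and its contents are assumptions for illustration, not the real site's markup), so it runs without any network access.

```r
library(rvest)

# Hypothetical page: the second card has no modality block
page <- read_html('
  <div class="card"><span class="contact-name">A</span>
    <div class="attributes-modality"><span class="copy-small">Individuals</span></div></div>
  <div class="card"><span class="contact-name">B</span></div>')

cards <- html_nodes(page, ".card")

# html_node() gives one result per card, with NA where nothing matches
mod_aligned <- html_text(html_node(cards, ".attributes-modality .copy-small"))

# html_nodes() drops the missing match, so the lengths no longer line up
mod_dropped <- html_text(html_nodes(cards, ".attributes-modality .copy-small"))

length(mod_aligned)  # one entry per card
length(mod_dropped)  # only the cards that had a match
```

Keeping one entry per card is what lets the fields go into `data.frame()` side by side without a length mismatch.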
    

    Then I apply this function in a loop to a few therapists. For illustrative purposes, I take four adjacent therapist IDs, without knowing a priori whether each of these IDs has actually been assigned (I do this because I want to see what happens if the loop stumbles on a non-existent link).

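    MHP_codes <- 163805:163808 #therapist identifiers
    df_list <- vector(mode = "list", length = length(MHP_codes))
    for (j in seq_along(MHP_codes)) {
      profile_url <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', MHP_codes[j])
      #Reading the HTML code from the website inside tryCatch, so that a
      #non-existent page (an HTTP error) returns NULL instead of stopping the loop
      profile <- tryCatch(getProfile(read_html(profile_url)),
                          error = function(e) NULL)
      #Storing only real results: assigning NULL with [[<- would delete the list slot
      if (!is.null(profile)) df_list[[j]] <- profile
    }
    
    #rbind.fill ignores the NULL entries left by failed pages
    final_df <- rbind.fill(df_list)
    save(final_df, file = "final_df.Rda")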
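
For anyone who prefers `lapply()` to an explicit counter, the same skip-on-error idea can be sketched as below. This is only an illustration under stated assumptions: `fetch_profile()` is a hypothetical stand-in for `read_html()` plus `getProfile()`, and it pretends that even IDs do not exist, so the example runs offline. Base R's `rbind` is used here instead of `rbind.fill` purely to avoid extra dependencies.

```r
# Hypothetical stand-in for read_html() + getProfile(); fails for "unassigned" IDs
fetch_profile <- function(code) {
  if (code %% 2 == 0) stop("HTTP error 404")  # assumption: even IDs do not exist
  data.frame(Name = paste("Therapist", code), stringsAsFactors = FALSE)
}

codes <- 163805:163808

# lapply() keeps one slot per ID; a failed request becomes NULL
results <- lapply(codes, function(code) {
  tryCatch(fetch_profile(code), error = function(e) NULL)
})

# Drop the NULLs, then stack the remaining one-row data frames
ok <- Filter(Negate(is.null), results)
final_df <- do.call(rbind, ok)

nrow(final_df)  # number of IDs that resolved
```

One detail worth knowing with the counter-based version: on a list, `x[[j]] <- NULL` removes element `j` rather than storing a `NULL`, which is why collecting results with `lapply()` (or checking `is.null()` before assigning) tends to be less surprising.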
    
