How do I clean twitter data in R?

后端未结

关注

 5  1346

I extracted tweets from twitter using the twitteR package and saved them into a text file.

I have carried out the following on the corpus

xx<-tm


                      
              相关标签:


      
      
        
          5条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  走了就别回头了        
                
              
                            
                2020-12-13 23:08
              
            
            
                                                                       
For me, this code did not work, for some reason-

# Get rid of URLs
clean_tweet <- str_replace_all(clean_tweet, "http://t.co/[a-z,A-Z,0-9]*{8}","")


Error was-

Error in stri_replace_all_regex(string, pattern, fix_replacement(replacement),  : 
 Syntax error in regexp pattern. (U_REGEX_RULE_SYNTAX)


So, instead, I used

clean_tweet4 <- str_replace_all(clean_tweet3, "https://t.co/[a-z,A-Z,0-9]*","")
clean_tweet5 <- str_replace_all(clean_tweet4, "http://t.co/[a-z,A-Z,0-9]*","")


to get rid of URLs
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  梦如初夏        
                
              
                            
                2020-12-13 23:13
              
            
            
                                                                       
Using gsub and 


  stringr package 


I have figured out part of the solution for removing retweets, references to screen names, hashtags, spaces, numbers, punctuations, urls . 

  clean_tweet = gsub("&amp", "", unclean_tweet)
  clean_tweet = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", clean_tweet)
  clean_tweet = gsub("@\\w+", "", clean_tweet)
  clean_tweet = gsub("[[:punct:]]", "", clean_tweet)
  clean_tweet = gsub("[[:digit:]]", "", clean_tweet)
  clean_tweet = gsub("http\\w+", "", clean_tweet)
  clean_tweet = gsub("[ \t]{2,}", "", clean_tweet)
  clean_tweet = gsub("^\\s+|\\s+$", "", clean_tweet) 


ref: ( Hicks , 2014)
After the above 
I did the below.

 #get rid of unnecessary spaces
clean_tweet <- str_replace_all(clean_tweet," "," ")
# Get rid of URLs
clean_tweet <- str_replace_all(clean_tweet, "http://t.co/[a-z,A-Z,0-9]*{8}","")
# Take out retweet header, there is only one
clean_tweet <- str_replace(clean_tweet,"RT @[a-z,A-Z]*: ","")
# Get rid of hashtags
clean_tweet <- str_replace_all(clean_tweet,"#[a-z,A-Z]*","")
# Get rid of references to other screennames
clean_tweet <- str_replace_all(clean_tweet,"@[a-z,A-Z]*","")   


ref: (Stanton 2013)

Before doing any of the above I collapsed the whole string into a single long character using the below.

paste(mytweets, collapse=" ")

This cleaning process has worked for me quite well as opposed to the tm_map transforms.

All that I am left with now is a set of proper words and a very few improper words.
Now, I only have to figure out how to remove the non proper english words. 
Probably i will have to subtract my set of words from a dictionary of words. 
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  梦如初夏        
                
              
                            
                2020-12-13 23:20
              
            
            
                                                                       
The code do some basic cleaning

Converts into lowercase

df <- tm_map(df, tolower)  


Removing Special characters

df <- tm_map(df, removePunctuation)


Removing Special characters

df <- tm_map(df, removeNumbers)


Removing common words

df <- tm_map(df, removeWords, stopwords('english'))


Removing URL

removeURL <- function(x) gsub('http[[:alnum;]]*', '', x)

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  逝去的感伤        
                
              
                            
                2020-12-13 23:21
              
            
            
                                                                       
    library(tidyverse)    

    clean_tweets <- function(x) {
                x %>%
                        str_remove_all(" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)") %>%
                        str_replace_all("&amp;", "and") %>%
                        str_remove_all("[[:punct:]]") %>%
                        str_remove_all("^RT:? ") %>%
                        str_remove_all("@[[:alnum:]]+") %>%
                        str_remove_all("#[[:alnum:]]+") %>%
                        str_replace_all("\\\n", " ") %>%
                        str_to_lower() %>%
                        str_trim("both")
        }

    tweets %>% clean_tweets

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  既然无缘        
                
              
                            
                2020-12-13 23:26
              
            
            
                                                                       
To remove the URLs you could try the following:

removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
xx <- tm_map(xx, removeURL)


Possibly you could define similar functions to further transform the text.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复