pandas replace (erase) different characters from strings

I have a list of high schools. I would like to erase certain characters, words, and symbols from the strings.

I currently have:

df['schoolname'] =


                      
3 Answers

花落未央, 2020-12-09 22:32

My problem: I wanted a simple way to delete characters and symbols using the pandas replace method.

I had the following values in a data frame column:

  df = array(['2012', '2016', '2011', '2013', '2015', '2017', '2001', '2007',
   '[2005], ©2004.', '2005', '2009', '2008', '2009, c2008.', '2006',
   '2019', '[2003]', '2018', '2012, c2011.', '[2012]', 'c2012.',
   '2014', '2002', 'c2005.', '[2000]', 'c2000.', '2010',
   '2008, c2007.', '2011, c2010.', '2011, ©2002.', 'c2011.', '[2017]',
   'c1996.', '[2018]', '[2019]', '[2011]', '2000', '2000, c1995.',
   '[2004]', '2005, ©2004.', 'c2004.', '[2009]', 'c2009.', '[2014]',
   '1999', '[2010]', 'c2010.', '[2006]', '2007, 2006.', '[2013]',
   'c2001.', 'C2016.', '2008, c2006.', '2011, ©2010.', '2007, c2005.',
   '2009, c2005.', 'c2002.', '[2004], c2003.', '2009, c2007.', '2003',
   '©2003.', '[2016]', '[2001]', '2010, c2001.', '[1998]', 'c1998.'],
  dtype=object)


As you can see, the years were entered using multiple formats (ugh!) with brackets and copyright symbols and lowercase c and uppercase C. 

I wanted to strip those unwanted characters and keep only the four-digit years. Because the column has dtype object, replace() has to be called through the .str accessor so that it operates on each string. Put every character you want removed into one pattern, separated by | with no spaces around it (a space would become part of the pattern).

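rep_chars = r'c|C|\]|\[|©|\.'

# regex=True is needed on pandas 2.0+, where str.replace defaults to literal matching
df['Year'] = df['Year'].str.replace(rep_chars, '', regex=True)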



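  Make sure to write \. and not a bare period: an unescaped . matches any character and would delete everything. The same goes for \] and \[, which otherwise open and close a character class.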
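
The escaping rule above can be checked on a small sample. This is a minimal sketch, assuming a hypothetical Series named years filled with a few values copied from the array shown earlier. The second pattern is my own equivalent, written as a character class, where the period needs no escape.

```python
import pandas as pd

# Hypothetical sample: a few of the messy values from the array above
years = pd.Series(['[2005], ©2004.', 'c2012.', 'C2016.', '2007, 2006.', '2019'])

# Raw string so the backslashes reach the regex engine unchanged
rep_chars = r'c|C|\]|\[|©|\.'
cleaned = years.str.replace(rep_chars, '', regex=True)

# Equivalent character class: inside [...] a period is already literal
cleaned_class = years.str.replace(r'[cC\[\]©.]', '', regex=True)

print(cleaned.tolist())
```

Both patterns remove the same characters, so which one to use is a matter of readability.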


Output:

array(['2012', '2016', '2011', '2013', '2015', '2017', '2001', '2007',
   '2005, 2004', '2005', '2009', '2008', '2009, 2008', '2006', '2019',
   '2003', '2018', '2012, 2011', '2014', '2002', '2000', '2010',
   '2008, 2007', '2011, 2010', '2011, 2002', '1996', '2000, 1995',
   '2004', '1999', '2007, 2006', '2008, 2006', '2007, 2005',
   '2009, 2005', '2004, 2003', '2009, 2007', '2010, 2001', '1998'],
  dtype=object)
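
If the goal is only the four-digit years, a different technique from the one in this answer is to pull the digits out rather than delete everything else. A rough sketch, assuming a hypothetical Series named years and assuming every year is exactly four digits:

```python
import pandas as pd

# Hypothetical sample: a few of the messy values from the array above
years = pd.Series(['[2005], ©2004.', 'c2012.', '2009, c2008.', '2019'])

# findall returns, for each row, a list of every run of four digits
found = years.str.findall(r'\d{4}')

# Keep only the first year listed in each row
first_year = found.str[0]

print(found.tolist())
print(first_year.tolist())
```

This avoids having to list every stray character in advance, at the cost of assuming the year format.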


Happy Data Cleaning!
                                                                        
                                                        
            
            
              
                
清歌不尽, 2020-12-09 22:36

Use a regex (separate the strings with |):

df['schoolname'] = df['schoolname'].str.replace('high|school', '', regex=True)
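
A runnable sketch of the same idea, assuming made-up school names (the question does not show its data). The pattern above is case-sensitive and leaves the surrounding spaces behind, so this version adds case=False and a whitespace cleanup, both of which are my additions rather than part of the answer.

```python
import pandas as pd

# Hypothetical school names, for illustration only
df = pd.DataFrame({'schoolname': ['Lincoln High School',
                                  'east high school',
                                  'Roosevelt HS']})

df['schoolname'] = (
    df['schoolname']
    .str.replace(r'high|school', '', case=False, regex=True)  # drop the words, any case
    .str.replace(r'\s+', ' ', regex=True)                     # collapse leftover runs of spaces
    .str.strip()                                              # trim leading and trailing spaces
)

print(df['schoolname'].tolist())
```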

                                                                        
                                                        
            
            
              
                
滥情空心, 2020-12-09 22:42

You can create a dictionary and pass it to the .replace({}, regex=True) method:

replacements = {
   'schoolname': {
      r'(high|school)': ''}
}

df.replace(replacements, regex=True, inplace=True)
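
A small sketch of how the nested dictionary scopes the replacement to one column. The data frame and its city column are hypothetical, and the result is assigned to a new variable instead of using inplace=True, which is a stylistic choice of mine.

```python
import pandas as pd

# Hypothetical data: 'city' also contains the words, but must stay untouched
df = pd.DataFrame({
    'schoolname': ['central high school', 'north school'],
    'city': ['high point', 'schoolcraft'],
})

# Outer key = column name, inner dict = {regex pattern: replacement}
replacements = {'schoolname': {r'(high|school)': ''}}

cleaned = df.replace(replacements, regex=True)
# The removed words leave spaces behind, so trim them afterwards
cleaned['schoolname'] = cleaned['schoolname'].str.strip()

print(cleaned)
```

Only the column named in the outer key is changed, which makes this form handy when several columns each need their own patterns.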

                                                                        
                                                        
            
            
              
                