Pandas split on regex

后端未结

关注

 3  385

I have pandas df with a column containing comma-delimited characteristics like so:

Shot - Wounded/Injured, Shot - Dead (murder, accidental, suicide), Suicide


                      
              相关标签:


      
      
        
          3条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  滥情空心        
                
              
                            
                2020-12-19 06:13
              
            
            
                                                                       
Yes, split supports regex. According to your requirements,


  split based on a regex match of a comma followed by a space and a capital letter


you may use

df['incident_characteristics'].str.split(r'\s*,\s*(?=[A-Z])', expand=True)


See the regex demo.

Details


\s*,\s* - a comma enclosed with 0+ whitespaces
(?=[A-Z]) - only if followed with an uppercase ASCII letter


However, it seems you also don't want to match the comma inside parentheses, add (?![^()]*\)) lookahead that fails the match if, immediately to the right of the current location, there are 0+ chars other than ( and ) and then a ):

r'\s*,\s*(?=[A-Z])(?![^()]*\))'


and it will prevent matching commas before capitalized words inside parentheses (that has no parentheses inside).

See another regex demo.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  长发绾君心        
                
              
                            
                2020-12-19 06:14
              
            
            
                                                                       
You can try .str.extractall (but I think there are better patterns than mine).

import pandas as pd

txt = 'Shot - Wounded/Injured, Shot - Dead (murder, accidental, suicide), Suicide - Attempt, Murder/Suicide, Attempted Murder/Suicide (one variable unsuccessful), Institution/Group/Business, Mass Murder (4+ deceased victims excluding the subject/suspect/perpetrator , one location), Mass Shooting (4+ victims injured or killed excluding the subject/suspect)'
df = pd.DataFrame({'incident_characteristics': [txt]})
df['incident_characteristics'].str.extractall(r'([\w\+\-\/ ]+(\([\w\+\-\/\, ]+\))?)')[0]


Output:

#    match
# 0  0                                   Shot - Wounded/Injured
#    1                Shot - Dead (murder, accidental, suicide)
#    2                                        Suicide - Attempt
#    3                                           Murder/Suicide
#    4         Attempted Murder/Suicide (one variable unsucc...
#    5                               Institution/Group/Business
#    6         Mass Murder (4+ deceased victims excluding th...
#    7         Mass Shooting (4+ victims injured or killed e...
# Name: 0, dtype: object


If you use .str.split, the first letter will be removed as it is used as a part of delimiter.

df['incident_characteristics'].str.split(r',\s[A-Z]', expand=True)


Output:

#                         0                                         1  \
# 0  Shot - Wounded/Injured  hot - Dead (murder, accidental, suicide)
#                   2              3  \
# 0  uicide - Attempt  urder/Suicide
#                                                    4  \
# 0  ttempted Murder/Suicide (one variable unsucces...
#                            5  \
# 0  nstitution/Group/Business
#                                                    6  \
# 0  ass Murder (4+ deceased victims excluding the ...
#                                                    7
# 0  ass Shooting (4+ victims injured or killed exc...

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  清歌不尽        
                
              
                            
                2020-12-19 06:35
              
            
            
                                                                       
I would first create the data and then feed it into a dataframe, like so

import pandas as pd, re

junk = """Shot - Wounded/Injured, Shot - Dead (murder, accidental, suicide), Suicide - Attempt, Murder/Suicide, Attempted Murder/Suicide (one variable unsuccessful), Institution/Group/Business, Mass Murder (4+ deceased victims excluding the subject/suspect/perpetrator , one location), Mass Shooting (4+ victims injured or killed excluding the subject/suspect"""

rx = re.compile(r'\([^()]+\)|,(\s+)')

data = [x 
        for nugget in rx.split(junk) if nugget
        for x in [nugget.strip()] if x]

df = pd.DataFrame({'incident_characteristics': data})
print(df)


This yields

                            incident_characteristics
0                             Shot - Wounded/Injured
1                                        Shot - Dead
2                                  Suicide - Attempt
3                                     Murder/Suicide
4                           Attempted Murder/Suicide
5                         Institution/Group/Business
6                                        Mass Murder
7  Mass Shooting (4+ victims injured or killed ex...


Additionally, this assumes that commas in parentheses should be ignored when splitting.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复