Splitting on regex without removing delimiters


So, I would like to split this text into sentences.

s = "You! Are you Tom? I am Danny."

so I get:

["You!", "Are you To


                      


      
      
        
5 Answers

难免孤独
2020-12-11 20:34

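If you prefer the split method to match, one solution is to split with a capturing group:

splitted = list(filter(None, re.split(r'(.*?[.!?])', s)))


filter(None, ...) drops the empty strings that re.split leaves between matches; list() materializes the result, since filter returns a lazy iterator in Python 3.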

This works even if there are no spaces between sentences, and it also keeps a trailing sentence that ends with a different punctuation mark, such as a Unicode ellipsis, or with no punctuation at all.
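As a quick check of that behavior, here is a minimal sketch. The helper name `split_keep` and the sample strings are made up for illustration and are not part of the original answer.

```python
import re

def split_keep(text):
    """Split text into chunks that each end with '.', '!' or '?'.

    Hypothetical helper: the capturing group makes re.split return the
    matched chunks themselves, and filter(None, ...) drops the empty
    strings left between adjacent matches.
    """
    return list(filter(None, re.split(r'(.*?[.!?])', text)))

# No spaces between sentences, plus a tail ending in a Unicode ellipsis
print(split_keep("Hi!Bye?Trailing\u2026"))

# With spaces, the leading whitespace stays attached to each chunk
print(split_keep("You! Are you Tom? I am Danny."))
```

The chunks keep their leading whitespace, so a `strip()` pass is still needed if clean sentences are wanted.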

It is even possible to keep your regex as is (with the escaping corrected and parentheses added):

splitted = list(filter(None, re.split(r'([.!?])', s)))


Then merge the even-indexed (text) and odd-indexed (delimiter) elements and strip the extra spaces.
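The merge step could look roughly like the sketch below. The function name `split_and_merge` is hypothetical, and the sketch skips the filter call on purpose so that text and delimiters keep alternating strictly.

```python
import re

def split_and_merge(text):
    """Split on '.', '!' or '?' and glue each delimiter back onto its text."""
    # With a capturing group, re.split returns
    # [text, delim, text, delim, ..., trailing_text]
    parts = re.split(r'([.!?])', text)

    # Even indexes hold text, odd indexes hold the delimiter that followed it
    sentences = [
        (chunk + delim).strip()
        for chunk, delim in zip(parts[0::2], parts[1::2])
    ]

    # The last element is whatever came after the final delimiter
    tail = parts[-1].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_and_merge("You! Are you Tom? I am Danny."))
```

Because a capturing split always returns an odd number of elements, the last one can be treated as leftover text that had no closing delimiter.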

See also: Python split() without removing the delimiter
                                                                        
                                                        
            
            
              
                
面向向阳花
2020-12-11 20:40

Strictly speaking, you don't want to split on '!?.', but rather on the whitespace that follows those characters. The following will work:

>>> import re
>>> re.split(r'(?<=[.!?])\s+', s)
['You!', 'Are you Tom?', 'I am Danny.']


This splits on whitespace, but only if it is preceded by either a ., !, or ? character.
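A small sketch of the same idea wrapped in a function follows. The name `split_sentences` and the second sample string are assumptions for illustration. It uses `\s+` rather than `\s*`: since Python 3.7, re.split also splits on empty matches, so a pattern that can match zero characters after the final period would add an empty string to the result.

```python
import re

def split_sentences(text):
    """Split on whitespace that directly follows '.', '!' or '?'."""
    # The lookbehind (?<=...) checks the delimiter without consuming it,
    # so only the whitespace is removed by the split.
    return re.split(r'(?<=[.!?])\s+', text)

print(split_sentences("You! Are you Tom? I am Danny."))

# Runs of spaces and newlines are consumed as a single separator
print(split_sentences("One.  Two!\nThree?"))
```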
                                                                        
                                                        
            
            
              
                
青春惊慌失措
2020-12-11 20:48

You can achieve this by splitting on an empty string preceded by one of the delimiters (a zero-length match):

(?<=[.!?])


Demo: https://regex101.com/r/ZLDXr1/1

Python versions before 3.7 do not split on zero-length matches; since 3.7, re.split does. The pattern also works in other languages that support lookbehinds.
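On Python 3.7 or later, re.split does accept patterns that can match the empty string, so the zero-length lookbehind can be tried directly. The following is a minimal sketch under that version assumption; the helper name `split_zero_width` is hypothetical.

```python
import re

def split_zero_width(text):
    """Split right after each '.', '!' or '?' without consuming anything.

    Assumes Python 3.7+, where re.split splits on empty matches.
    """
    parts = re.split(r'(?<=[.!?])', text)
    # A delimiter at the very end of the text leaves an empty last piece
    return [p for p in parts if p]

print(split_zero_width("You!Are you Tom?I am Danny."))
```

Since nothing is consumed, any whitespace between sentences stays at the start of the following piece.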

However, based on your input/output samples, you need to split on the spaces preceded by one of the delimiters, so the regex would be:

(?<=[.!?])\s+


Demo: https://regex101.com/r/ZLDXr1/2

Python demo: https://ideone.com/z6nZi5

If the spaces are optional, the re.findall solution suggested by @Psidom is the best one, I believe.
                                                                        
                                                        
            
            
              
                
南旧
2020-12-11 20:49

You can use re.findall with the regex .*?[.!?]; the lazy quantifier *? makes each match stop at the first delimiter it reaches:

import re

s = """You! Are you Tom? I am Danny."""
# Raw string; '?' needs no escaping inside a character class
re.findall(r'.*?[.!?]', s)
# ['You!', ' Are you Tom?', ' I am Danny.']
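The matches above keep the space that preceded each sentence. A minimal sketch of one way to tidy that up is shown here; the wrapper name `find_sentences` and the second sample string are assumptions for illustration.

```python
import re

def find_sentences(text):
    """Return each chunk ending in '.', '!' or '?', with outer spaces removed."""
    # The lazy .*? stops at the first delimiter; strip() removes the
    # whitespace that separated the sentence from the previous one.
    return [m.strip() for m in re.findall(r'.*?[.!?]', text)]

print(find_sentences("You! Are you Tom? I am Danny."))

# Text after the last delimiter never matches, so it is not returned
print(find_sentences("Hi!Bye?tail"))
```

Unlike the split-based approaches, this one silently drops any trailing text that has no closing delimiter.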

                                                                        
                                                        
            
            
              
                
猫巷女王i
2020-12-11 20:56

The easiest way is to use nltk.

import nltk
# Needs NLTK's Punkt tokenizer data, installed separately via nltk.download
nltk.sent_tokenize(s)


It returns a list of all your sentences without losing the delimiters.
                                                                        
                                                        
            
            
              
                