How to do a Python split() on languages (like Chinese) that don't use whitespace as word separator?

后端未结

关注

 9  2499

I want to split a sentence into a list of words.

For English and European languages this is easy, just use split()

>>> \"This is a sentence.


                      
              相关标签:


      
      
        
          9条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  生来不讨喜        
                
              
                            
                2020-12-03 04:14
              
            
            
                                                                       
Languages like Chinese have a very fluid definition of a word. E.g. One meaning of ma is "horse". One meaning of shang is "above" or "on top of". A compound is "mashang" which means literally "on horseback" but is used figuratively to mean "immediately". You need a very good dictionary with compounds in it and looking up the dictionary needs a longest-match approach. Compounding is rife in German (famous example is something like "Danube steam navigation company director's wife" being expressed as one word), Turkic languages, Finnish, and Magyar -- these languages have very long words many of which won't be found in a dictionary and need breaking down to understand them.

Your problem is one of linguistics, nothing to do with Python.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  南方客        
                
              
                            
                2020-12-03 04:15
              
            
            
                                                                       

  if str longer than 30 then take 27 chars and add '...' at the end

  otherwise return str


str='中文2018-2020年一区6、8、10、12号楼_「工程建设文档102332号」'
result = len(list(str)) >= 30 and ''.join(list(str)[:27]) + '...' or str

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  既然无缘        
                
              
                            
                2020-12-03 04:17
              
            
            
                                                                       
You can do this but not with standard library functions. And regular expressions won't help you either.

The task you are describing is part of the field called Natural Language Processing (NLP). There has been quite a lot of work done already on splitting Chinese words at word boundaries. I'd suggest that you use one of these existing solutions rather than trying to roll your own.


Chinese NLP
chinese - The Stanford NLP (Natural Language Processing) Group



  Where does the ambiguity come from?


What you have listed there is Chinese characters. These are roughly analagous to letters or syllables in English (but not quite the same as NullUserException points out in a comment). There is no ambiguity about where the character boundaries are - this is very well defined. But you asked not for character boundaries but for word boundaries. Chinese words can consist of more than one character.

If all you want is to find the characters then this is very simple and does not require an NLP library. Simply decode the message into a unicode string (if it is not already done) then convert the unicode string to a list using a call to the builtin function list. This will give you a list of the characters in the string. For your specific example:

>>> list(u"这是一个句子")

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
   
          
     上一页
1
2
           
           
        
                                  
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复