How to do a Python split() on languages (like Chinese) that don't use whitespace as word separator?

Backend · Unresolved · 9 answers · 2503 views

梦如初夏 2020-12-03 03:25

I want to split a sentence into a list of words.

For English and European languages this is easy: just use split()

>>> "This is a sentence.".split()
['This', 'is', 'a', 'sentence.']
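The core of the problem: split() relies on whitespace, which Chinese text does not use between words. A minimal sketch of why the whitespace approach breaks down (the example sentence is my own, not from the question):

```python
# Whitespace splitting works for English...
english = "This is a sentence."
print(english.split())   # ['This', 'is', 'a', 'sentence.']

# ...but a Chinese sentence contains no spaces, so split()
# returns the whole sentence as a single "word".
chinese = "这是一个句子。"
print(chinese.split())   # ['这是一个句子。']

# A naive fallback is one character per token, which ignores
# multi-character words and is rarely what you want:
print(list(chinese))     # ['这', '是', '一', '个', '句', '子', '。']
```

This is why a dedicated segmenter (as in the answers below) is needed: real Chinese words are often two or more characters long, and finding the word boundaries requires a dictionary or a statistical model, not a delimiter.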
9 Answers
  •  不知归路
    2020-12-03 03:53

    The best tokenizer for Chinese is pynlpir.

    import pynlpir

    pynlpir.open()  # initialise the NLPIR engine (needs a valid licence, see below)
    mystring = "你汉语说的很好!"
    # Segment into words only; pos_tagging=True would also return part-of-speech tags
    tokenized_string = pynlpir.segment(mystring, pos_tagging=False)

    >>> tokenized_string
    ['你', '汉语', '说', '的', '很', '好', '!']
    

    Be aware that pynlpir has a notorious but easily fixable licensing problem, for which you can find plenty of solutions online. You simply need to replace the NLPIR.user file in your NLPIR folder with a valid licence downloaded from this repository, then restart your environment.
