Using str_word_count for UTF8 texts


说谎 2020-12-07 01:10

I have this text:

$text  = "Başka, küskün otomobil kaçtı buraya küskün otomobil neden kaçtı
          kaçtı buraya, oraya KISMEN @here #there J.J.Johanson h


      
      
        
2 answers

离开以前 (OP) 2020-12-07 01:38
              

            
            
                        
You will never have a perfect word-count solution, because the concept of a "word" does not exist, or is very hard to define, in some languages. Whether the text is UTF-8 or not does not matter.

Japanese and Chinese do not separate words with spaces. They do not even have a fixed word list; you have to read the whole sentence before you can identify the verbs and nouns.
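To illustrate the point, here is a minimal sketch that counts words by splitting on whitespace. The function name `count_by_spaces` and both sample sentences are made up for this example; the Japanese sentence means roughly "I read a book".

```php
<?php
// Hypothetical helper: counts whitespace-separated chunks.
// This is NOT a real word counter, it only shows where space splitting breaks down.
function count_by_spaces(string $text): int
{
    // The /u modifier makes PCRE treat the subject as UTF-8.
    $parts = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    return $parts === false ? 0 : count($parts);
}

echo count_by_spaces("I read a book"), "\n";      // 4
echo count_by_spaces("私は本を読みます"), "\n";   // 1, the whole sentence is one chunk
```

The English sentence gives a sensible count, while the Japanese one collapses into a single "word" because it contains no spaces at all.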

If you want to support multiple languages, you will need a language-specific tokenizer engine. For more information, look into full-text indexing, tokenizers, CJK tokenizers, and CJK analyzers.
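One possible starting point in PHP is the word break iterator from the `intl` extension, which wraps ICU's segmentation rules. The sketch below assumes `ext-intl` is installed; the helper name `count_words_icu` is hypothetical, and treating a segment as a word when it contains at least one letter or digit is my own simplification. How well this works for a given language depends on the ICU data available, so treat the counts as approximate.

```php
<?php
// Hypothetical helper: counts segments reported by ICU's word break iterator
// that contain at least one letter or digit. Requires the intl extension.
function count_words_icu(string $text, string $locale): int
{
    $iterator = IntlBreakIterator::createWordInstance($locale);
    if ($iterator === null) {
        throw new RuntimeException("Could not create a word break iterator");
    }
    $iterator->setText($text);

    $count = 0;
    // getPartsIterator() yields the text between consecutive boundaries,
    // including pieces that are only spaces or punctuation.
    foreach ($iterator->getPartsIterator() as $part) {
        if (preg_match('/[\p{L}\p{N}]/u', $part) === 1) {
            $count++;
        }
    }
    return $count;
}

echo count_words_icu("Hello, world!", "en"), "\n"; // 2
```

For CJK text the same call can be tried with a locale such as `ja` or `zh`, but the result should be checked against real samples before relying on it.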

If you only want to support a limited set of languages, just keep extending your regex patterns to cover more and more cases.
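As a rough sketch of the regex route, the pattern below treats a word as a run of Unicode letters, optionally joined by an apostrophe or hyphen. The function name `count_words_utf8` and the exact pattern are assumptions for illustration, not something taken from the question.

```php
<?php
// Hypothetical helper: counts runs of Unicode letters as words.
function count_words_utf8(string $text): int
{
    // \p{L} matches any letter and \p{M} any combining mark;
    // the /u modifier makes PCRE treat both pattern and subject as UTF-8.
    $pattern = '/[\p{L}\p{M}]+(?:[\'’-][\p{L}\p{M}]+)*/u';
    $matches = preg_match_all($pattern, $text);
    return $matches === false ? 0 : $matches;
}

echo count_words_utf8("Başka, küskün otomobil kaçtı buraya"), "\n"; // 5
```

Unlike a byte-oriented count, this keeps Turkish letters such as `ş` and `ü` inside their words. Tokens like `@here`, `#there` or `J.J.Johanson` are where the "more and more cases" come in: with this pattern `J.J.Johanson` is counted as three words, so the pattern has to be extended if it should count as one.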
    
             
                                                        
            
            
              
                