Keras Tokenizer num_words doesn't seem to work

后端未结

关注

 3  1896

>>> t = Tokenizer(num_words=3)
>>> l = [\"Hello, World! This is so&#$ fantastic!\", \"There is no other world like this one\"]
>>> t.f


                      
              相关标签:


      
      
        
          3条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  既然无缘        
                
              
                            
                2020-12-15 05:34
              
            
            
                                                                       
There is nothing wrong in what you are doing. word_index is computed the same way no matter how many most frequent words you will use later (as you may see here). So when you will call any transformative method - Tokenizer will use only three most common words and at the same time, it will keep the counter of all words - even when it's obvious that it will not use it later.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  佛祖请我去吃肉        
                
              
                            
                2020-12-15 05:49
              
            
            
                                                                       
Limiting num_words to a small number (eg, 3) has no effect on fit_on_texts outputs such as word_index, word_counts, word_docs. It does have effect on texts_to_matrix. The resulting matrix will have num_words (3) columns.

>>> t = Tokenizer(num_words=3)
>>> l = ["Hello, World! This is so&#$ fantastic!", "There is no other world like this one"]
>>> t.fit_on_texts(l)
>>> print(t.word_index)
{'world': 1, 'this': 2, 'is': 3, 'hello': 4, 'so': 5, 'fantastic': 6, 'there': 7, 'no': 8, 'other': 9, 'like': 10, 'one': 11}

>>> t.texts_to_matrix(l, mode='count')
array([[0., 1., 1.],       
       [0., 1., 1.]])

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  甜味超标        
                
              
                            
                2020-12-15 06:00
              
            
            
                                                                       
Just a add on Marcin's answer ("it will keep the counter of all words - even when it's obvious that it will not use it later.").

The reason it keeps counter on all words is that you can call fit_on_texts multiple times. Each time it will update the internal counters, and when transformations are called, it will use the top words based on the updated counters.

Hope it helps.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复