I've upgraded my Elasticsearch cluster from 1.1 to 1.2, and now I get errors when indexing a somewhat big string:
{
  "error": "IllegalArgumentException[Document contains at least one immense term in field=… (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms.]"
}
If you really want not_analyzed on the property because you want to do some exact filtering, then you can use "ignore_above": 256.
Here is an example of how I use it in PHP:

'mapping' => [
    'type' => 'multi_field',
    'path' => 'full',
    'fields' => [
        '{name}' => [
            'type' => 'string',
            'index' => 'analyzed',
            'analyzer' => 'standard',
        ],
        'raw' => [
            'type' => 'string',
            'index' => 'not_analyzed',
            'ignore_above' => 256,
        ],
    ],
],
In your case you probably want to do as John Petrone told you and set "index": "no", but for anyone else finding this question after searching on that exception, like me, your options are (sketched in the mapping fragment after this list):
"index": "no"
"index": "analyzed"
"index": "not_analyzed"
"ignore_above": 256
It depends on whether and how you want to filter on that property.
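To make those options concrete, here is a minimal mapping fragment in plain Elasticsearch JSON (rather than the PHP array above); the type and field names are made up for illustration:

"mappings": {
  "my_type": {
    "properties": {
      "title":  { "type": "string", "index": "analyzed", "analyzer": "standard" },
      "status": { "type": "string", "index": "not_analyzed", "ignore_above": 256 },
      "blob":   { "type": "string", "index": "no" }
    }
  }
}

Here title is tokenized and full-text searchable, status can be filtered on its exact value (values longer than 256 characters are simply not indexed), and blob is kept in _source only and cannot be queried.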
If you are using searchkick, upgrade Elasticsearch to >= 2.2.0 and make sure you are using searchkick 1.3.4 or later.
This version of searchkick sets ignore_above = 256 by default, so you won't get this error when the UTF-8 encoding of a value is longer than 32766 bytes.
This is discussed here.
I needed to change the index part of the mapping to no instead of not_analyzed. That way the value is not indexed. It remains available in the returned document (from a search, a get, …) but I can't query it.
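For reference, a minimal sketch of such a mapping in plain JSON (the index, type and field names are hypothetical):

PUT /my_index/_mapping/my_type
{
  "my_type": {
    "properties": {
      "big_field": { "type": "string", "index": "no" }
    }
  }
}

The value still comes back in _source, but no terms are generated for it, so the immense-term error goes away.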
So you are running into an issue with the maximum size for a single term. When you set a field to not_analyzed it will be treated as one single term. The maximum size for a single term in the underlying Lucene index is 32766 bytes, which is, I believe, hard coded.
Your two primary options are to either change the type to binary or to continue to use string but set the index type to "no".
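The "index": "no" variant is sketched under the answer above; for the binary alternative, a rough sketch (again with hypothetical names) would be the following. Note that binary fields expect base64-encoded values and are not indexed at all:

PUT /my_index/_mapping/my_type
{
  "my_type": {
    "properties": {
      "big_field": { "type": "binary" }
    }
  }
}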
I've stumbled upon the same error message with Drupal's Search API attachments module:
Document contains at least one immense term in field="saa_saa_file_entity" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms.
Changing the field's type from string to Fulltext (in /admin/config/search/search-api/index/elastic_index/fields) solved the problem for me.
There is a better option than the one John posted, because with that solution you can't search on the value anymore.
Back to the problem:
The problem is that, by default, field values are used as a single term (the complete string). If that term/string is longer than 32766 bytes it can't be stored in Lucene.
Older versions of Lucene only register a warning when terms are too long (and ignore the value). Newer versions throw an exception. See the bugfix: https://issues.apache.org/jira/browse/LUCENE-5472
Solution:
The best option is to define a (custom) analyzer on the field with the long string value. The analyzer can split the long string into smaller strings/terms, which fixes the problem of too-long terms.
Don't forget to also add an analyzer to the "_all" field if you are using that functionality.
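As an illustration, here is a rough sketch of what such an index definition could look like; the index, type, field and analyzer names (my_index, my_type, big_field, long_text_analyzer) are made up, and the standard tokenizer is just one possible way to break the long value into smaller terms:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "long_text_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "_all": { "analyzer": "long_text_analyzer" },
      "properties": {
        "big_field": {
          "type": "string",
          "analyzer": "long_text_analyzer"
        }
      }
    }
  }
}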
Analyzers can be tested with the REST API: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-analyze.html
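For example, something along these lines should work against a 1.x cluster (using the hypothetical index and analyzer names from the sketch above; newer Elasticsearch versions expect a JSON body with "analyzer" and "text" instead of query parameters):

curl -XGET 'localhost:9200/my_index/_analyze?analyzer=long_text_analyzer&pretty' -d 'some very long text that previously blew past the 32766 byte term limit'

The response lists the tokens the analyzer produces, so you can check that the long value really gets split into small terms.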