UTF8 encoding is longer than the max length 32766

鱼传尺愫 2020-11-29 01:39

I've upgraded my Elasticsearch cluster from 1.1 to 1.2, and I now get errors when indexing a somewhat large string.

{
  "error": "IllegalArgumentException[Docu…
10 Answers
  • 2020-11-29 02:30

    One way of handling tokens that exceed the Lucene limit is to use the truncate token filter, similar to ignore_above for keyword fields. To demonstrate, I'm using a length of 5. Elasticsearch suggests ignore_above = 32766 / 4 = 8191, since a UTF-8 character may occupy at most 4 bytes: https://www.elastic.co/guide/en/elasticsearch/reference/6.3/ignore-above.html (an index-mapping sketch follows the output below).

    curl -H'Content-Type:application/json' localhost:9200/_analyze -d'{
      "filter" : [{"type": "truncate", "length": 5}],
      "tokenizer": {
        "type":    "pattern"
      },
      "text": "This movie \n= AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
    }'
    

    Output:

    {
      "tokens": [
        {
          "token": "This",
          "start_offset": 0,
          "end_offset": 4,
          "type": "word",
          "position": 0
        },
        {
          "token": "movie",
          "start_offset": 5,
          "end_offset": 10,
          "type": "word",
          "position": 1
        },
        {
          "token": "AAAAA",
          "start_offset": 14,
          "end_offset": 52,
          "type": "word",
          "position": 2
        }
      ]
    }
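
    To apply this at index time rather than only in _analyze, the truncate filter can be declared in the index settings and wired into a custom analyzer, with ignore_above on a keyword sub-field. This is a minimal sketch, assuming an index named my_index and a field named body (both names are illustrative), using the recommended length of 8191:

    curl -XPUT -H'Content-Type:application/json' localhost:9200/my_index -d'{
      "settings": {
        "analysis": {
          "filter": {
            "max_term_length": { "type": "truncate", "length": 8191 }
          },
          "analyzer": {
            "truncating_analyzer": {
              "tokenizer": "standard",
              "filter": ["lowercase", "max_term_length"]
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "body": {
            "type": "text",
            "analyzer": "truncating_analyzer",
            "fields": {
              "raw": { "type": "keyword", "ignore_above": 8191 }
            }
          }
        }
      }
    }'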
    
  • 2020-11-29 02:30

    In Solr v6+ I changed the field type from string to text_general and it solved my problem. These were my original field definitions (a Schema API sketch of the change follows below):

    <field name="body" type="string" indexed="true" stored="true" multiValued="false"/>   
    <field name="id" type="string" multiValued="false" indexed="true" required="true" stored="true"/>
    
  • 2020-11-29 02:34

    When using Logstash to index those long messages, I use this filter to truncate the long string:

        filter {
            # record the size of the original message in bytes
            ruby {
                code => "event.set('message_size', event.get('message').bytesize) if event.get('message')"
            }
            # truncate messages larger than 32000 bytes and tag them for later review
            ruby {
                code => "
                    if (event.get('message_size'))
                        event.set('message', event.get('message')[0..9999]) if event.get('message_size') > 32000
                        event.tag('long message') if event.get('message_size') > 32000
                    end
                "
            }
        }
    

    It adds a message_size field so that I can sort the longest messages by size.

    It also adds the long message tag to messages over 32000 bytes so I can select them easily.

    It doesn't solve the problem if you intend to index those long messages in full, but if, like me, you don't want them in Elasticsearch in the first place and just want to track them so you can fix them, it's a working solution.

  • 2020-11-29 02:35

    I got around this problem by changing my analyzer (a sketch of applying it at index creation follows the snippet below):

    {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "standard" : {
                        "tokenizer": "standard",
                        "filter": ["standard", "lowercase", "stop"]
                    }
                }
            }
        }
    }
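
    For reference, a minimal sketch of applying a similar analyzer when creating an index on a recent Elasticsearch version. The index name my_index, the analyzer name my_analyzer, and the field name body are assumptions; the standard token filter is omitted because newer releases removed it:

    curl -XPUT -H'Content-Type:application/json' localhost:9200/my_index -d'{
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "standard",
              "filter": ["lowercase", "stop"]
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "body": { "type": "text", "analyzer": "my_analyzer" }
        }
      }
    }'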
    