How to prevent attachments from being stored in _source with Elasticsearch and Tire?

Submitted by 天涯浪子 on 2019-12-21 19:24:54

Question


I've got some PDF attachments being indexed in Elasticsearch, using the Tire gem. It's all working great, but I'm going to have many GB of PDFs, and we will likely store the PDFs in S3 for access. Right now the base64-encoded PDFs are being stored in Elasticsearch _source, which will make the index huge. I want to have the attachments indexed, but not stored, and I haven't yet figured out the right incantation to put in Tire's "mapping" block to prevent it. The block is like this right now:

mapping do
  indexes :id, :type => 'integer'
  indexes :title
  indexes :last_update, :type => 'date'
  indexes :attachment, :type => 'attachment'
end

I've tried some variations like:

indexes :attachment, :type => 'attachment', :_source => { :enabled => false }

And it looks nice when I run the tire:import rake task, but it doesn't seem to make a difference. Does anyone know A) if this is possible? and B) how to do it?

Thanks in advance.


Answer 1:


The _source field settings can contain a list of fields that should be excluded from the stored source. I would guess that, in the case of Tire, something like this should do it:

mapping :_source => { :excludes => ['attachment'] } do
  indexes :id, :type => 'integer'
  indexes :title
  indexes :last_update, :type => 'date'
  indexes :attachment, :type => 'attachment'
end
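
For reference, here is a minimal sketch of the raw mapping that the block above is expected to produce, built as a plain Ruby hash so it can be compared with what Elasticsearch reports for the index. The type name `document` is hypothetical (Tire derives the real one from the model class), and the `string` type for `title` assumes Tire's default for fields declared without a type.

```ruby
require 'json'

# Hypothetical type name; Tire derives the real one from the model class.
TYPE_NAME = 'document'

# Expected raw mapping: the attachment is still indexed through its
# 'properties' entry, but is listed under _source excludes so that the
# base64 payload is not kept in the stored _source.
mapping = {
  TYPE_NAME => {
    '_source' => { 'excludes' => ['attachment'] },
    'properties' => {
      'id'          => { 'type' => 'integer' },
      'title'       => { 'type' => 'string' }, # assumed default type
      'last_update' => { 'type' => 'date' },
      'attachment'  => { 'type' => 'attachment' }
    }
  }
}

puts JSON.pretty_generate(mapping)
```

If the mapping returned by the index's _mapping endpoint lacks the `_source` entry, the new mapping was probably never applied. A mapping like this is generally only sent when the index is created, so an existing index may need to be deleted and re-imported before the change takes effect.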



Answer 2:


@imotov's solution does not work for me. When I execute the curl command

curl -X GET "http://localhost:9200/user_files/user_file/_search?pretty=true" -d '{"query":{"query_string":{"query":"rspec"}}}'

I can still see the content of the attachment file included in the search results.

"_source" : {"user_file":{"id":5,"folder_id":1,"updated_at":"2012-08-16T11:32:41Z","attachment_file_size":179895,"attachment_updated_at":"2012-08-16T11:32:41Z","attachment_file_name":"hw4.pdf","attachment_content_type":"application/pdf","created_at":"2012-08-16T11:32:41Z","attachment_original":"JVBERi0xL .....

Here's my implementation:

include Tire::Model::Search
include Tire::Model::Callbacks

def self.search(folder, params)
  tire.search() do
    query { string params[:query], default_operator: "AND"} if params[:query].present?
    filter :term, folder_id: folder.id
    highlight :attachment_original, :options => {:tag => "<em>"}
  end
end

mapping :_source => { :excludes => ['attachment_original'] } do
  indexes :id, :type => 'integer'
  indexes :folder_id, :type => 'integer'
  indexes :attachment_file_name
  indexes :attachment_updated_at, :type => 'date'
  indexes :attachment_original, :type => 'attachment'
end

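def to_indexed_json
  to_json(:methods => [:attachment_original])
end

# Returns the attached file as a base64 string, or nil if there is no file.
def attachment_original
  if attachment_file_name.present?
    path_to_original = attachment.path
    # Read in binary mode so the PDF bytes are not altered before encoding.
    Base64.encode64(File.binread(path_to_original))
  end
end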
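
One possible cause, which is a guess and not something confirmed in this thread: the _source shown above is wrapped in a `user_file` root key, so the field actually sits at the path `user_file.attachment_original`, while the mapping excludes the top-level name `attachment_original`. The sketch below only imitates exact-path exclusion on a nested hash to illustrate the mismatch. `exclude_paths` is a hypothetical helper, not part of Tire or Elasticsearch, and the payload is a placeholder.

```ruby
require 'json'

# Simplified stand-in for _source filtering: removes exact dotted paths
# from a nested hash. Wildcards are not handled in this sketch.
def exclude_paths(source, paths)
  result = JSON.parse(JSON.generate(source)) # deep copy
  paths.each do |path|
    *parents, leaf = path.split('.')
    # Walk down to the hash that should contain the leaf key.
    node = parents.reduce(result) { |h, key| h.is_a?(Hash) ? h[key] : nil }
    node.delete(leaf) if node.is_a?(Hash)
  end
  result
end

# Document shaped like the _source in the search result above.
doc = {
  'user_file' => {
    'id' => 5,
    'attachment_file_name' => 'hw4.pdf',
    'attachment_original' => '<base64 data>' # placeholder payload
  }
}

top_level = exclude_paths(doc, ['attachment_original'])
full_path = exclude_paths(doc, ['user_file.attachment_original'])

puts top_level['user_file'].key?('attachment_original') # still present
puts full_path['user_file'].key?('attachment_original') # removed
```

If this is what is happening, either excluding `user_file.attachment_original` or serializing the document without the root key in to_indexed_json might be worth trying, followed by recreating the index so that the mapping is applied.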


Source: https://stackoverflow.com/questions/11873248/how-to-prevent-attachments-from-being-stored-in-source-with-elasticsearch-and-t
