ActiveRecord find_each combined with limit and order

Asked by 温柔的废话 on 2020-12-02 11:41

I'm trying to run a query over about 50,000 records using ActiveRecord's find_each method, but it seems to be ignoring my other parameters, like so:
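
For example, a call along these lines (reconstructed for illustration) iterates over every record in primary-key order, ignoring both the ORDER BY and the LIMIT:

    Thing.order('created_at DESC').limit(50_000).find_each do |thing|
      puts thing.id
    end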



        
13 Answers
  • One option is to put an implementation tailored to your particular model into the model itself (speaking of which, id is usually a better choice than created_at for ordering records, since created_at may have duplicates):

    class Thing < ActiveRecord::Base
      # Yields records newest-first together with a 1-based index,
      # stopping once `limit` records have been yielded.
      # Note: batching is done on id, so this assumes ids increase with created_at.
      def self.find_each_desc(limit)
        batch_size = 1000
        i = 1
        records = self.order(created_at: :desc).limit(batch_size)
        while records.any?
          records.each do |task|
            yield task, i
            i += 1
            return if i > limit
          end
          records = self.order(created_at: :desc).where('id < ?', records.last.id).limit(batch_size)
        end
      end
    end
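
    For illustration, calling it could look like this (the block receives each record together with its running index):

    Thing.find_each_desc(50_000) do |thing, i|
      puts "#{i}: #{thing.id}"
    end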
    

    Or you can generalize it a bit and make it work for all models:

    lib/active_record_extensions.rb:

    ActiveRecord::Batches.module_eval do
      # Same idea as above, but ordering and batching on id,
      # so it works for any model with a numeric primary key.
      def find_each_desc(limit)
        batch_size = 1000
        i = 1
        records = self.order(id: :desc).limit(batch_size)
        while records.any?
          records.each do |task|
            yield task, i
            i += 1
            return if i > limit
          end
          records = self.order(id: :desc).where('id < ?', records.last.id).limit(batch_size)
        end
      end
    end
    
    ActiveRecord::Querying.module_eval do
      delegate :find_each_desc, to: :all
    end
    

    config/initializers/extensions.rb:

    require "active_record_extensions"
    

    P.S. I'm putting the code in files according to this answer.

  • 2020-12-02 12:43

    As @Kirk remarked in one of the comments, find_each supports limit as of Rails 5.1.0.

    Example from the changelog:

    Post.limit(10_000).find_each do |post|
      # ...
    end
    

    The documentation says:

    Limits are honored, and if present there is no requirement for the batch size: it can be less than, equal to, or greater than the limit.

    (setting a custom order is still not supported though)
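
    So, for example, the two can be combined (batch size chosen arbitrarily here):

    Post.limit(10_000).find_each(batch_size: 2_000) do |post|
      # yields at most 10,000 posts, fetched 2,000 at a time
    end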

  • 2020-12-02 12:43

    Adding find_in_batches_with_order solved my use case: I already had the ids but needed batching and ordering. It was inspired by @dirk-geurs' solution.

    # Create file config/initializers/find_in_batches_with_order.rb with the following code.
    ActiveRecord::Batches.class_eval do
      ## Only a flat order structure is supported for now:
      ## [:forename, :surname] works, but [:forename, { surname: :asc }] does not
      def find_in_batches_with_order(ids: nil, order: [], batch_size: 1000)
        relation = self
        arrangement = order.dup
        index = order.find_index(:id)

        # Make sure id is part of the ordering so we can extract it per row
        unless index
          arrangement.push(:id)
          index = arrangement.length - 1
        end

        # pluck returns flat values for a single column and arrays otherwise
        ids ||= relation.order(*arrangement).pluck(*arrangement).map do |row|
          arrangement.length == 1 ? row : row[index]
        end

        ids.each_slice(batch_size) do |chunk_ids|
          chunk_relation = relation.where(id: chunk_ids).order(*order)
          yield(chunk_relation)
        end
      end
    end
    

    Leaving Gist here https://gist.github.com/the-spectator/28b1176f98cc2f66e870755bb2334545
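
    For illustration, a call might look like this (the User model and its columns here are hypothetical):

    # `surname` and `forename` are assumed columns
    User.find_in_batches_with_order(order: [:surname, :forename], batch_size: 500) do |batch|
      batch.each { |user| puts user.surname }
    end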

  • 2020-12-02 12:43

    Do it in one query and avoid iterating:

    User.offset(2).order('name DESC').last(3)

    will produce a query like this:

    SELECT "users".* FROM "users" ORDER BY name ASC LIMIT $1 OFFSET $2  [["LIMIT", 3], ["OFFSET", 2]]

    (last(3) reverses the relation's order before running the query, which is why the SQL sorts ASC, and then reverses the loaded records back in Ruby.)

  • 2020-12-02 12:44

    I had the same problem with a DISTINCT ON query, where you need an ORDER BY on that field, so this is my approach with Postgres:

    def filtered_model_ids
      # DISTINCT ON requires the ORDER BY to start with the same expression
      Model.joins(:father_model)
           .select('DISTINCT ON (model.field) model.id')
           .order(:field)
           .map(&:id)
    end

    def processor
      filtered_model_ids.each_slice(BATCH_SIZE) do |batch|
        # Note: find does not guarantee the order of records within a batch
        Model.find(batch).each do |record|
          # Code
        end
      end
    end
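
    If the order within each batch matters, on Rails 7+ each chunk could be re-sorted by the id list with in_order_of, for example:

    Model.in_order_of(:id, batch).each do |record|
      # records arrive in the same order as the ids in `batch`
    end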
    
  • 2020-12-02 12:45

    The documentation says that find_each and find_in_batches don't retain sort order and limit because:

    • Sorting ASC on the PK is used to make the batch ordering work.
    • Limit is used to control the batch sizes.

    You could write your own version of this function like @rorra did, but you can get into trouble when mutating the objects: if, for example, you sort by created_at and save an object, it may come up again in one of the next batches. Similarly, you might skip objects because the order of the results changed between the queries for successive batches. Only use that solution with read-only objects.

    Now, my primary concern was that I didn't want to load 30,000+ objects into memory at once; my concern was not the execution time of the query itself. Therefore I used a solution that executes the original query but caches only the IDs. It then divides the array of IDs into chunks and queries/instantiates the objects per chunk. This way you can safely mutate the objects, because the sort order is kept in memory.

    Here is a minimal example similar to what I did:

    batch_size = 512
    ids = Thing.order('created_at DESC').pluck(:id) # Replace the order/scope with your own
    ids.each_slice(batch_size) do |chunk|
      # FIELD() (MySQL-specific) keeps each chunk in the plucked id order
      Thing.where(id: chunk).order(Arel.sql("field(id, #{chunk.join(',')})")).each do |thing|
        # Do things with thing
      end
    end
    

    The trade-offs of this solution are:

    • The complete query is executed to get the IDs
    • An array of all the IDs is kept in memory
    • Uses the MySQL-specific FIELD() function (see the Postgres sketch after this list)
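
    For Postgres, a rough equivalent of FIELD() (a sketch, not from the original answer) is array_position, assuming integer ids:

    # Postgres: sort each chunk by each id's position in the plucked array
    Thing.where(id: chunk).order(Arel.sql("array_position(ARRAY[#{chunk.join(',')}], id)")).each do |thing|
      # Do things with thing
    end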

    Hope this helps!
