Remove duplicate records based on multiple columns?

Posted by 大城市里の小女人 on 2019-11-26 18:50:34

Question


I'm using Heroku to host my Ruby on Rails application and for one reason or another, I may have some duplicate rows.

Is there a way to delete duplicate records based on 2 or more criteria but keep just 1 record of that duplicate collection?

In my use case, I have a Make and Model relationship for cars in my database.

Make      Model
---       ---
Name      Name
          Year
          Trim
          MakeId

I'd like to delete all Model records that have the same Name, Year and Trim but keep 1 of those records (meaning, I need the record but only once). I'm using Heroku console so I can run some active record queries easily.

Any suggestions?


Answer 1:


class Model # reopens the existing ActiveRecord model

  def self.dedupe
    # load all models and group them on the keys that should be unique together
    grouped = all.group_by { |model| [model.name, model.year, model.trim, model.make_id] }
    grouped.values.each do |duplicates|
      # keep the first record of each group (use pop to keep the last one instead)
      duplicates.shift
      # anything still left in the group is a duplicate, so destroy it
      duplicates.each(&:destroy)
    end
  end

end

Model.dedupe
  • Find all records
  • Group them on the keys that must be unique together
  • Loop over the values of the resulting hash (each value is one group of records)
  • Remove the first record from each group, because you want to retain one copy
  • Destroy the rest
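
The steps above can be rehearsed without touching the database. The following is a minimal plain-Ruby sketch, not part of the original answer: `Car` is a hypothetical Struct standing in for the Model records, and `ids_to_destroy` is a made-up helper that only reports which ids the dedupe would destroy.

```ruby
# Hypothetical stand-in for the ActiveRecord Model rows (an assumption for illustration).
Car = Struct.new(:id, :name, :year, :trim, :make_id)

# Returns the ids that would be destroyed when one record per key group is kept.
def ids_to_destroy(records, keys)
  records.group_by { |r| keys.map { |k| r[k] } } # group on the uniqueness keys
         .values
         .flat_map { |group| group.drop(1) }     # skip the first record of each group
         .map(&:id)
end

cars = [
  Car.new(1, "Civic",  2012, "LX", 10),
  Car.new(2, "Civic",  2012, "LX", 10),
  Car.new(3, "Civic",  2012, "EX", 10),
  Car.new(4, "Civic",  2012, "LX", 10),
  Car.new(5, "Accord", 2012, "LX", 10)
]

doomed = ids_to_destroy(cars, %i[name year trim make_id])
puts "would destroy ids: #{doomed.inspect}"
```

With this sample data only ids 2 and 4 are reported, since they repeat the name, year, trim and make_id of id 1.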



Answer 2:


Suppose your User table contains the following data:

User.all =>
[
    #<User id: 15, name: "a", email: "a@gmail.com", created_at: "2013-08-06 08:57:09", updated_at: "2013-08-06 08:57:09">, 
    #<User id: 16, name: "a1", email: "a@gmail.com", created_at: "2013-08-06 08:57:20", updated_at: "2013-08-06 08:57:20">, 
    #<User id: 17, name: "b", email: "b@gmail.com", created_at: "2013-08-06 08:57:28", updated_at: "2013-08-06 08:57:28">, 
    #<User id: 18, name: "b1", email: "b1@gmail.com", created_at: "2013-08-06 08:57:35", updated_at: "2013-08-06 08:57:35">, 
    #<User id: 19, name: "b11", email: "b1@gmail.com", created_at: "2013-08-06 09:01:30", updated_at: "2013-08-06 09:01:30">, 
    #<User id: 20, name: "b11", email: "b1@gmail.com", created_at: "2013-08-06 09:07:58", updated_at: "2013-08-06 09:07:58">] 

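Some email addresses are duplicated, so the aim is to remove the duplicate rows from the users table. Note that the query in Step 1 groups on both email and name, so a row counts as a duplicate only when both values match (here, only id 20 duplicates id 19); group on :email alone to deduplicate by email only.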

Step 1:

Get the lowest id for each distinct combination of email and name.

ids = User.select("MIN(id) as id").group(:email,:name).collect(&:id)
=> [15, 16, 18, 19, 17]

Step 2:

Remove every row from the users table whose id is not in that list of distinct record ids.

Now the ids array holds the following ids.

[15, 16, 18, 19, 17]
User.where("id NOT IN (?)",ids)  # To get all duplicate records
User.where("id NOT IN (?)",ids).destroy_all

** RAILS 4 **

ActiveRecord 4 introduces the .not method which allows you to write the following in Step 2:

User.where.not(id: ids).destroy_all
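
To see how the choice of grouping columns changes the result, here is a small plain-Ruby sketch that replays Step 1 and Step 2 on the sample rows above, in memory and without ActiveRecord. `UserRow` and `min_ids` are illustrative names, not part of the answer, and the ids are sorted for readability, whereas the SQL query returns them in no guaranteed order.

```ruby
# Sample rows copied from the User.all output above (timestamps omitted).
UserRow = Struct.new(:id, :name, :email)

users = [
  UserRow.new(15, "a",   "a@gmail.com"),
  UserRow.new(16, "a1",  "a@gmail.com"),
  UserRow.new(17, "b",   "b@gmail.com"),
  UserRow.new(18, "b1",  "b1@gmail.com"),
  UserRow.new(19, "b11", "b1@gmail.com"),
  UserRow.new(20, "b11", "b1@gmail.com")
]

# In-memory equivalent of SELECT MIN(id) ... GROUP BY <columns>.
def min_ids(rows, *columns)
  rows.group_by { |row| columns.map { |c| row[c] } }
      .values
      .map { |group| group.map(&:id).min }
      .sort
end

keep_by_email_and_name = min_ids(users, :email, :name)
keep_by_email_only     = min_ids(users, :email)

# Rows that "id NOT IN (ids)" would select for deletion.
delete_by_email_and_name = users.map(&:id) - keep_by_email_and_name
delete_by_email_only     = users.map(&:id) - keep_by_email_only

puts "email+name: keep #{keep_by_email_and_name}, delete #{delete_by_email_and_name}"
puts "email only: keep #{keep_by_email_only}, delete #{delete_by_email_only}"
```

Grouping on email and name keeps 15, 16, 17, 18, 19 and deletes only 20; grouping on email alone keeps 15, 17, 18 and deletes 16, 19, 20.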



Answer 3:


Similar to @Aditya Sanghi's answer, but this approach is more performant because it selects only the duplicates instead of loading every Model object into memory and iterating over all of them.

# returns only duplicates in the form of [[name1, year1, trim1], [name2, year2, trim2],...]
duplicate_row_values = Model.select('name, year, trim, count(*)').group('name, year, trim').having('count(*) > 1').pluck(:name, :year, :trim)

# load the duplicates, order them however you want, and then destroy all but the first
duplicate_row_values.each do |name, year, trim|
  Model.where(name: name, year: year, trim: trim).order(id: :desc)[1..-1].map(&:destroy)
end
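
The same selection logic can be checked in memory first. This is a rough plain-Ruby sketch under assumptions: `Row` is a hypothetical Struct standing in for Model records, and `tally` requires Ruby 2.7 or later. It mirrors the HAVING count(*) > 1 filter and the order(id: :desc)[1..-1] slice, collecting the ids that would be destroyed.

```ruby
# Hypothetical stand-in for Model records (an assumption for illustration).
Row = Struct.new(:id, :name, :year, :trim)

rows = [
  Row.new(1, "Civic",  2012, "LX"),
  Row.new(2, "Civic",  2012, "LX"),
  Row.new(3, "Civic",  2013, "LX"),
  Row.new(4, "Civic",  2012, "LX"),
  Row.new(5, "Accord", 2012, "EX"),
  Row.new(6, "Accord", 2012, "EX")
]

# Equivalent of GROUP BY name, year, trim HAVING count(*) > 1.
duplicate_keys = rows.map { |r| [r.name, r.year, r.trim] }
                     .tally
                     .select { |_key, count| count > 1 }
                     .keys

to_destroy = duplicate_keys.flat_map do |name, year, trim|
  rows.select { |r| r.name == name && r.year == year && r.trim == trim }
      .sort_by { |r| -r.id } # same ordering as order(id: :desc)
      .drop(1)               # same as [1..-1]: skip the newest row
      .map(&:id)
end

puts "would destroy ids: #{to_destroy.sort.inspect}"
```

Because the rows are ordered by descending id, the newest record of each group (ids 4 and 6 here) survives; the unique row with id 3 is never loaded at all.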

Also, if you truly don't want duplicate data in this table, you probably want to add a multi-column unique index to the table, something along the lines of:

add_index :models, [:name, :year, :trim], unique: true, name: 'index_unique_models' 



Answer 4:


You could try the following (based on the previous answers):

ids = Model.group('name, year, trim').pluck('MIN(id)')

to get all valid records. And then:

Model.where.not(id: ids).destroy_all

to remove the unneeded records. And certainly, you can make a migration that adds a unique index for the three columns so this is enforced at the DB level:

add_index :models, [:name, :year, :trim], unique: true



Answer 5:


To run it in a migration, I ended up with the following (based on the answer above by @aditya-sanghi):

class AddUniqueIndexToXYZ < ActiveRecord::Migration
  def change
    # delete duplicates
    dedupe(XYZ, 'name', 'type')

    add_index :xyz, [:name, :type], unique: true
  end

  def dedupe(model, *key_attrs)
    model.select(key_attrs).group(key_attrs).having('count(*) > 1').each do |duplicates|
      # splat the keys so slice returns only the grouping columns;
      # an empty hash here would make where() match every row in the table
      dup_rows = model.where(duplicates.attributes.slice(*key_attrs)).to_a
      # keep the first row of each group
      dup_rows.shift

      dup_rows.each(&:destroy) # the remaining rows are duplicates
    end
  end
end



Answer 6:


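You can try this SQL query (PostgreSQL syntax) to remove all duplicate records except the latest one, that is, the one with the highest id. It is written against the models table from the question:

DELETE FROM models USING models newer WHERE (models.name = newer.name AND models.year = newer.year AND models.trim = newer.trim AND models.id < newer.id);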
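
To make the self-join condition concrete, here is a plain-Ruby sketch of what the WHERE clause selects: a row is deleted when some other row has the same name, year and trim and a larger id. `Row`, `sql_eq` and `deleted_ids` are hypothetical names introduced for illustration. The sketch also mimics the SQL rule that `=` is never true when either side is NULL, which means rows with a NULL in a compared column are left alone by this query.

```ruby
# Hypothetical stand-in for table rows (an assumption for illustration).
Row = Struct.new(:id, :name, :year, :trim)

# SQL "=" does not evaluate to true when either operand is NULL; mimic that for nil.
def sql_eq(a, b)
  !a.nil? && !b.nil? && a == b
end

# Ids of rows for which a matching row with a larger id exists.
def deleted_ids(rows)
  matched = rows.select do |row|
    rows.any? do |other|
      sql_eq(row.name, other.name) &&
        sql_eq(row.year, other.year) &&
        sql_eq(row.trim, other.trim) &&
        row.id < other.id
    end
  end
  matched.map(&:id)
end

rows = [
  Row.new(1, "Civic", 2012, "LX"),
  Row.new(2, "Civic", 2012, "LX"),
  Row.new(3, "Civic", 2012, "LX"),
  Row.new(4, "Civic", 2013, nil),
  Row.new(5, "Civic", 2013, nil)
]

deleted = deleted_ids(rows)
puts "deleted ids: #{deleted.inspect}"
```

Here ids 1 and 2 are deleted and id 3 is kept as the latest copy, while ids 4 and 5 both survive because their trim is nil. If NULLs can occur in the compared columns, they need separate handling.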


Source: https://stackoverflow.com/questions/14124212/remove-duplicate-records-based-on-multiple-columns
