Question
I have a big array of random strings that needs to be made unique as fast as possible.
Through some benchmarking I found out that Ruby's uniq is quite slow:
require 'digest'
require 'benchmark'

# make a nice random array of strings
list = (1..100000).to_a.map(&:to_s).map { |e| Digest::SHA256.hexdigest(e) }
list += list
list.shuffle!
# Deduplicate by inserting each string as a hash key, then returning the keys.
def hash_uniq(a)
  a_hash = {}
  a.each do |v|
    a_hash[v] = nil
  end
  a_hash.keys
end
Benchmark.bm do |x|
  x.report(:uniq)      { 100.times { list.uniq } }
  x.report(:hash_uniq) { 100.times { hash_uniq(list) } }
end
Gist -> https://gist.github.com/stillhart/20aa9a1b2eeb0cff4cf5
The results are quite interesting. Could it be that Ruby's uniq is really that slow?
user system total real
uniq 23.750000 0.040000 23.790000 ( 23.823770)
hash_uniq 18.560000 0.020000 18.580000 ( 18.591803)
Now my questions:
Are there any faster ways to make an array unique?
Am I doing something wrong?
Is there something wrong in the Array#uniq method?
I am using ruby 2.2.3p173 (2015-08-18 revision 51636) [x86_64-linux]
Answer 1:
String parsing operations on large data sets are certainly not where Ruby shines. If this is business critical, you might want to write an extension in something like C or Go, or let another application handle this before passing it to your Ruby application.
That said, there seems to be something strange with your benchmark. Running the same code on my MacBook Pro using Ruby 2.2.3 renders the following result:
user system total real
uniq 10.300000 0.110000 10.410000 ( 10.412513)
hash_uniq 11.660000 0.210000 11.870000 ( 11.901917)
This suggests that uniq is slightly faster.
If possible, you should always try to work with the right collection types. If your collection is truly unique, then use a Set. It features a better memory profile and the faster lookup speed of a Hash, while retaining some of the Array intuition.
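As a quick illustration (a minimal sketch of my own, not from the original answer; the seen variable is made up for this example), this is how a Set keeps a collection unique as it is built:

require 'set'

seen = Set.new
# Set#<< silently ignores duplicates, so the collection stays unique as it grows.
seen << "aaa"
seen << "bbb"
seen << "aaa"        # duplicate, not added again

seen.size            # => 2
seen.include?("aaa") # => true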
If your data is already in an Array, however, this might not be a good tradeoff, as insertion into a Set is rather slow as well, as you can see here:
user system total real
uniq 11.040000 0.060000 11.100000 ( 11.102644)
hash_uniq 12.070000 0.230000 12.300000 ( 12.319356)
set_insertion 12.090000 0.200000 12.290000 ( 12.294562)
Where I added the following benchmark (with require 'set' added at the top of the script):
x.report(:set_insertion) { 100.times { Set.new(list) } }
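For completeness, a minimal sketch of my own (reusing the list variable from the question) showing how to get a plain Array back out of the Set when an Array is the required end result:

require 'set'

# Set#to_a returns the unique elements; on MRI, Set is backed by an
# insertion-ordered Hash, so the result should match list.uniq.
unique_via_set = Set.new(list).to_a
unique_via_set == list.uniq   # => true on MRI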
Source: https://stackoverflow.com/questions/33276968/what-is-the-fastest-way-to-make-a-uniq-array