Question
I have a big array of random strings that needs to be made unique as fast as possible.
Through some benchmarking I found out that Ruby's uniq is quite slow:
require 'digest'
require 'benchmark'

# make a nice random array of strings
list = (1..100000).to_a.map(&:to_s).map { |e| Digest::SHA256.hexdigest(e) }
list += list
list.shuffle!
# Deduplicate by inserting each string as a hash key, then returning the keys.
def hash_uniq(a)
  a_hash = {}
  a.each do |v|
    a_hash[v] = nil
  end
  a_hash.keys
end
Benchmark.bm do |x|
  x.report(:uniq)      { 100.times { list.uniq } }
  x.report(:hash_uniq) { 100.times { hash_uniq(list) } }
end
Gist -> https://gist.github.com/stillhart/20aa9a1b2eeb0cff4cf5
The results are quite interesting. Could it be that Ruby's uniq is really that slow?
user system total real
uniq 23.750000 0.040000 23.790000 ( 23.823770)
hash_uniq 18.560000 0.020000 18.580000 ( 18.591803)
Now my questions:
Are there any faster ways to make an array unique?
Am I doing something wrong?
Is there something wrong in the Array#uniq method?
I am using ruby 2.2.3p173 (2015-08-18 revision 51636) [x86_64-linux]
Answer 1:
String parsing operations on large data sets are certainly not where Ruby shines. If this is business critical, you might want to write an extension in something like C or Go, or let another application handle this before passing it to your Ruby application.
That said, there seems to be something strange with your benchmark. Running the same code on my MacBook Pro using Ruby 2.2.3 renders the following result:
user system total real
uniq 10.300000 0.110000 10.410000 ( 10.412513)
hash_uniq 11.660000 0.210000 11.870000 ( 11.901917)
This suggests that uniq is slightly faster.
If possible, you should always try to work with the right collection types. If your collection is truly unique, then use a Set. It features a better memory profile and the faster lookup speed of a Hash, while retaining some of the Array intuition.
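As a quick illustration (a minimal sketch of my own, not from the original answer; the seen variable is made up for this example), this is how a Set keeps a collection unique as it is built:

require 'set'

seen = Set.new
# Set#<< silently ignores duplicates, so the collection stays unique as it grows.
seen << "aaa"
seen << "bbb"
seen << "aaa"        # duplicate, not added again

seen.size            # => 2
seen.include?("aaa") # => true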
If your data is already in an Array, however, this might not be a good tradeoff, as insertion into a Set is rather slow as well, as you can see here:
user system total real
uniq 11.040000 0.060000 11.100000 ( 11.102644)
hash_uniq 12.070000 0.230000 12.300000 ( 12.319356)
set_insertion 12.090000 0.200000 12.290000 ( 12.294562)
Where I added the following benchmark (with require 'set' added at the top of the script):
x.report(:set_insertion) { 100.times { Set.new(list) } }
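For completeness, a minimal sketch of my own (reusing the list variable from the question) showing how to get a plain Array back out of the Set when an Array is the required end result:

require 'set'

# Set#to_a returns the unique elements; on MRI, Set is backed by an
# insertion-ordered Hash, so the result should match list.uniq.
unique_via_set = Set.new(list).to_a
unique_via_set == list.uniq   # => true on MRI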
Source: https://stackoverflow.com/questions/33276968/what-is-the-fastest-way-to-make-a-uniq-array