Question
I need to count the elements in the intersection of two big arrays of strings, and I need to do it very fast.
I am using the following code:
arr1[i].Intersect(arr2[j]).Count()
For CPU time, the VS Profiler indicates:
- 85.1% in System.Linq.Enumerable.Count()
- 0.3% in System.Linq.Enumerable.Intersect()
Unfortunately, it might take hours to do all the work.
How can I do it faster?
Answer 1:
You can use a HashSet built from arr2:
HashSet<string> arr2Set = new HashSet<string>(arr2);
arr1.Where(x=>arr2Set.Contains(x)).Count();
HashSet's Contains method executes quickly because it uses a hash-based lookup.
Not counting the conversion from arr2 to arr2Set, this should be O(n).
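One detail that may matter here: the Where/Contains form counts every element of arr1 that occurs in arr2, so a value repeated in arr1 is counted each time, whereas Intersect yields distinct values only. A minimal sketch of the difference, assuming a hypothetical helper class ContainsCountSketch that is not part of the original answer:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only.
public static class ContainsCountSketch
{
    // Counts every element of arr1 that also occurs in arr2.
    // A value repeated in arr1 is counted once per occurrence.
    public static int CountWithDuplicates(string[] arr1, string[] arr2)
    {
        var arr2Set = new HashSet<string>(arr2);
        return arr1.Count(x => arr2Set.Contains(x));
    }

    // Counts distinct common values, which matches arr1.Intersect(arr2).Count().
    public static int CountDistinct(string[] arr1, string[] arr2)
    {
        var arr2Set = new HashSet<string>(arr2);
        return arr1.Distinct().Count(x => arr2Set.Contains(x));
    }
}
```

If the arrays are known to contain no duplicates, both methods return the same number and the first one avoids the extra Distinct pass.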
Answer 2:
I suspect the reason the profiler shows the time being consumed in Count is that this is where the collection is actually enumerated (Intersect is lazily evaluated and does not run until you need the result).
I believe Intersect should have some internal optimizations to make this reasonably fast, but you could try using a HashSet<string> so you are sure the intersect can be made without searching through the inner array for each element:
HashSet<string> set = new HashSet<string>(arr1);
set.IntersectWith(arr2);
int count = set.Count;
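The deferred evaluation described above can be observed directly. The following is a small sketch, assuming a hypothetical wrapper (Tracked) that records how many items have actually been pulled from the first array; the class and member names are illustrative only:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical demo class; not part of the original answer.
public static class DeferredIntersectSketch
{
    // Number of items pulled from the instrumented source so far.
    public static int ItemsRead;

    // Wraps a sequence and records each item as it is actually enumerated.
    private static IEnumerable<string> Tracked(IEnumerable<string> source)
    {
        foreach (var item in source)
        {
            ItemsRead++;
            yield return item;
        }
    }

    // Returns the items read right after calling Intersect,
    // the items read after calling Count, and the count itself.
    public static (int AfterIntersect, int AfterCount, int Count) Run(string[] arr1, string[] arr2)
    {
        ItemsRead = 0;
        var query = Tracked(arr1).Intersect(arr2); // builds the query; nothing is enumerated yet
        int afterIntersect = ItemsRead;
        int count = query.Count();                 // the enumeration happens here
        return (afterIntersect, ItemsRead, count);
    }
}
```

Calling Run should report zero items read after Intersect and all of arr1 read after Count, which is consistent with the profiler attributing the cost to Count.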
Answer 3:
Hmm, Intersect is probably O(N^2).
To make it faster, quicksort both arrays and then traverse both arrays, counting intersections.
I was too lazy to test how fast that would be, but it should be O(n log n + n).
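The sort-then-traverse idea could be sketched as follows. This is an illustration under assumptions, not a measured implementation: it uses Array.Sort rather than a hand-written quicksort, compares strings ordinally, works on copies so the inputs stay untouched, and counts each common value once. The class name SortedMergeSketch is hypothetical.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class SortedMergeSketch
{
    // Counts distinct strings present in both arrays by sorting copies
    // and walking the two sorted arrays in step.
    public static int CountCommon(string[] arr1, string[] arr2)
    {
        var a = (string[])arr1.Clone();
        var b = (string[])arr2.Clone();
        // Ordinal ordering is an assumption; it avoids culture-sensitive comparisons.
        Array.Sort(a, StringComparer.Ordinal);
        Array.Sort(b, StringComparer.Ordinal);

        int i = 0, j = 0, count = 0;
        while (i < a.Length && j < b.Length)
        {
            int cmp = string.CompareOrdinal(a[i], b[j]);
            if (cmp < 0) i++;          // a[i] is smaller: advance in a
            else if (cmp > 0) j++;     // b[j] is smaller: advance in b
            else
            {
                count++;
                string match = a[i];
                // Skip repeats of the matched value in both arrays.
                while (i < a.Length && string.CompareOrdinal(a[i], match) == 0) i++;
                while (j < b.Length && string.CompareOrdinal(b[j], match) == 0) j++;
            }
        }
        return count;
    }
}
```

The two sorts dominate the cost, and the traversal itself is linear in the combined length of the arrays.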
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

namespace Test
{
    class Program
    {
        static void Main(string[] args)
        {
            const int arrsize = 1000000;
            Random rnd = new Random(42);
            string[] arr1 = new string[arrsize];
            string[] arr2 = new string[arrsize];
            // Fill both arrays with random numeric strings.
            for (int i = 0; i < arrsize; i++)
            {
                arr1[i] = rnd.Next().ToString();
                arr2[i] = rnd.Next().ToString();
            }
            {
                // LINQ Intersect directly on the arrays.
                var stamp = Stopwatch.GetTimestamp();
                arr1.Intersect(arr2).Count();
                Console.WriteLine("array " + (Stopwatch.GetTimestamp() - stamp));
            }
            {
                // HashSet built before timing starts.
                HashSet<string> set = new HashSet<string>(arr1);
                var stamp = Stopwatch.GetTimestamp();
                set.IntersectWith(arr2);
                int count = set.Count;
                Console.WriteLine("HashSet " + (Stopwatch.GetTimestamp() - stamp));
            }
            {
                // HashSet construction included in the timing.
                var stamp = Stopwatch.GetTimestamp();
                HashSet<string> set = new HashSet<string>(arr1);
                set.IntersectWith(arr2);
                int count = set.Count;
                Console.WriteLine("HashSet + new " + (Stopwatch.GetTimestamp() - stamp));
            }
            {
                // SortedSet construction included in the timing.
                var stamp = Stopwatch.GetTimestamp();
                SortedSet<string> set = new SortedSet<string>(arr1);
                set.IntersectWith(arr2);
                int count = set.Count;
                Console.WriteLine("SortedSet + new " + (Stopwatch.GetTimestamp() - stamp));
            }
            {
                // SortedSet built before timing starts.
                SortedSet<string> set = new SortedSet<string>(arr1);
                var stamp = Stopwatch.GetTimestamp();
                set.IntersectWith(arr2);
                int count = set.Count;
                Console.WriteLine("SortedSet without new " + (Stopwatch.GetTimestamp() - stamp));
            }
        }
    }
}
Results (elapsed Stopwatch ticks):
array                      914,637
HashSet                    816,119
HashSet + new            1,150,978
SortedSet + new         16,173,836
SortedSet without new    7,946,709
So it seems that the best way is to keep a ready-made hash set around.
Answer 4:
When you are working with sets, your complexity will be O((n log n) * (m log m)) or so.
I think the following should be faster, but I'm not sure whether it is O((n log n) + (m log m)).
One possibility would be:
var Set1 = arr1[i].Distinct().ToArray(); // only necessary if arr1 or arr2 may contain duplicates
var Set2 = arr2[j].Distinct().ToArray();
// Inclusion-exclusion: |A ∩ B| = |A| + |B| - |A ∪ B|. Concat (not Append) joins two sequences.
nCount = Set1.Length + Set2.Length - Set1.Concat(Set2).Distinct().Count();
Answer 5:
Build a HashSet from the smaller array and then loop through the bigger one, incrementing a counter whenever the item exists in the HashSet.
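A minimal sketch of this approach, with one variation that is an assumption rather than part of the answer: it calls Remove instead of Contains, so a value repeated in the bigger array is counted only once, matching what Intersect(...).Count() would return. The class name SmallerSetSketch is hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only.
public static class SmallerSetSketch
{
    // Counts distinct strings present in both arrays, hashing only the smaller one.
    public static int CountCommon(string[] arr1, string[] arr2)
    {
        string[] smaller = arr1.Length <= arr2.Length ? arr1 : arr2;
        string[] bigger = arr1.Length <= arr2.Length ? arr2 : arr1;

        var remaining = new HashSet<string>(smaller);
        int count = 0;
        foreach (string item in bigger)
        {
            // Remove returns true only the first time a common value is seen,
            // so repeats in the bigger array are not counted again.
            if (remaining.Remove(item)) count++;
        }
        return count;
    }
}
```

Hashing the smaller array keeps the set's memory footprint down, while the loop over the bigger array needs no allocation beyond the counter.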
Source: https://stackoverflow.com/questions/13900514/fast-count-intersection-of-two-string-arrays