Faster way to count number of sets an item appears in?

眉间皱痕 提交于 2019-12-07 08:16:55

问题


I've got a list of bookmarks. Each bookmark has a list of keywords (stored as a HashSet). I also have a set of all possible keywords ("universe").

I want to find the keyword that appears in the most bookmarks.

I have 1356 bookmarks with a combined total of 698,539 keywords, with 187,358 unique.

If I iterate through every keyword in the universe and count the number of bookmarks it appears in, I'm doing 254,057,448 checks. This takes 35 seconds on my machine.

The algorithm is pretty simple:

var biggest = universe.MaxBy(kw => bookmarks.Count(bm => bm.Keywords.Contains(kw)));

Using Jon Skeet's MaxBy.

I'm not sure it's possible to speed this up much, but is there anything I can do? Perhaps parallelize it somehow?


dtb's solution takes under 200 ms to both build the universe and find the biggest element. So simple.

var freq = new FreqDict();
foreach(var bm in bookmarks) {
    freq.Add(bm.Keywords);
}
var biggest2 = freq.MaxBy(kvp => kvp.Value);

FreqDict is just a little class I made built on top of a Dictionary<string,int>.


回答1:


I don't have your sample data nor have I done any benchmarking, but I'll take a stab. One problem that could be improved upon is that most of the bm.Keywords.Contains(kw) checks are misses, and I think those can be avoided. The most constraining is the set of keywords any one given bookmark has (ie: it will typically be much smaller than universe) so we should start in that direction instead of the other way.

I'm thinking something along these lines. The memory requirement is much higher and since I haven't benchmarked anything, it could be slower, or not helpful, but I'll just delete my answer if it doesn't work out for you.

Dictionary<string, int> keywordCounts = new Dictionary<string, int>(universe.Length);
foreach (var keyword in universe)
{
    keywordCounts.Add(keyword, 0);
}

foreach (var bookmark in bookmarks)
{
    foreach (var keyword in bookmark.Keywords)
    {
        keywordCounts[keyword] += 1;
    }
}

var mostCommonKeyword = keywordCounts.MaxBy(x => x.Value).Key;



回答2:


You can get all keywords, group them, and get the biggest group. This uses more memory, but should be faster.

I tried this, and in my test it was about 80 times faster:

string biggest =
  bookmarks
  .SelectMany(m => m.Keywords)
  .GroupBy(k => k)
  .OrderByDescending(g => g.Count())
  .First()
  .Key;

Test run:

1536 bookmarks
153600 keywords
74245 unique keywords

Original:
12098 ms.
biggest = "18541"

New:
148 ms.
biggest = "18541"



回答3:


You don't need to iterate through whole universe. Idea is to create a lookup and track max.

    public Keyword GetMaxKeyword(IEnumerable<Bookmark> bookmarks)
    {
        int max = 0;
        Keyword maxkw = null;

        Dictionary<Keyword, int> lookup = new Dictionary<Keyword, int>();

        foreach (var item in bookmarks)
        {
            foreach (var kw in item.Keywords)
            {
                int val = 1;

                if (lookup.ContainsKey(kw))
                {
                    val = ++lookup[kw];
                }
                else
                {
                    lookup.Add(kw, 1);
                }

                if (max < val)
                {
                    max = val;
                    maxkw = kw;
                }
            }
        }

        return maxkw;
    }



回答4:


50ms in python:

>>> import random

>>> universe = set()
>>> bookmarks = []
>>> for i in range(1356):
...     bookmark = []
...     for j in range(698539//1356):
...         key_word = random.randint(1000, 1000000000)
...         universe.add(key_word)
...         bookmark.append(key_word)
...     bookmarks.append(bookmark)
...
>>> key_word_count = {}
>>> for bookmark in bookmarks:
...     for key_word in bookmark:
...         key_word_count[key_word] = key_word_count.get(key_word, 0) + 1
...

>>> print max(key_word_count, key=key_word_count.__getitem__)
408530590

>>> print key_word_count[408530590]
3
>>>


来源:https://stackoverflow.com/questions/11920344/faster-way-to-count-number-of-sets-an-item-appears-in

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!