Computing the mode (most frequent element) of a set in linear time?

Posted by 点点圈 on 2020-01-01 09:53:34

Question


In the book "The Algorithm Design Manual" by Skiena, computing the mode (most frequent element) of a set is said to have an Ω(n log n) lower bound (this puzzles me), and the book also says (correctly, I guess) that no faster worst-case algorithm exists for computing the mode. I'm only puzzled by the lower bound being Ω(n log n).

See the page of the book on Google Books

But surely this can in some cases be computed in linear time (best case), e.g. by Java code like that below (which finds the most frequent character in a string), the "trick" being to count occurrences using a hash table. This seems obvious.

So, what am I missing in my understanding of the problem?

EDIT: (Mystery solved) As StriplingWarrior points out, the lower bound holds if only comparisons are used, i.e. no indexing of memory, see also: http://en.wikipedia.org/wiki/Element_distinctness_problem

import java.util.HashMap;

// Linear time
char computeMode(String input) {
  // initialize currentMode to first char (assumes input is non-empty)
  char[] chars = input.toCharArray();
  char currentMode = chars[0];
  int currentModeCount = 0;
  HashMap<Character, Integer> counts = new HashMap<Character, Integer>();
  for (char character : chars) {
    int count = putget(counts, character); // occurrences so far
    // test whether character should be the new currentMode
    if (count > currentModeCount) {
      currentMode = character;
      currentModeCount = count; // also save the count
    }
  }
  return currentMode;
}

// Constant time
int putget(HashMap<Character, Integer> map, char character) {
  if (!map.containsKey(character)) {
    // if character not seen before, initialize to zero
    map.put(character, 0);
  }
  // increment the stored count and return the new value
  int newValue = map.get(character) + 1;
  map.put(character, newValue);
  return newValue;
}

Answer 1:


The author seems to be basing his logic on the assumption that comparison is the only operation available to you. Using a hash-based data structure gets around this: in most cases it makes comparisons so rarely necessary that each lookup or update takes roughly constant time.

However, if the numbers were hand-picked to always produce hash collisions, you would end up effectively turning your hash set into a list, which would make your algorithm O(n²). As the author points out, simply sorting the values into a list first provides the best guaranteed algorithm, even though in most cases a hash set would be preferable.
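
The sorting approach mentioned above could look something like the following minimal sketch. It uses only comparisons, which is the model the lower bound applies to. The class name `SortMode` and the method name `modeBySorting` are made up for illustration, and the tie-breaking rule (the smallest value wins) is an arbitrary choice, not something taken from the book.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SortMode {
    // Hypothetical helper: returns the most frequent element using comparisons only.
    // Assumes a non-empty list; on ties, the smallest element wins.
    static <T extends Comparable<? super T>> T modeBySorting(List<T> values) {
        if (values.isEmpty()) {
            throw new IllegalArgumentException("empty input");
        }
        List<T> sorted = new ArrayList<>(values); // leave the caller's list untouched
        Collections.sort(sorted);                 // equal elements become adjacent

        T mode = sorted.get(0);
        int bestRun = 0; // length of the longest run of equal elements seen so far
        int run = 0;     // length of the current run
        for (int i = 0; i < sorted.size(); i++) {
            boolean sameAsPrevious =
                i > 0 && sorted.get(i).compareTo(sorted.get(i - 1)) == 0;
            run = sameAsPrevious ? run + 1 : 1;
            if (run > bestRun) {
                bestRun = run;
                mode = sorted.get(i);
            }
        }
        return mode;
    }
}
```

After the sort, a single linear pass finds the longest run, so the sort dominates the running time.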




Answer 2:


So, what am I missing in my understanding of the problem?

In many particular cases, an array or hash table suffices. In "the general case" it does not, because hash table access is not always constant time.

In order to guarantee constant time access, you must be able to guarantee that the number of keys that can possibly end up in each bin is bounded by some constant. For characters this is fairly easy, but if the set elements were, say, doubles or strings, it would not be (except in the purely academic sense that there are, e.g., a finite number of double values).
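
For the character case, one way to make the bound concrete is to skip hashing and index an array directly by character value, as in the sketch below. The names `ArrayMode` and `modeByCounting` are hypothetical, and the sketch assumes a non-empty input string.

```java
class ArrayMode {
    // Hypothetical helper: counts characters in a table with one slot per
    // possible char value, so every update is a plain array access.
    // Assumes a non-empty string; on ties, the character that reaches the
    // winning count first is returned.
    static char modeByCounting(String input) {
        if (input.isEmpty()) {
            throw new IllegalArgumentException("empty input");
        }
        int[] counts = new int[Character.MAX_VALUE + 1]; // 65536 slots
        char mode = input.charAt(0);
        int best = 0;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            counts[c]++;
            if (counts[c] > best) {
                best = counts[c];
                mode = c;
            }
        }
        return mode;
    }
}
```

This only works because the key space is small and fixed. The same table for doubles or arbitrary strings would be impractically large, which is the point made above.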




Answer 3:


Hash table lookups take constant time on average, i.e., in general, the overall cost of looking up n random keys is O(n). In the worst case, a single lookup can take linear time. Therefore, while a hash table would in general reduce the order of mode calculation to O(n), in the worst case it would increase the order of mode calculation to O(n^2).
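
To illustrate that such worst-case inputs can be constructed deliberately, here is a small sketch that generates distinct Java strings sharing a single hash code. It relies on the documented `String.hashCode()` formula, under which "Aa" and "BB" hash to the same value, and so does any sequence of the same number of these two blocks. The names `CollidingKeys` and `collidingStrings` are made up for illustration. How a particular hash table implementation copes with such keys varies, so this shows only that the keys exist, not how slow any given map becomes.

```java
import java.util.ArrayList;
import java.util.List;

class CollidingKeys {
    // Hypothetical helper: builds all 2^blocks strings made of the two-character
    // blocks "Aa" and "BB". All of them are distinct, yet all of them have the
    // same hashCode() value.
    static List<String> collidingStrings(int blocks) {
        List<String> keys = new ArrayList<>();
        keys.add("");
        for (int i = 0; i < blocks; i++) {
            List<String> next = new ArrayList<>();
            for (String prefix : keys) {
                next.add(prefix + "Aa");
                next.add(prefix + "BB");
            }
            keys = next;
        }
        return keys;
    }
}
```

For example, `collidingStrings(10)` yields 1024 distinct keys that would all land in the same bucket of a table that places keys by `hashCode()` alone.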



Source: https://stackoverflow.com/questions/4168622/computing-the-mode-most-frequent-element-of-a-set-in-linear-time
