I understand that I should not optimize every single spot of my program, so please consider this question "academic".
I have at most 100 strings and an integer n.
Both the standard hash map and a perfect hash function like the one mentioned above suffer from the relatively slow evaluation of the hash function itself. The sketched perfect hash function, for example, makes up to 5 random accesses into an array.
It makes sense to measure or calculate the speed of the hash function and of the string comparisons, assuming the lookup consists of one hash function evaluation, one lookup in a table, and a linear search through a (linked) list of the strings and their indices to resolve hash collisions. In many cases it is better to use a simpler but faster hash function and accept more string comparisons than to use a better but slower hash function and need fewer (standard hash map) or even only one (perfect hash) comparison.
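To make that trade-off concrete, here is a minimal sketch of the "one hash evaluation, one table lookup, linear scan of a short collision list" scheme; the deliberately cheap hash and all names are my own illustration, not code from the text above:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Deliberately cheap hash: one character read plus the length. It collides
// more often than a "good" hash, but each extra collision only costs one
// string comparison in the bucket scan below.
struct SmallStringMap {
    static constexpr std::size_t kBuckets = 128;
    std::vector<std::pair<std::string, int>> buckets[kBuckets];

    static std::size_t hash(const std::string& s) {
        unsigned char first = s.empty() ? 0 : static_cast<unsigned char>(s[0]);
        return (first * 31u + s.size()) % kBuckets;
    }

    void insert(const std::string& key, int value) {
        buckets[hash(key)].emplace_back(key, value);
    }

    // One hash evaluation, one bucket lookup, then a linear scan.
    int find(const std::string& key) const {  // returns -1 when absent
        for (const auto& entry : buckets[hash(key)])
            if (entry.first == key) return entry.second;
        return -1;
    }
};
```

Whether such a cheap hash beats a "smarter" one for a given set of 100 strings is exactly the kind of question the measurement above is meant to answer.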
You will find a discussion of the related topic "switch on string" on my site, along with a bunch of solutions, available as free C/C++ sources, that solve the problem at run time using a common macro-based test bed. I'm also thinking about a precompiler.
Small addition to sehe’s post:
If you use a simple std::map, the net effect is prefix search (because lexicographical string comparison short-circuits at the first character mismatch). The same goes for binary search in a sorted container.
Prefix search can be harnessed much more effectively. The problem with both std::map and naive binary search is that they read the same prefix redundantly for each individual comparison, making the overall search O(m log n), where m is the length of the search string.
This is the reason why a hash map outcompetes these two methods for large sets. However, there is a data structure that performs no redundant prefix comparisons and in fact needs to examine each prefix exactly once: a prefix (search) tree, more commonly known as a trie. Looking up a single string of length m takes O(m), the same asymptotic runtime you get from a hash table with perfect hashing.
Whether a trie or a (direct lookup) hash table with perfect hashing is more efficient for your purpose is a question of profiling.
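For illustration, a minimal trie sketch of my own (made-up names, keys restricted to 7-bit ASCII) that maps each known string to an int and touches each character of the query exactly once:

```cpp
#include <array>
#include <iostream>
#include <memory>
#include <string>

// Each node has one slot per 7-bit ASCII character and an optional value.
struct TrieNode {
    int value = -1;                                  // -1 means "no key ends here"
    std::array<std::unique_ptr<TrieNode>, 128> next;
};

class Trie {
public:
    void insert(const std::string& key, int value) { // keys assumed to be 7-bit ASCII
        TrieNode* node = &root_;
        for (unsigned char c : key) {
            if (!node->next[c]) node->next[c] = std::make_unique<TrieNode>();
            node = node->next[c].get();
        }
        node->value = value;
    }

    // Each character of the query is inspected exactly once: O(m).
    int find(const std::string& key) const {
        const TrieNode* node = &root_;
        for (unsigned char c : key) {
            if (c >= 128 || !node->next[c]) return -1;
            node = node->next[c].get();
        }
        return node->value;
    }

private:
    TrieNode root_;
};

int main() {
    Trie t;
    t.insert("DELL", 42);                 // stand-in keys and values
    t.insert("DE", 7);
    std::cout << t.find("DELL") << '\n';  // 42
    std::cout << t.find("DEL")  << '\n';  // -1 (prefix only, no key ends here)
}
```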
A hashtable[1] is in principle the fastest way.
You could, however, compile a Perfect Hash Function, given that you know the full domain ahead of time.
With a perfect hash, there need not be a collision, so you can store the hash table in a linear array!
With proper tweaking you can then do the whole lookup with a single hash evaluation and one array access.
The 'old-school' tool for generating Perfect Hash Functions would be gperf(1). Wikipedia lists more resources on the subject.
Because of all the debate I ran a demo:
I downloaded the NASDAQ ticker symbols, took 100 random samples from that set, and applied gperf as follows:
```
gperf -e ' \015' -L C++ -7 -C -E -k '*,1,$' -m 100 selection > perfhash.cpp
```

Results in a hash value MAX_HASH_VALUE of 157 and a direct string lookup table with as many items. Here's just the hash function, for demonstration purposes:

```cpp
inline unsigned int Perfect_Hash::hash (register const char *str, register unsigned int len)
{
  static const unsigned char asso_values[] =
    {
      156, 156, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156,  64,  40,   1,  62,   1,
       41,  18,  47,   0,   1,  11,  10,  57,  21,   7,
       14,  13,  24,   3,  33,  89,  11,   0,  19,   5,
       12,   0, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156, 156, 156, 156
    };
  register int hval = len;

  switch (hval)
    {
      default:
        hval += asso_values[(unsigned char)str[4]];
      /*FALLTHROUGH*/
      case 4:
        hval += asso_values[(unsigned char)str[3]];
      /*FALLTHROUGH*/
      case 3:
        hval += asso_values[(unsigned char)str[2]+1];
      /*FALLTHROUGH*/
      case 2:
        hval += asso_values[(unsigned char)str[1]];
      /*FALLTHROUGH*/
      case 1:
        hval += asso_values[(unsigned char)str[0]];
        break;
    }
  return hval;
}
```

It really doesn't get much more efficient. Do have a look at the full source at github: https://gist.github.com/sehe/5433535
Mind you, this is a perfect hash, too, so there will be no collisions.
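For completeness, a hedged sketch of how the generated code is typically called; in_word_set is the lookup routine gperf emits (with -C it returns a pointer to the stored keyword, or a null pointer when the string is not in the set), the exact length-parameter type varies between gperf versions, and the wrapper function below is my own:

```cpp
#include <cstring>
// The class declaration generated by gperf (Perfect_Hash in the output above)
// is assumed to be visible here, e.g. by including the generated source.

bool is_known_symbol(const char *s) {
    // Because the hash is perfect, at most one string comparison happens inside.
    return Perfect_Hash::in_word_set(s, std::strlen(s)) != 0;
}
```

To map the string to an integer rather than just testing membership, gperf's -t option lets each keyword carry a user-defined struct (the keyword plus your value), and in_word_set then returns a pointer to that struct.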
Q. [...] it's obviously "DELL". Such a lookup must be significantly faster than a "hashmap lookup".
A: If you use a simple std::map, the net effect is prefix search (because lexicographical string comparison short-circuits at the first character mismatch). The same goes for binary search in a sorted container.
[1] PS. For 100 strings, a sorted array of strings with std::search or std::lower_bound would potentially be as fast or faster due to the improved locality of reference. Consult your profiling results to see whether this applies.
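A hedged sketch of that footnote's std::lower_bound alternative, with made-up names: keep the (string, value) pairs sorted in one contiguous array and binary-search it.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, int>;

// 'table' must be sorted by Entry::first (e.g. once with std::sort at startup).
int find(const std::vector<Entry>& table, const std::string& key) {
    auto it = std::lower_bound(table.begin(), table.end(), key,
                               [](const Entry& e, const std::string& k) {
                                   return e.first < k;
                               });
    return (it != table.end() && it->first == key) ? it->second : -1;  // -1 = absent
}
```

The contiguous storage is what gives the locality-of-reference advantage over node-based containers for such a small set.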
If the strings are known at compile time you can just use an enumeration:
```cpp
enum
{
    Str1,
    Str2
};

const char *Strings[] = {
    "Str1",
    "Str2"
};
```
Using some macro tricks you can remove the redundancy of re-creating the table in two locations (using file inclusion and #undef).
Lookup is then as fast as indexing an array:

```cpp
const char *string = Strings[Str1]; // set to "Str1"
```
This would have optimal lookup time and locality of reference.
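One way to realize the macro trick mentioned above is an "X macro"; the sketch below uses a list macro rather than file inclusion, and all names are illustrative:

```cpp
// The string list lives in one place and is expanded twice, once for the
// enum and once for the string table, so the two can never drift apart.
#define STRING_LIST(X) \
    X(Str1)            \
    X(Str2)            \
    X(Str3)

#define AS_ENUM(name)   name,
#define AS_STRING(name) #name,

enum StringId { STRING_LIST(AS_ENUM) StringCount };  // Str1, Str2, Str3, StringCount

const char *Strings[] = { STRING_LIST(AS_STRING) };   // "Str1", "Str2", "Str3"

#undef AS_ENUM
#undef AS_STRING

// Usage: Strings[Str2] yields "Str2"; StringCount is the number of entries.
```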
Well, you could store the strings in a binary tree and search there.
While this has O(log n) theoretical performance, it may be a lot faster in practice if you only have a few keys that are really long and that already differ in the first few characters. In other words, it wins when comparing keys is cheaper than computing the hash function.
Furthermore, there are CPU caching effects and such that may (or may not) be beneficial.
However, with a fairly cheap hash function, the hash table will be hard to beat.
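For reference, the standard-library form of this answer is std::map, which is a balanced binary search tree; a minimal sketch with stand-in keys:

```cpp
#include <map>
#include <string>

std::map<std::string, int> table = { {"DELL", 0}, {"MSFT", 1} };  // stand-in entries

int lookup(const std::string& key) {
    auto it = table.find(key);  // O(log n) comparisons, each stopping at the
                                // first mismatching character
    return it == table.end() ? -1 : it->second;
}
```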
(Yet) Another small addition to sehe's answer:
Apart from Perfect Hash Functions, there is also the Minimal Perfect Hash Function, and correspondingly the C Minimal Perfect Hashing (CMPH) library. It is almost the same idea as gperf, except that:
gperf is a bit different, since it was conceived to create very fast perfect hash functions for small sets of keys and CMPH Library was conceived to create minimal perfect hash functions for very large sets of keys
The CMPH Library encapsulates the newest and more efficient algorithms in an easy-to-use, production-quality, fast API. The library was designed to work with big entries that cannot fit in the main memory. It has been used successfully for constructing minimal perfect hash functions for sets with more than 100 million of keys, and we intend to expand this number to the order of billion of keys
source: http://cmph.sourceforge.net/
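A hedged sketch of CMPH in use, adapted from the usage example in the CMPH documentation; the key set and algorithm choice are placeholders, and you should check the exact signatures for the library version you have:

```cpp
#include <cmph.h>
#include <cstdio>
#include <cstring>

int main() {
    const char *keys[] = { "AAPL", "DELL", "MSFT" };  // stand-in key set
    unsigned int nkeys = 3;

    // Build a minimal perfect hash function over the key set.
    cmph_io_adapter_t *source = cmph_io_vector_adapter(const_cast<char **>(keys), nkeys);
    cmph_config_t *config = cmph_config_new(source);
    cmph_config_set_algo(config, CMPH_BDZ);           // one of the supported algorithms
    cmph_t *mphf = cmph_new(config);
    cmph_config_destroy(config);

    // Each key maps to a distinct id in [0, nkeys); use it to index a value table.
    for (unsigned int i = 0; i < nkeys; ++i) {
        unsigned int id = cmph_search(mphf, keys[i], (cmph_uint32)std::strlen(keys[i]));
        std::printf("%s -> %u\n", keys[i], id);
    }

    cmph_destroy(mphf);
    cmph_io_vector_adapter_destroy(source);
    return 0;
}
```

For 100 strings, gperf is almost certainly the more convenient choice; CMPH only pays off for key sets far too large to handle with gperf.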