Most efficient way to match (a few) strings in C?

Question

Our system needs to accept user input from a terminal and match against a few known keyword strings (maybe 10).

We don't have the space/computrons to do regexp etc., code needs to be tiny & quick.

Now, the nasty way to do this is:

   // str is null-terminated, assume we know it's safe/sane here
   if(!strncmp(str,"hello",5))
   {
      do_hello();
   }
   else if(!strncmp(str,"world",5))
   {
      do_world();
   }
   else
   {
      meh(); // Wasn't a match
   }

So, after a bit of googling & reading, I'm becoming convinced that a nicer way is to pre-compute the hash of each keyword as an int, and then just use a case statement:

// Assume hash() stops at the terminating NUL
switch(hash(str))
{
   case HASH_OF_HELLO:
      do_hello();
      break;

   case HASH_OF_WORLD:
      do_world();
      break;

   default:
      meh();
      break;
}

We can compute the *HASH_OF_match* at compile time. This seems potentially a faster / more elegant way to pick a string from a relatively small set.
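
One way the compile-time constants could be produced in plain C is with a small macro chain, since case labels need integer constant expressions. This is only a sketch: HASH_STEP, HASH5 and classify() are hypothetical names, the macro handles exactly five characters, and it mirrors the hash function shown in the footnote below (starting value 0).

```c
/* One step of the hash from the question: h * 33 ^ c. */
#define HASH_STEP(h, c) ((((h) << 5) + (h)) ^ (unsigned int)(c))

/* Hypothetical helper: hash of exactly five characters, starting from 0. */
#define HASH5(a, b, c, d, e) \
    HASH_STEP(HASH_STEP(HASH_STEP(HASH_STEP(HASH_STEP(0u, a), b), c), d), e)

#define HASH_OF_HELLO HASH5('h', 'e', 'l', 'l', 'o')
#define HASH_OF_WORLD HASH5('w', 'o', 'r', 'l', 'd')

/* Runtime hash, same algorithm as get_hash() in the question. */
unsigned int get_hash(const char *s)
{
    unsigned int hash = 0;
    int c;

    while ((c = *s++))
        hash = ((hash << 5) + hash) ^ (unsigned int)c;

    return hash;
}

/* Returns 1 for "hello", 2 for "world", 0 otherwise.
   Other inputs that happen to collide with a keyword hash
   are NOT rejected here. */
int classify(const char *str)
{
    switch (get_hash(str)) {
    case HASH_OF_HELLO: return 1;
    case HASH_OF_WORLD: return 2;
    default:            return 0;
    }
}
```

A separate macro would be needed for each keyword length, which is the main drawback of doing this by hand.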

So - does this seem reasonable? / Is there a glaring problem with doing this? / Anyone got a more elegant way of doing it?

As a footnote, this is the nicest-looking hash algorithm I've seen this afternoon ;). It's credited to Dan Bernstein, and it looks up to the job at hand.

unsigned int
get_hash(const char* s)
{
    unsigned int hash = 0;
    int c;

    while((c = *s++))
    {
        // hash = hash * 33 ^ c 
        hash = ((hash << 5) + hash) ^ c;
    }

    return hash;
}

Answer 1:

The problem with hashing is that an arbitrary string entered by the user may generate the same hash as one of your matches and you'll execute the wrong stuff. For a search set as small as 10 I'd just stick to the if-else approach. Or use a string array and function pointer array (assuming all functions have the same signature) to select the function to execute.

char const *matches[10] = {"first", "second", ..., "tenth"};
void (*fn[10])(void) = {&do_first, &do_second, ..., &do_tenth};

for( int i = 0; i < 10; ++i ) {
  if( strcmp( str, matches[i] ) == 0 ) {
    (*fn[i])();
    break; // keywords are unique, so stop at the first match
  }
}
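
For reference, here is a compilable sketch of the same idea with only two keywords. It keeps each string next to its function in one struct array rather than two parallel arrays, which is a design choice of this sketch and not part of the answer. The handlers, last_called and dispatch() are hypothetical names used for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical handlers; they only record which one ran. */
int last_called = 0;
void do_hello(void) { last_called = 1; }
void do_world(void) { last_called = 2; }

struct command {
    const char *name;
    void (*handler)(void);
};

const struct command commands[] = {
    { "hello", do_hello },
    { "world", do_world },
};

/* Returns 1 if str matched a keyword and its handler ran, 0 otherwise. */
int dispatch(const char *str)
{
    size_t i;

    for (i = 0; i < sizeof commands / sizeof commands[0]; ++i) {
        if (strcmp(str, commands[i].name) == 0) {
            commands[i].handler();
            return 1;   /* stop at the first match */
        }
    }
    return 0;
}
```

Because strcmp compares the whole string, an input such as "worlds" does not match "world".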

Answer 2:

What about just using a nested switch statement on the last character, as in the Boyer-Moore string search algorithm?

http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
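
A rough sketch of what such a nested switch might look like, assuming three keywords ("hello", "world" and a made-up "echo") so that two of them share a last character. The enum and match_keyword() are hypothetical, and a final strcmp is used to confirm each candidate instead of switching on every remaining character.

```c
#include <stddef.h>
#include <string.h>

enum keyword { KW_NONE, KW_HELLO, KW_WORLD, KW_ECHO };

/* Narrow the candidates by the last character (and, where needed,
   the one before it), then confirm with a full comparison. */
enum keyword match_keyword(const char *str)
{
    size_t len = strlen(str);

    if (len == 0)
        return KW_NONE;

    switch (str[len - 1]) {
    case 'o':
        /* Two keywords end in 'o'; look one character further back. */
        if (len < 2)
            return KW_NONE;
        switch (str[len - 2]) {
        case 'l': return strcmp(str, "hello") == 0 ? KW_HELLO : KW_NONE;
        case 'h': return strcmp(str, "echo") == 0 ? KW_ECHO : KW_NONE;
        default:  return KW_NONE;
        }
    case 'd':
        return strcmp(str, "world") == 0 ? KW_WORLD : KW_NONE;
    default:
        return KW_NONE;
    }
}
```

Note that finding the last character needs a strlen call first, so whether this beats a plain chain of comparisons would have to be measured.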

Answer 3:

Sounds like you want to use gperf.

Answer 4:

Hashing and hash tables work best for large amounts of data. Since the number of input strings is known and limited, you could perhaps consider this approach:

Let's assume the known strings are:

const char* STR_TABLE [STR_N] =
{
  "hello",
  "world",
  "this",
  "is",
  "a",
  "number",
  "of",
  "ten",
  "test",
  "strings"
};

Then we can sort them manually in alphabetical order before compiling, since a sorted table can be searched much faster. You can then use binary search.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define STR_N 10


const char* STR_TABLE [STR_N] =
{
  "a",
  "hello",
  "is",
  "number",
  "of",
  "strings",
  "ten",
  "test",
  "this",
  "world"
};


int ptr_strcmp (const void* str1, const void* str2)
{
  return strcmp(str1, *(const char**)str2);
}

int main(void)
{
  const char* user_input = "world"; // worst case
  const char** result;

  // bsearch passes the key first and a pointer to a table element second
  result = bsearch (user_input,
                    STR_TABLE,
                    STR_N,
                    sizeof(const char*),
                    ptr_strcmp);

  if(result != NULL)
  {
    printf("%s\n", *result);
  }
  else
  {
    printf("meh\n");
  }

  return 0;
}

This will boil down to:

Compare "world" with "of", 1 comparison 'w' != 'o'.

Compare "world" with "test", 1 comparison 'w' != 't'.

Compare "world" with "this", 1 comparison 'w' != 't'.

Compare "world" with "world", 5 comparisons.

The total number of character comparisons is 8.

There is of course some overhead code involved in this, checks against '\0' and the binary search call. You'll have to measure the various methods suggested, on your specific platform to find out the best one.
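
To go from a found string to an action, the sorted table could also carry a function pointer per entry. The following is a minimal sketch under that assumption; struct entry, entry_cmp(), dispatch() and the handlers are hypothetical names, and only three keywords are shown.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical handlers; they only record which one ran. */
int last_called = 0;
void do_hello(void) { last_called = 1; }
void do_test(void)  { last_called = 2; }
void do_world(void) { last_called = 3; }

struct entry {
    const char *name;
    void (*handler)(void);
};

/* Must stay sorted by name (strcmp order) for bsearch to work. */
const struct entry table[] = {
    { "hello", do_hello },
    { "test",  do_test  },
    { "world", do_world },
};

/* bsearch passes the key first and a pointer to a table element second. */
int entry_cmp(const void *key, const void *elem)
{
    const struct entry *e = elem;
    return strcmp((const char *)key, e->name);
}

/* Returns 1 if str matched a keyword and its handler ran, 0 otherwise. */
int dispatch(const char *str)
{
    const struct entry *e = bsearch(str, table,
                                    sizeof table / sizeof table[0],
                                    sizeof table[0], entry_cmp);
    if (e == NULL)
        return 0;

    e->handler();
    return 1;
}
```

The table has to be kept sorted by hand whenever a keyword is added; an unsorted table makes bsearch miss entries silently.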

Answer 5:

Maybe a solution could be this:

struct keyword {
    unsigned int hash;
    const char *str;
    void (*job)(void);
};

//A table with our keywords with their corresponding hashes. If you could not
//compute the hash at compile time, a simple init() function at the beginning
//of your program could initialize each entry by using the value in 'str'
//You could also implement a dynamic version of this table (linked list of keywords)
//for extending your keyword table during runtime
struct keyword mykeywords[] = {
    {.hash = HASH_OF_HELLO, .str = "hello", .job = do_hello},
    {.hash = HASH_OF_WORLD, .str = "world", .job = do_world},
    ...
    {.str = 0} //signal end of list of keywords

};

void run(const char *cmd)
{
    unsigned int cmdhash = get_hash(cmd);
    struct keyword *kw = mykeywords;
    while(kw->str) {
        //If hash matches then compare the string, since we should consider hashing collisions too!
        //The order of conditions below is important
        if (kw->hash == cmdhash && !strcmp(cmd, kw->str)) { 
             kw->job();
             break;   
        }
        kw++;
    }
}
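
The init() function mentioned in the comment above could look roughly like this. It is a sketch only: init_keywords(), last_called and the two handlers are hypothetical, get_hash() is copied from the question, and run() is changed to return whether a keyword matched so the result can be checked.

```c
#include <stddef.h>
#include <string.h>

/* Hash function from the question. */
unsigned int get_hash(const char *s)
{
    unsigned int hash = 0;
    int c;

    while ((c = *s++))
        hash = ((hash << 5) + hash) ^ (unsigned int)c;

    return hash;
}

/* Hypothetical handlers; they only record which one ran. */
int last_called = 0;
void do_hello(void) { last_called = 1; }
void do_world(void) { last_called = 2; }

struct keyword {
    unsigned int hash;
    const char *str;
    void (*job)(void);
};

/* Hashes start at 0 and are filled in by init_keywords(). */
struct keyword mykeywords[] = {
    { 0, "hello", do_hello },
    { 0, "world", do_world },
    { 0, NULL, NULL }   /* end of list */
};

/* Call once at startup, before the first run(). */
void init_keywords(void)
{
    struct keyword *kw;

    for (kw = mykeywords; kw->str != NULL; ++kw)
        kw->hash = get_hash(kw->str);
}

/* Returns 1 if cmd matched a keyword and its job ran, 0 otherwise. */
int run(const char *cmd)
{
    unsigned int cmdhash = get_hash(cmd);
    const struct keyword *kw;

    for (kw = mykeywords; kw->str != NULL; ++kw) {
        /* Cheap integer test first; strcmp only on a hash match. */
        if (kw->hash == cmdhash && strcmp(cmd, kw->str) == 0) {
            kw->job();
            return 1;
        }
    }
    return 0;
}
```

Computing the hashes at startup costs one pass over the table but avoids keeping hand-written constants in sync with the strings.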

Source: https://stackoverflow.com/questions/12268062/most-efficient-way-to-match-a-few-strings-in-c

Tags: string, embedded, match