Efficient recent set-membership tests in C for deduplication?

爷,独闯天下 提交于 2019-12-08 11:22:54

问题


I have an unlimited number of 12 byte messages arriving. The content can be treated as random and structureless. (The length is only important because it is shorter than most hashes.)

I want to deduplicate them.

One method is to store the last 1,000 messages in a circular buffer and check all 1,000 message for a match before accepting a message (and inserting it into the circular buffer for future checks).

What other methods are there, which could be more CPU and memory efficient?


回答1:


12 Bytes seem quite small length. You can cast your byte array to string and then use a tree structure based on strings, via exploiting strcmp().

  1. Way to cast byte array to string

  2. Tree structure based on strings

Unless you form a skewed tree, O(logn) would be your worst case for deduplication. In such case, it is not hard to change to a self-balancing tree too.

Here my BST implementation using string type keys:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>


struct Node
{
  char *key;
  char *value;
  struct Node *left;
  struct Node *right;
};

struct Node* newNode(char *strKey,char *strValue)
{
  struct Node *tmp = (struct Node*) malloc(sizeof(struct Node));
  tmp->key = strdup(strKey);
  tmp->value = strdup(strValue);
  tmp->left = NULL;
  tmp->right = NULL;
  return tmp;
}

struct Node* insert(struct Node* node, char *newKey, char *newValue)
{
  if (node == NULL)
    return newNode(newKey,newValue);

  int comparison = strcmp(newKey,node->key); 
  if (comparison < 0)
    node->left  = insert(node->left, newKey, newValue);
  else if (comparison > 0)
    node->right = insert(node->right, newKey, newValue);  
  else
  {
    printf("Error occured while insert to BST\n");
    return NULL;
  }

  return node;
}

struct Node* deleteNode(struct Node* node, char *key2del)
{
  if (node == NULL)
    return node;

  int comparison = strcmp(key2del,node->key);
  if (comparison < 0)
    node->left = deleteNode(node->left, key2del);
  else if (comparison > 0)
    node->right = deleteNode(node->right, key2del);
  else // where deletion occurs
  {
    if (node->left == NULL)
    {
      struct Node *tmp = node->right;
      free(node);
      return tmp;
    }
    else if (node->right == NULL)
    {
      struct Node *tmp = node->left;
      free(node);
      return tmp;
    }

    struct Node *tmp = node->right;
    while(tmp->left != NULL)
      tmp = tmp->left;

    node->key = tmp->key;
    node->right = deleteNode(node->right, tmp->key);
  }

  return node;
}


来源:https://stackoverflow.com/questions/49628455/efficient-recent-set-membership-tests-in-c-for-deduplication

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!