Question
I have a CSV file with about 15,000-25,000 lines (of fixed size), and I want to know how I can detect duplicate lines using the C language.
An example of the output looks like this:
0123456789;CUST098WZAX;35
I have no memory or time constraints, so I want the simplest solution.
Thanks for your help.
Answer 1:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One node per unique line, chained per hash bucket. */
struct somehash {
    struct somehash *next;
    unsigned hash;
    char *mem;
};

#define THE_SIZE 100000

struct somehash *table[THE_SIZE] = { NULL, };

struct somehash **some_find(char *str, size_t len);
static unsigned some_hash(char *str, size_t len);

int main(void)
{
    char buffer[100];
    struct somehash **pp;
    size_t len;

    while (fgets(buffer, sizeof buffer, stdin)) {
        len = strlen(buffer);
        pp = some_find(buffer, len);
        if (*pp) {  /* found: this line was seen before */
            fprintf(stderr, "Duplicate: %s", buffer);
        }
        else {      /* not found: create a node and keep the line */
            fprintf(stdout, "%s", buffer);
            *pp = malloc(sizeof **pp);
            (*pp)->next = NULL;
            (*pp)->hash = some_hash(buffer, len);
            (*pp)->mem = malloc(1 + len);
            memcpy((*pp)->mem, buffer, 1 + len);
        }
    }
    return 0;
}

/* Return the address of the pointer that points to the matching node,
 * or, if there is no match, the address of the NULL tail pointer where
 * a new node should be hung. */
struct somehash **some_find(char *str, size_t len)
{
    unsigned hash;
    unsigned slot;
    struct somehash **hnd;

    hash = some_hash(str, len);
    slot = hash % THE_SIZE;
    for (hnd = &table[slot]; *hnd; hnd = &(*hnd)->next) {
        if ((*hnd)->hash != hash) continue;      /* cheap rejection */
        if (strcmp((*hnd)->mem, str)) continue;  /* full comparison */
        break;
    }
    return hnd;
}

static unsigned some_hash(char *str, size_t len)
{
    unsigned val;
    size_t idx;

    if (!len) len = strlen(str);

    val = 0;
    for (idx = 0; idx < len; idx++) {
        val ^= (val >> 2) ^ (val << 5) ^ (val << 13) ^ str[idx] ^ 0x80001801;
    }
    return val;
}
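The program reads lines from stdin: unique lines go to stdout and duplicates are reported on stderr, so (assuming the binary is named dedup) a run might look like ./dedup < input.csv > unique.csv.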
Answer 2:
I'm not sure if this is the simplest solution, but...
If every entry looks like this:
0123456789;CUST098WZAX;35
... and the final two characters are always a value from 00-99, you could use this value to index a bucket. This bucket is one item of an array of 100 (i.e. 0-99, matching the values), each of which points to a linked list of structures that store the strings belonging to that bucket.
With the strings sorted into buckets, the number of full-string comparisons required to identify duplicates is (hopefully) greatly reduced: you only compare strings that land in the same bucket.
If all entries had the same final value, every line would land in the same bucket, degrading this method to O(n^2) for the comparison step alone. But assuming a varied distribution of values, this method will be faster in practice.
(I have, of course, just described a hash table, but with a more naive hash function than would normally be used.)
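A minimal sketch of this bucketing scheme, assuming every line ends in a two-digit value just before the newline; the struct and function names here are illustrative, not from the answer:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 100

/* Linked-list node holding one stored line. */
struct node {
    struct node *next;
    char *line;
};

static struct node *buckets[NBUCKETS];

/* Derive the bucket index from the last two digit characters of the
 * line (ignoring the trailing newline). Returns -1 if they are not digits. */
static int bucket_of(const char *line, size_t len)
{
    if (len >= 1 && line[len - 1] == '\n')
        len--;
    if (len < 2)
        return -1;
    char hi = line[len - 2], lo = line[len - 1];
    if (hi < '0' || hi > '9' || lo < '0' || lo > '9')
        return -1;
    return (hi - '0') * 10 + (lo - '0');
}

/* Returns 1 if the line was already present, 0 if it was inserted. */
static int seen_before(const char *line, size_t len)
{
    int b = bucket_of(line, len);
    if (b < 0) b = 0;               /* fall back to bucket 0 */

    for (struct node *p = buckets[b]; p; p = p->next)
        if (strcmp(p->line, line) == 0)
            return 1;               /* duplicate found in this bucket */

    struct node *n = malloc(sizeof *n);
    n->line = malloc(len + 1);
    memcpy(n->line, line, len + 1);
    n->next = buckets[b];
    buckets[b] = n;
    return 0;
}

int main(void)
{
    char buffer[100];
    while (fgets(buffer, sizeof buffer, stdin)) {
        if (seen_before(buffer, strlen(buffer)))
            fprintf(stderr, "Duplicate: %s", buffer);
        else
            fputs(buffer, stdout);
    }
    return 0;
}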
Answer 3:
The simplest algorithm:
- Load original file in the memory as the array A of lines.
- Create a separate array B of the same size.
- Iterate over A. Do a linear search for the current line in B. If it is not there, add it to B and to the output file.
This is a very simple, brute-force, inefficient O(n^2) solution, and very straightforward to implement assuming you have basic C skills (see the sketch below).
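A rough sketch of that brute-force approach, assuming the lines fit the stated bounds (the array sizes and names are illustrative):

#include <stdio.h>
#include <string.h>

#define MAX_LINES 25000
#define LINE_LEN  100

static char a[MAX_LINES][LINE_LEN];  /* A: all lines from the file */
static char b[MAX_LINES][LINE_LEN];  /* B: unique lines seen so far */

int main(void)
{
    size_t na = 0, nb = 0, i, j;

    /* Load the whole file into A. */
    while (na < MAX_LINES && fgets(a[na], LINE_LEN, stdin))
        na++;

    /* For each line in A, do a linear search in B. */
    for (i = 0; i < na; i++) {
        int found = 0;
        for (j = 0; j < nb; j++) {
            if (strcmp(a[i], b[j]) == 0) { found = 1; break; }
        }
        if (!found) {
            strcpy(b[nb++], a[i]);   /* remember it... */
            fputs(a[i], stdout);     /* ...and emit it once */
        }
    }
    return 0;
}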
By the way, if the order does not matter, you may sort the file first, and then the task is even more straightforward: keep a variable for the last line, check the current line against it, and skip the current line if it equals the last one.
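For example, using qsort from the C standard library and comparing each line against its predecessor (a sketch; note the output order becomes sorted order):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LINES 25000
#define LINE_LEN  100

static char lines[MAX_LINES][LINE_LEN];

/* qsort comparator: each element is one fixed-size line slot. */
static int cmp(const void *x, const void *y)
{
    return strcmp((const char *)x, (const char *)y);
}

int main(void)
{
    size_t n = 0, i;

    while (n < MAX_LINES && fgets(lines[n], LINE_LEN, stdin))
        n++;

    qsort(lines, n, LINE_LEN, cmp);

    /* After sorting, duplicates are adjacent: print a line only
     * when it differs from the previous one. */
    for (i = 0; i < n; i++) {
        if (i == 0 || strcmp(lines[i], lines[i - 1]) != 0)
            fputs(lines[i], stdout);
    }
    return 0;
}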
Source: https://stackoverflow.com/questions/10189594/detecting-duplicate-lines-on-file-using-c