Is there any command to do fuzzy matching in Linux based on multiple columns

Question

I have two CSV files. File 1:

D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot
2,66M,J,Rock,F,1995,201211.0
3,David,HM,Lee,M,,201211.0
6,66M,,Rock,F,,201211.0
0,David,H M,Lee,,1990,201211.0
3,Marc,H,Robert,M,2000,201211.0
6,Marc,M,Robert,M,,201211.0
6,Marc,MS,Robert,M,2000,201211.0
3,David,M,Lee,,1990,201211.0
5,Paul,ABC,Row,F,2008,201211.0
3,Paul,ACB,Row,,,201211.0
4,David,,Lee,,1990,201211.0
4,66,J,Rock,,1995,201211.0

File 2

PID,FNAME,MNAME,LNAME,GENDER,DOB
S2,66M,J,Rock,F,1995
S3,David,HM,Lee,M,1990
S0,Marc,HM,Robert,M,2000
S1,Marc,MS,Robert,M,2000
S6,Paul,,Row,M,2008
S7,Sam,O,Baby,F,2018

What I want to do is use the crosswalk file, File 2, to recover the PID of each observation in File 1 based on the columns FNAME, MNAME, LNAME, GENDER, and DOB. Because the corresponding information in File 1 is incomplete, I'm thinking of using fuzzy matching to recover as many PIDs as possible (while, of course, taking accuracy into account). For example, the observations with FNAME "Paul" and LNAME "Row" in File 1 should be assigned the same PID because there is only one similar observation in File 2. But for the observations with FNAME "Marc" and LNAME "Robert", Marc,MS,Robert,M,2000,201211.0 should be assigned PID "S1", Marc,H,Robert,M,2000,201211.0 PID "S0", and Marc,M,Robert,M,,201211.0 either "S0" or "S1".

Since I want to fill in as many of File 1's PIDs as possible while keeping accuracy high, I am considering three steps. First, use a command that assigns a PID to an observation in File 1 if and only if FNAME, MNAME, LNAME, GENDER, and DOB all match exactly. The output should be:

D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,
6,66M,,Rock,F,,201211.0,
0,David,H M,Lee,,1990,201211.0,
3,Marc,H,Robert,M,2000,201211.0,
6,Marc,M,Robert,M,,201211.0,
6,Marc,MS,Robert,M,2000,201211.0,
3,David,M,Lee,,1990,201211.0,
5,Paul,ABC,Row,F,2008,201211.0,
3,Paul,ACB,Row,,,201211.0,
4,David,,Lee,,1990,201211.0,
4,66,J,Rock,,1995,201211.0,

Next, write another command that requires DOB to match exactly but uses fuzzy matching on FNAME, MNAME, LNAME, and GENDER to recover the PIDs of the File 1 observations that were not identified in the first step. The output after these two steps should be:

D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,
6,66M,,Rock,F,,201211.0,
0,David,H M,Lee,,1990,201211.0,S3
3,Marc,H,Robert,M,2000,201211.0,S0
6,Marc,M,Robert,M,,201211.0,
6,Marc,MS,Robert,M,2000,201211.0,S1
3,David,M,Lee,,1990,201211.0,S3
5,Paul,ABC,Row,F,2008,201211.0,S6
3,Paul,ACB,Row,,,201211.0,
4,David,,Lee,,1990,201211.0,S3
4,66,J,Rock,,1995,201211.0,S2

In the final step, use another command that does fuzzy matching on all of the relevant columns (FNAME, MNAME, LNAME, GENDER, and DOB) to fill in the PIDs of the remaining observations. The final output is expected to be:

D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,S3
6,66M,,Rock,F,,201211.0,S2
0,David,H M,Lee,,1990,201211.0,S3
3,Marc,H,Robert,M,2000,201211.0,S0
6,Marc,M,Robert,M,,201211.0,S1
6,Marc,MS,Robert,M,2000,201211.0,S1
3,David,M,Lee,,1990,201211.0,S3
5,Paul,ABC,Row,F,2008,201211.0,S6
3,Paul,ACB,Row,,,201211.0,S6
4,David,,Lee,,1990,201211.0,S3
4,66,J,Rock,,1995,201211.0,S2

I need to keep the order of File 1's observations, so it must be a kind of left outer join. Because my original data is about 100 GB, I want to use Linux tools to handle this. But I have no idea how to complete the last two steps with awk or any other Linux command. Can anyone help? Thank you.

Answer 1:

Here is a shot at it with GNU awk (using PROCINFO["sorted_in"] to pick the most suitable candidate). It hashes file2's values per field and attaches the PID to each value, like field[2]["66M"]="S2". Then, for each record in file1, it counts the PID matches and prints the PID with the highest count. The program expects file2 first and file1 second on the command line:

BEGIN {
    FS=OFS=","
    PROCINFO["sorted_in"]="@val_num_desc"
}
NR==FNR {                                                      # file2
    for(i=1;i<=6;i++)                                          # fields 1-6
        if($i!="") {
            field[i][$i]=field[i][$i] (field[i][$i]==""?"":OFS) $1 # attach PID to value
        }
    next
}
{                                                              # file1
    for(i=1;i<=6;i++) {                                        # fields 1-6
        if($i in field[i]) {                                   # if value matches
            split(field[i][$i],t,FS)                           # get PIDs
            for(j in t) {                                      # and
                matches[t[j]]++                                # increase PID counts
            }
        } else {                                               # if no value match
            for(j in field[i])                                 # for all field values
                if($i~j || j~$i)                               # "go fuzzy" :D
                    matches[field[i][j]]+=0.5                  # fuzzy is half a match
        }
    }
    for(i in matches) {                                        # the best match first
        print $0,i
        delete matches
        break                                                  # we only want the best match
    }
}

Output:

D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,S3
6,66M,,Rock,F,,201211.0,S2
0,David,H M,Lee,,1990,201211.0,S3
3,Marc,H,Robert,M,2000,201211.0,S0
6,Marc,M,Robert,M,,201211.0,S1
6,Marc,MS,Robert,M,2000,201211.0,S1
3,David,M,Lee,,1990,201211.0,S3
5,Paul,ABC,Row,F,2008,201211.0,S6
3,Paul,ACB,Row,,,201211.0,S6
4,David,,Lee,,1990,201211.0,S3
4,66,J,Rock,,1995,201211.0,S2

The "fuzzy match" here is naive, if($i~j || j~$i), which is just a regular-expression test in both directions, but feel free to replace it with any approximate matching algorithm. For example, there are a few implementations of the Levenshtein distance algorithm floating around the internet; Rosetta Code seems to have one.
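
As a rough illustration of what could replace the naive test, here is a minimal sketch of the Levenshtein distance written in POSIX awk and wrapped in a shell function so it can be tried from the command line. The names levenshtein, lev, and min3 are assumptions made for this example and are not part of the program above.

```shell
# levenshtein A B: print the edit distance between the strings A and B.
# Hypothetical helper for illustration; uses POSIX awk only.
levenshtein() {
    awk -v a="$1" -v b="$2" '
    # Smallest of three numbers.
    function min3(x, y, z) {
        if (y < x) x = y
        return (z < x) ? z : x
    }
    # Dynamic programming over two rows: prev is row i-1, cur is row i.
    function lev(s, t,    m, n, i, j, cost, prev, cur) {
        m = length(s); n = length(t)
        for (j = 0; j <= n; j++) prev[j] = j          # distance from empty prefix
        for (i = 1; i <= m; i++) {
            cur[0] = i
            for (j = 1; j <= n; j++) {
                cost = (substr(s, i, 1) == substr(t, j, 1)) ? 0 : 1
                cur[j] = min3(prev[j] + 1,            # deletion
                              cur[j-1] + 1,           # insertion
                              prev[j-1] + cost)       # substitution or match
            }
            for (j = 0; j <= n; j++) prev[j] = cur[j] # roll the rows
        }
        return prev[n]
    }
    BEGIN { print lev(a, b) }'
}

levenshtein "HM" "H M"    # prints 1
levenshtein "ABC" "ACB"   # prints 2
```

To use it in the GNU awk program, the two awk functions would be pasted into the program itself and the test replaced with something like lev($i,j)<=1, where the threshold of 1 is an assumption to tune against your data. Calling the shell wrapper once per pair of values would be far too slow for data of this size.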

You didn't mention how big file2 is, but if it's way beyond your memory capacity, you may want to consider splitting the files somehow.
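
One possible way to split, sketched here under assumptions that may not fit your data, is to bucket both files by a cheap key so that each bucket of file2 fits in memory. The function name bucket_by_lname, the choice of the first letter of LNAME (column 4) as the key, and the output naming PREFIX.<key>.csv are all hypothetical. The original line number is appended as a last column so that file1's order can be restored after the buckets are processed.

```shell
# bucket_by_lname FILE PREFIX
# Split a headered CSV into PREFIX.<key>.csv, where <key> is the upper-cased
# first letter of column 4 (LNAME), or "_" when that is empty or not a letter.
# The original line number is appended as an extra last column.
bucket_by_lname() {
    awk -F, -v prefix="$2" '
    NR == 1 { next }                            # skip the header line
    {
        key = toupper(substr($4, 1, 1))
        if (key !~ /^[A-Z]$/) key = "_"         # catch-all bucket
        out = prefix "." key ".csv"
        print $0 "," NR > out                   # at most 27 files stay open
    }' "$1"
}
```

Each pair of buckets with the same key could then be matched separately, and the combined results sorted numerically on the line-number column to get back file1's order. The trade-off is that two records whose LNAME differs in its first letter are never compared, so the key should be a field you trust.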

Update: A version that maps file1 fields to file2 fields (as mentioned in comments):

BEGIN {
    FS=OFS=","
    PROCINFO["sorted_in"]="@val_num_desc"
    map[1]=1                                                   # map file1 fields to file2 fields
    map[2]=3
    map[3]=4
    map[4]=2
    map[5]=5
    map[7]=6
}
NR==FNR {                                                      # file2
    for(i=1;i<=6;i++)                                          # fields 1-6
        if($i!="") {
            field[i][$i]=field[i][$i] (field[i][$i]==""?"":OFS) $1 # attach PID to value
        }
    next
}
{                                                              # file1
    for(i in map) {
        if($i in field[map[i]]) {                              # if value matches
            split(field[map[i]][$i],t,FS)                      # get PIDs
            for(j in t) {                                      # and
                matches[t[j]]++                                # increase PID counts
            }
        } else {                                               # if no value match
            for(j in field[map[i]])                            # for all field values
                if($i~j || j~$i)                               # "go fuzzy" :D
                    matches[field[map[i]][j]]+=0.5             # fuzzy is half a match
        }
    }
    for(i in matches) {                                        # the best match first
        print $0,i
        delete matches
        break                                                  # we only want the best match
    }
}

Source: https://stackoverflow.com/questions/58254198/is-there-any-command-to-do-fuzzy-matching-in-linux-based-on-multiple-columns

Tags

Linux

join

awk

levenshtein-distance