Awk - Compare every line to find a duplicate field, and add some wording to the end of the line


Question


I have a file like the one below, and I am trying to check one field on each line and add some wording if that field has a duplicate elsewhere in the file.

\\FILE04\BUET-PCO;\\SERVER24\DFS\SHARED\CORP\ET\PROJECT CONTROL OFFICE;/FS7_150a/FILE04/BU-D/PROJECT CONTROL OFFICE;10000bytes;9888;;;
\\FILE12\BUAG-GOLDMINE$;\\SERVER24\DFS\SHARED\CAN\AGENCY\GOLDMINE;/FS3_150a/FILE12/BU/AGENCY/GOLDMINE;90000bytes;98834;;;
\\FILE12\BUGB-BUSINTEG$;\\SERVER24\DFS\SHARED\CAN\GB\BUSINTEG;/FS3_150a/FILE12/BU/GB/BUSINTEG;50000bytes;988822;;;
\\FILE12\BUGB-BUSINTEG$;\\SERVER24\DFS\SHARED\CAN\GB\BUSINTEG;/FS3_150a/FILE12/BU/GB/BUSINTEG;50000bytes;988822;other stuff;;

In my example, lines #3 and #4 have the same physical path. I would like a script that compares the third field (for example /FS3_150a/FILE12/BU/GB/BUSINTEG) against the rest of the file and, if it finds an exact match, appends something like "Same Physical Path as Line #" to both lines:

\\FILE04\BUET-PCO;\\SERVER24\DFS\SHARED\CORP\ET\PROJECT CONTROL OFFICE;/FS7_150a/FILE04/BU-D/PROJECT CONTROL OFFICE;10000bytes;9888;;;
\\FILE12\BUAG-GOLDMINE$;\\SERVER24\DFS\SHARED\CAN\AGENCY\GOLDMINE;/FS3_150a/FILE12/BU/AGENCY/GOLDMINE;90000bytes;98834;;;
\\FILE12\BUGB-BUSINTEG$;\\SERVER24\DFS\SHARED\CAN\GB\BUSINTEG;/FS3_150a/FILE12/BU/GB/BUSINTEG;50000bytes;988822;;;Same Physical Path as Line #4
\\FILE12\BUGB-BUSINTEG$;\\SERVER24\DFS\SHARED\CAN\GB\BUSINTEG;/FS3_150a/FILE12/BU/GB/BUSINTEG;50000bytes;988822;other stuff;; Same Physical Path as Line #3

Answer 1:


Here's one way using GNU awk. It is a little hackish, YMMV. Note that file.txt{,} is Bash brace expansion for file.txt file.txt, so the script reads the same file twice: once to collect line numbers, and once to print. Run like:

awk -f script.awk file.txt{,}

Contents of script.awk:

BEGIN {
    FS = ";"                       # fields are separated by semicolons
}

# First pass (FNR==NR only holds while reading the first copy of the
# file): append a "#<line>" marker to the entry for this line's field 3.
FNR==NR {
    array[$3] = array[$3] "#" FNR
    next
}

# Second pass: an entry holding two or more markers means the path is
# duplicated. Use /#.*#/ rather than /#.#/ so multi-digit line numbers
# still match; strip this line's own marker and report the rest.
{
    if ($3 in array && array[$3] ~ /#.*#/) {
        copy = array[$3]
        sub("#"FNR, "", copy)
        printf "%s Same Physical Path as Line %s\n", $0, copy
    }
    else {
        print
    }
}
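
To see how the two passes cooperate: after the first pass, each distinct value of field 3 holds the concatenated markers of every line it appeared on. For the sample input the array would look roughly like this (an illustration only, with entries keyed by the path field):

array["/FS7_150a/FILE04/BU-D/PROJECT CONTROL OFFICE"] = "#1"
array["/FS3_150a/FILE12/BU/AGENCY/GOLDMINE"] = "#2"
array["/FS3_150a/FILE12/BU/GB/BUSINTEG"] = "#3#4"

On the second pass, only the last entry contains two markers, so lines 3 and 4 each get tagged with the other's line number.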

Results:

\\FILE04\BUET-PCO;\\SERVER24\DFS\SHARED\CORP\ET\PROJECT CONTROL OFFICE;/FS7_150a/FILE04/BU-D/PROJECT CONTROL OFFICE;10000bytes;9888;;;
\\FILE12\BUAG-GOLDMINE$;\\SERVER24\DFS\SHARED\CAN\AGENCY\GOLDMINE;/FS3_150a/FILE12/BU/AGENCY/GOLDMINE;90000bytes;98834;;;
\\FILE12\BUGB-BUSINTEG$;\\SERVER24\DFS\SHARED\CAN\GB\BUSINTEG;/FS3_150a/FILE12/BU/GB/BUSINTEG;50000bytes;988822;;; Same Physical Path as Line #4
\\FILE12\BUGB-BUSINTEG$;\\SERVER24\DFS\SHARED\CAN\GB\BUSINTEG;/FS3_150a/FILE12/BU/GB/BUSINTEG;50000bytes;988822;other stuff;; Same Physical Path as Line #3



Answer 2:


This code tackles a simplified version of your problem: it identifies each line whose field 3 duplicates that of a previous line. It does not handle tagging a line that has subsequent duplicates.

awk -F';' '{ tag = ""
             # If field 3 was seen before, tag this line with the line
             # number of the first sighting; otherwise record this line
             # number as the first sighting.
             if (field3[$3] != 0) tag = " Same physical path as line " field3[$3]
             else field3[$3] = NR
             printf "%s%s\n", $0, tag
           }' "$@"

There are probably other ways to organize it, but the key point is to use the associative array field3 to keep track of which names have been seen in field 3, and the line number at which a given name was first seen. This assumes you're processing a single file of input. Look up FNR etc. if you must process multiple files (you also have to decide whether the same name can appear in different files or not); one possible variant is sketched below.
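
For instance, here is a hedged sketch of one multi-file variant (the file:line format and the decision to still report duplicates across files are assumptions, not part of the original answer):

awk -F';' '{ tag = ""
             # Test membership with "in" and record the first sighting
             # as "file:line" so multiple input files stay distinguishable.
             if ($3 in field3) tag = " Same physical path as " field3[$3]
             else field3[$3] = FILENAME ":" FNR
             printf "%s%s\n", $0, tag
           }' "$@"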

It works almost as desired on the data given:

\\FILE04\BUET-PCO;\\SERVER24\DFS\SHARED\CORP\ET\PROJECT CONTROL OFFICE;/FS7_150a/FILE04/BU-D/PROJECT CONTROL OFFICE;10000bytes;9888;;;
\\FILE12\BUAG-GOLDMINE$;\\SERVER24\DFS\SHARED\CAN\AGENCY\GOLDMINE;/FS3_150a/FILE12/BU/AGENCY/GOLDMINE;90000bytes;98834;;;
\\FILE12\BUGB-BUSINTEG$;\\SERVER24\DFS\SHARED\CAN\GB\BUSINTEG;/FS3_150a/FILE12/BU/GB/BUSINTEG;50000bytes;988822;;;
\\FILE12\BUGB-BUSINTEG$;\\SERVER24\DFS\SHARED\CAN\GB\BUSINTEG;/FS3_150a/FILE12/BU/GB/BUSINTEG;50000bytes;988822;other stuff;; Same physical path as line 3

The difficulty with producing the 'tag' on line 3 is that it requires predicting the future, which is hard. To do that, you'd have to slurp the entire file into memory, keeping tabs on the line numbers where a given value in field 3 appears (in general, that could be an extensive list of line numbers), and then iterate through the data, tagging each line appropriately. That is very much harder to do; I'd prefer Perl to awk for that job, though it is probably also feasible to organize the data correctly in awk.
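
For what it's worth, here is a minimal sketch of that slurp-everything approach in awk, assuming the whole file fits in memory (the array names are invented for this sketch):

awk -F';' '
    {
        lines[NR] = $0                   # keep every full line
        key[NR] = $3                     # remember field 3 per line
        seen[$3] = seen[$3] NR " "       # collect line numbers per value
    }
    END {
        # Second sweep over the stored lines: any value seen on more
        # than one line gets tagged with all the other line numbers.
        for (i = 1; i <= NR; i++) {
            tag = ""
            n = split(seen[key[i]], nums, " ")
            if (n > 1) {
                tag = " Same physical path as line(s)"
                for (j = 1; j <= n; j++)
                    if (nums[j] != i) tag = tag " " nums[j]
            }
            print lines[i] tag
        }
    }' file.txt

With the sample data this tags both line 3 and line 4, which is the behaviour the question asked for.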

Were it me, I'd be OK with 90% of the job done: lines with duplicates are identified. If you want the last 10% done, expect it to take the other 90% of the time planned for the first phase.



Source: https://stackoverflow.com/questions/12631430/awk-compare-every-line-to-find-a-duplicate-field-and-add-some-wording-to-the
