remove duplicate lines with similar prefix

无人及你 2021-01-15 17:04

I need to remove the similar lines in a file that share a duplicate prefix, and keep the unique ones.

From this,

abc/def/ghi/
abc/def/ghi/jkl/one/
abc/def/gh         


        
4 answers
  •  轮回少年
    2021-01-15 17:46

    Step 1: This solution assumes that reordering the output is allowed. If so, it should be faster to reverse sort the input file before processing. After a reverse sort, a line that is a prefix of another line normally ends up directly after a line that extends it, so each iteration only needs to compare two consecutive lines; there is no need to search the whole file or a list of "known prefixes". I take a line to be a prefix, and therefore removable, if it is a prefix of any other line. Here is an example that removes prefixes from a file when reordering is allowed:

    #!/bin/bash
    
    f=sample.txt                                 # sample data
    
    p=''                                         # previous line = empty
    
    sort -r "$f" | \
      while IFS= read -r s || [[ -n "$s" ]]; do  # reverse sort, then read string (line)
        [[ "$s" = "${p:0:${#s}}" ]] || \
          printf "%s\n" "$s"                     # if s is not prefix of p, then print it
        p="$s"
      done
    

    Explanation: ${p:0:${#s}} takes the first ${#s} characters (${#s} is the length of s) of the string p.
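
    To make the substring expansion concrete, here is a minimal sketch of the same comparison wrapped in a function. The name is_prefix is hypothetical (it does not appear in the script above), and the sample strings are only for illustration:

```shell
#!/bin/bash
# is_prefix is a hypothetical helper name, not part of the original script.
# It succeeds (exit status 0) when $1 is a prefix of $2.
is_prefix() {
  local s=$1 p=$2
  # ${#s} is the length of s; ${p:0:${#s}} is the first ${#s} characters of p.
  [[ "$s" = "${p:0:${#s}}" ]]
}

p='abc/def/ghi/jkl/one/'
echo "${p:0:12}"                                        # prints abc/def/ghi/
is_prefix 'abc/def/ghi/' "$p" && echo 'prefix'          # prints prefix
is_prefix 'abc/def/gh1'  "$p" || echo 'not a prefix'    # prints not a prefix
```

    Note that an empty string counts as a prefix of everything under this test, so blank input lines are dropped by the loop.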

    Test:

    $ cat sample.txt 
    abc/def/ghi/
    abc/def/ghi/jkl/one/
    abc/def/ghi/jkl/two/
    abc/def/ghi/jkl/one/one
    abc/def/ghi/jkl/two/two
    123/456/
    123/456/789/
    xyz/
    
    $ ./remove-prefix.sh 
    xyz/
    abc/def/ghi/jkl/two/two
    abc/def/ghi/jkl/one/one
    123/456/789/
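
    The loop relies on every line sitting next to its extensions after the sort. As a hedged sketch, the same idea can be written as a function that reads from stdin and forces bytewise ordering with LC_ALL=C. Both the function name remove_prefixes and the LC_ALL=C setting are assumptions added here, not part of the script above; locale-aware collation may order punctuation differently and separate a prefix from the lines that extend it.

```shell
#!/bin/bash
# remove_prefixes is a hypothetical name. It reads lines on stdin and prints
# every line that is not a prefix of another line, in reverse sorted order.
remove_prefixes() {
  local p='' s
  # Assumption: LC_ALL=C forces bytewise ordering, so any line that extends s
  # sorts directly before s in the reversed output.
  LC_ALL=C sort -r |
    while IFS= read -r s || [[ -n "$s" ]]; do
      [[ "$s" = "${p:0:${#s}}" ]] || printf '%s\n' "$s"   # s is not a prefix of p
      p=$s
    done
}

printf '%s\n' 'abc/def/ghi/' 'abc/def/ghi/jkl/one/' '123/456/' '123/456/789/' 'xyz/' |
  remove_prefixes
# prints:
# xyz/
# abc/def/ghi/jkl/one/
# 123/456/789/
```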
    

    Step 2: If you really need to keep the original order, this script is an example that removes all prefixes without reordering the output:

    #!/bin/bash
    
    f=sample.txt
    p=''
    
    cat -n "$f" | \
      sed 's:\t:|:' | \
      sort -r -t'|' -k2 | \
      while IFS='|' read -r i s || [[ -n "$s" ]]; do
        [[ "$s" = "${p:0:${#s}}" ]] || printf "%s|%s\n" "$i" "$s"
        p="$s"
      done | \
      sort -n -t'|' -k1 | \
      sed 's:^.*|::'
    

    Explanations:

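    1. cat -n: number all lines
    2. sed 's:\t:|:': replace the tab after the line number with '|' as the delimiter; choose another character if your lines can contain '|'
    3. sort -r -t'|' -k2: reverse sort with '|' as the delimiter, keyed on everything from field 2 onward (the line text)
    4. while ... done: same loop as in the step 1 solution, but it carries the line number along
    5. sort -n -t'|' -k1: numeric sort on the line number to restore the original order
    6. sed 's:^.*|::': remove the numbering (the match is greedy, so it strips everything up to the last '|' on the line)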
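
    The six stages follow a decorate, sort, filter, restore, undecorate pattern. As a rough sketch of that pattern (not the script above), the variant below reads from stdin, numbers lines with awk instead of cat -n plus sed, and strips the number with cut instead of sed. The function name remove_prefixes_keep_order, the awk and cut substitutions, and LC_ALL=C are all assumptions made for this illustration, and the input is assumed to look like the sample paths (no '|' characters).

```shell
#!/bin/bash
# remove_prefixes_keep_order is a hypothetical name. It reads lines on stdin
# and prints, in input order, every line that is not a prefix of another line.
# Assumption: awk and cut stand in for the cat -n / sed stages of the answer.
remove_prefixes_keep_order() {
  local p='' i s
  awk '{ print NR "|" $0 }' |        # decorate: "<line number>|<line>"
    LC_ALL=C sort -r -t'|' -k2 |     # reverse sort on the line text
    while IFS='|' read -r i s; do
      # keep the numbered line unless s is a prefix of the previous line p
      [[ "$s" = "${p:0:${#s}}" ]] || printf '%s|%s\n' "$i" "$s"
      p=$s
    done |
    sort -n -t'|' -k1,1 |            # restore the original order
    cut -d'|' -f2-                   # undecorate: drop the line number
}

printf '%s\n' 'abc/def/ghi/' 'abc/def/ghi/jkl/one/' '123/456/' '123/456/789/' 'xyz/' |
  remove_prefixes_keep_order
# prints:
# abc/def/ghi/jkl/one/
# 123/456/789/
# xyz/
```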

    Test:

    $ ./remove-prefix.sh 
    abc/def/ghi/jkl/one/one
    abc/def/ghi/jkl/two/two
    123/456/789/
    xyz/
    

    Notes: In both solutions, the most expensive operations are the calls to sort. The step 1 solution calls sort once, and the step 2 solution calls it twice. All other operations (cat, sed, while, string comparison, ...) are cheaper by comparison.

    In the step 2 solution, cat + sed + while + sed is "equivalent" to scanning the file 4 times (which in theory can run in parallel, because the stages are connected by pipes).
