`uniq` without sorting an immense text file?

我在风中等你 2020-12-18 07:01

I have a stupidly large text file (i.e. 40 gigabytes as of today) that I would like to filter for unique lines without sorting the file.

The file ha

6 Answers
  • 2020-12-18 07:06

    I don't have your data (or anything like it) handy, so I can't test this, but here's a proof of concept for you:

    $ t='one\ntwo\nthree\none\nfour\nfive\n'
    $ printf "$t" | nl -w14 -nrz -s, | sort -t, -k2 -u | sort -n | cut -d, -f2-
    one
    two
    three
    four
    five
    

    Our raw data includes one duplicated line. The pipes function as follows:

    • nl adds line numbers. It's a standard, low-impact unix tool.
    • sort the first time 'round sorts on the SECOND field -- what would have been the beginning of the line before nl. Adjust this as required for your data.
    • sort the second time puts things back in the order defined by the nl command.
    • cut merely strips off the line numbers. There are multiple ways to do this, but some of them depend on your OS. This one's portable, and works for my example.

    Now... For obscenely large files, the sort command will need some additional options. In particular, --buffer-size and --temporary-directory. Read man sort for details about this.
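
    For example, here is a sketch of the same pipeline with those knobs set (huge.txt, the 8G buffer size and the /bigdisk/tmp directory are placeholders -- adjust them to your machine):

    nl -w14 -nrz -s, huge.txt |
        sort -t, -k2 -u --buffer-size=8G --temporary-directory=/bigdisk/tmp |
        sort -n --buffer-size=8G --temporary-directory=/bigdisk/tmp |
        cut -d, -f2- > unique_lines.txt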

    I can't say I expect this to be fast, and I suspect you'll be using a ginormous amount of disk IO, but I don't see why it wouldn't at least work.

  • 2020-12-18 07:12

    Assuming you can sort the file in the first place (i.e. that you can get sort file to work) then I think something like this might work (depends on whether a large awk script file is better than a large awk array in terms of memory usage/etc.).

    sort file | uniq -dc | awk '{gsub("\"", "\\\"", $0); print "$0==\""substr($0, index($0, $1) + length($1) + 1)"\"{x["NR"]++; if (x["NR"]>1){next}}"} END{print 7}' > dedupe.awk
    awk -f dedupe.awk file
    

    Which on a test input file like:

    line 1
    line 2
    line 3
    line 2
    line 2
    line 3
    line 4
    line 5
    line 6
    

    creates an awk script of:

    $0=="line 2"{x[1]++; if (x[1]>1){next}}
    $0=="line 3"{x[2]++; if (x[2]>1){next}}
    7
    

    and run as awk -f dedupe.awk file (the trailing 7 is a pattern that is always true, so awk's default action prints every line that wasn't skipped by next) outputs:

    line 1
    line 2
    line 3
    line 4
    line 5
    line 6
    

    If the size of the awk script itself is a problem (probably unlikely) you could cut that down by using another sentinel value, something like:

    sort file | uniq -dc | awk 'BEGIN{print "{f=1}"} {gsub("\"", "\\\"", $0); print "$0==\""substr($0, index($0, $1) + length($1) + 1)"\"{x["NR"]++;f=(x["NR"]<=1)}"} END{print "f"}'
    

    which cuts seven characters off each line (six if you remove the space from the original too) and generates:

    {f=1}
    $0=="line 2"{x[1]++;f=(x[1]<=1)}
    $0=="line 3"{x[2]++;f=(x[2]<=1)}
    f
    

    This solution will probably run slower though because it doesn't short-circuit the script as matches are found.

    If runtime of the awk script is too great it might even be possible to improve the time by sorting the duplicate lines based on match count (but whether that matters is going to be fairly data dependent).
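
    For what it's worth, here's an untested sketch of that variant: since uniq -dc puts the count first, adding a numeric reverse sort between uniq and awk makes the rules for the most frequently duplicated lines appear at the top of the generated script, so they get compared first:

    sort file | uniq -dc | sort -rn | awk '{gsub("\"", "\\\"", $0); print "$0==\""substr($0, index($0, $1) + length($1) + 1)"\"{x["NR"]++; if (x["NR"]>1){next}}"} END{print 7}' > dedupe.awk
    awk -f dedupe.awk file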

  • 2020-12-18 07:13

    The awk '!x[$0]++' trick is one of the most elegant solutions to de-duplicate a file or stream without sorting. However, it is inefficient in terms of memory and unsuitable for large files, since it saves all unique lines into memory.

    A more memory-efficient implementation is to save a constant-length hash representation of each line in the array rather than the whole line. You can achieve this with Perl in one line, and it is quite similar to the awk script.

    perl -ne 'use Digest::MD5 qw(md5_base64); print unless $seen{md5_base64($_)}++' huge.txt
    

    Here I used md5_base64 instead of md5_hex because the base64 encoding takes 22 bytes, while the hex representation takes 32.
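
    You can verify the two lengths from the shell if you're curious (a throwaway check, not part of the solution):

    perl -MDigest::MD5=md5_hex,md5_base64 -e 'print length(md5_hex("")), " ", length(md5_base64("")), "\n"'
    32 22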

    However, since the Perl implementation of hashes still requires around 120 bytes for each key, you may quickly run out of memory for your huge file.

    The solution in this case is to process the file in chunks, splitting manually or using GNU Parallel with the --pipe, --keep-order and --block options (taking advantage of the fact that duplicate lines are not far apart, as you mentioned). Here is how you could do it with parallel:

    cat huge.txt | pv | 
    parallel --pipe --keep-order --block 100M -j4 -q \
    perl -ne 'use Digest::MD5 qw(md5_base64); print unless $seen{md5_base64($_)}++' > uniq.txt
    

    The --block 100M option tells parallel to process the input in chunks of 100MB. -j4 means start 4 processes in parallel. An important argument here is --keep-order, since you want the output of unique lines to remain in the original order. I have included pv in the pipeline to get some nice statistics while the long-running process is executing.

    In a benchmark I performed with a random-data 1GB file, I reached a 130MB/sec throughput with the above settings, meaning you could de-duplicate your 40GB file in roughly five minutes (if you have a sufficiently fast hard disk able to write at this rate).

    Other options include:

    • Use an efficient trie structure to store keys and check for duplicates. For example a very efficient implementation is marisa-trie coded in C++ with wrappers in Python.
    • Sort your huge file with an external merge sort or distribution/bucket sort
    • Store your file in a database and use SELECT DISTINCT on an indexed column containing your lines (or, more efficiently, the MD5 sums of your lines); a rough sketch follows this list.
    • Or use bloom filters
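
    For the database route, here is a rough sqlite3 sketch (dedupe.db and the awk-based quoting are my own choices; INSERT OR IGNORE keeps only the first occurrence of each line, and ORDER BY rowid returns the kept lines in their original order):

    sqlite3 dedupe.db 'CREATE TABLE t(line TEXT UNIQUE);'
    { echo 'BEGIN;'
      awk '{gsub("\047", "\047\047"); print "INSERT OR IGNORE INTO t(line) VALUES(\047" $0 "\047);"}' huge.txt
      echo 'COMMIT;'
    } | sqlite3 dedupe.db
    sqlite3 dedupe.db 'SELECT line FROM t ORDER BY rowid;' > uniq.txt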

    Here is an example of using the Bloom::Faster module of Perl:

    perl -e 'use Bloom::Faster; my $f = new Bloom::Faster({n => 100000000, e => 0.00001}); while(<>) { print unless $f->add($_); }' huge.txt > uniq.txt
    

    You may install Bloom::Faster from CPAN (sudo cpan and then install "Bloom::Faster")

    Explanation:

    • You have to specify the probabilistic error rate e and the number of available buckets n. The memory required for each bucket is about 2.5 bytes. If your file has 100 million unique lines then you will need 100 million buckets and around 260MB of memory.
    • The $f->add($_) function adds the hash of a line to the filter and returns true if the key (i.e. the line here) is a duplicate.
    • You can get an estimation of the number of unique lines in your file, parsing a small section of your file with dd if=huge.txt bs=400M count=1 | awk '!a[$0]++' | wc -l (400MB) and multiplying that number by 100 (40GB). Then set the n option a little higher to be on the safe side.

    In my benchmarks, this method achieved a 6MB/s processing rate. You may combine this approach with the GNU parallel suggestion above to utilize multiple cores and achieve a higher throughput.
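
    A rough sketch of that combination, reusing the two commands above (the per-worker n=30000000 is a placeholder; note that each worker keeps its own filter, so duplicates that land in different blocks can slip through, just as with the %seen version):

    cat huge.txt | pv |
    parallel --pipe --keep-order --block 100M -j4 -q \
    perl -e 'use Bloom::Faster; my $f = new Bloom::Faster({n => 30000000, e => 0.00001}); while(<>) { print unless $f->add($_); }' > uniq.txt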

  • 2020-12-18 07:15

    Maybe not the answer you've been looking for, but here goes: use a Bloom filter (https://en.wikipedia.org/wiki/Bloom_filter). This sort of problem is one of the main reasons they exist.

  • 2020-12-18 07:17

    If there's a lot of duplication, one possibility is to split the file using split(1) into manageable pieces and use something conventional like sort/uniq to make a summary of unique lines for each piece. Each summary will be much shorter than the piece it came from. After this, you can compare the per-piece summaries to arrive at an overall summary.
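
    For instance, a rough sketch (the piece size and file names are arbitrary): split into fixed-size pieces, summarise each piece with sort -u, then merge the already-sorted summaries with sort -m -u:

    split -l 10000000 huge.txt piece.
    for p in piece.*; do sort -u "$p" > "$p.uniq"; done
    sort -m -u piece.*.uniq > unique_lines.txt

    Note that, unlike the approaches that preserve line order, this gives you the set of unique lines in sorted order rather than in their original order.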

  • 2020-12-18 07:29

    I'd do it like this:

    #! /bin/sh
    usage ()
    {
        echo "Usage:  ${0##*/} <file> [<lines>]" >&2
        exit 1
    }
    
    
    if [ $# -lt 1 -o $# -gt 2 -o ! -f "$1" ]; then usage; fi
    if [ "$2" ]; then
        expr "$2" : '[1-9][0-9]*$' >/dev/null || usage
    fi
    
    LC_ALL=C
    export LC_ALL
    
    split -l ${2:-10000} -d -a 6 "$1"
    
    for x in x*; do
        awk '!x[$0]++' "$x" >"y${x}" && rm -f "$x"
    done
    
    cat $(sort -n yx*) | sort | uniq -d | \
        while IFS= read -r line; do
            fgrep -x -n "$line" /dev/null yx* | sort -n | sed 1d | \
                while IFS=: read -r file nr rest; do
                    sed -i -e ${nr}d "$file"
                done
        done
    
    cat $(sort -n yx*) >uniq_"$1" && rm -f yx*
    

    (proof of concept; needs more polishing before being used in production).

    What's going on here:

    • split splits the file in chunks of 10000 lines (configurable); the chunks are named x000000, x000001, ...
    • awk removes duplicates from each chunk, without messing with the line order; the resulting files are yx000000, yx000001, ... (since awk can't portably do changes in place)
    • cat $(sort -n yx*) | sort | uniq -d reassembles the chunks and finds a list of duplicates; because of the way the chunks were constructed, each duplicated line can appear at most once in each chunk
    • fgrep -x -n "$line" /dev/null yx* finds where each duplicated line lives; the result is a list of lines yx000005:23:some text
    • sort -n | sed 1d removes the first chunk from the list above (this is the first occurrence of the line, and it should be left alone)
    • IFS=: read -r file nr rest splits yx000005:23:some text into file=yx000005, nr=23, and the rest
    • sed -i -e ${nr}d "$file" removes line $nr from chunk $file
    • cat $(sort -n yx*) reassembles the chunks; they need to be sorted, to make sure they come in the right order.

    This is probably not very fast, but I'd say it should work. Increasing the number of lines in each chunk from 10000 can speed things up, at the expense of using more memory. The operation is O(N^2) in the number of duplicate lines across chunks; with luck, this wouldn't be too large.

    The above assumes GNU sed (for -i). It also assumes there are no files named x* or yx* in the current directory (that's the part that could use some cleanup, perhaps by moving the junk into a directory created by mktemp -d).

    Edit: Second version, after feedback from @EtanReisner:

    #! /bin/sh
    usage ()
    {
        echo "Usage:  ${0##*/} <file> [<lines>]" >&2
        exit 1
    }
    
    
    if [ $# -lt 1 -o $# -gt 2 -o ! -f "$1" ]; then usage; fi
    if [ "$2" ]; then
        expr "$2" : '[1-9][0-9]*$' >/dev/null || usage
    fi
    
    tdir=$(mktemp -d -p "${TEMP:-.}" "${0##*/}_$$_XXXXXXXX") || exit 1
    dupes=$(mktemp -p "${TEMP:-.}" "${0##*/}_$$_XXXXXXXX") || exit 1
    
    trap 'rm -rf "$tdir" "$dupes"' EXIT HUP INT QUIT TERM
    
    LC_ALL=C
    export LC_ALL
    
    split -l ${2:-10000} -d -a 6 "$1" "${tdir}/x"
    
    ls -1 "$tdir" | while IFS= read -r x; do
        awk '!x[$0]++' "${tdir}/${x}" >"${tdir}/y${x}" && \
        rm -f "${tdir}/$x" || exit 1
    done
    
    find "$tdir" -type f -name 'yx*' | \
        xargs -n 1 cat | \
        sort | \
        uniq -d >"$dupes" || exit 1
    
    find "$tdir" -type f -name 'yx*' -exec fgrep -x -n -f "$dupes" /dev/null {} + | \
        sed 's!.*/!!' | \
        sort -t: -n -k 1.3,1 -k 2,2 | \
        perl -e '
            while(<STDIN>) {
                chomp;
                m/^(yx\d+):(\d+):(.*)$/o;
                if ($dupes{$3}++)
                    { push @{$del{$1}}, int($2) }
                else
                    { $del{$1} ||= [] }   # do not clobber deletions already recorded for this chunk
            }
            undef %dupes;
    
            chdir $ARGV[0];
    
            for $fn (sort <"yx*">) {
                open $fh, "<", $fn
                    or die qq(open $fn: $!);
                $line = $idx = 0;
                while(<$fh>) {
                    $line++;
                    if ($idx < @{$del{$fn}} and $line == $del{$fn}->[$idx])
                        { $idx++ }
                    else
                        { print }
                }
                close $fh
                    or die qq(close $fn: $!);
                unlink $fn
                    or die qq(remove $fn: $!);
            }
        ' "$tdir" >uniq_"$1" || exit 1
    