`uniq` without sorting an immense text file?

我在风中等你  2020-12-18 07:01

I have a stupidly large text file (i.e. 40 gigabytes as of today) that I would like to filter for unique lines without sorting the file.

The file ha

6 answers
  •  悲哀的现实
    2020-12-18 07:29

    I'd do it like this:

    #! /bin/sh
    usage ()
    {
        echo "Usage:  ${0##*/} <file> [<lines per chunk>]" >&2
        exit 1
    }
    
    
    if [ $# -lt 1 -o $# -gt 2 -o ! -f "$1" ]; then usage; fi
    if [ "$2" ]; then
        expr "$2" : '[1-9][0-9]*$' >/dev/null || usage
    fi
    
    LC_ALL=C
    export LC_ALL
    
    split -l ${2:-10000} -d -a 6 "$1"
    
    for x in x*; do
        awk '!x[$0]++' "$x" >"y${x}" && rm -f "$x"
    done
    
    cat $(ls yx* | sort) | sort | uniq -d | \
        while IFS= read -r line; do
            fgrep -x -n "$line" /dev/null yx* | sort -n | sed 1d | \
                while IFS=: read -r file nr rest; do
                    sed -i -e ${nr}d "$file"
                done
        done
    
    cat $(ls yx* | sort) >uniq_"$1" && rm -f yx*
    

    (proof of concept; needs more polishing before being used in production).

    What's going on here:

    • split splits the file into chunks of 10000 lines (configurable); the chunks are named x000000, x000001, ...
    • awk removes duplicates from each chunk, without messing with the line order; the resulting files are yx000000, yx000001, ... (since awk can't portably do changes in place); see the short demo after this list
    • cat $(ls yx* | sort) | sort | uniq -d reassembles the chunks and finds a list of duplicates; because of the way the chunks were constructed, each duplicated line can appear at most once in each chunk
    • fgrep -x -n "$line" /dev/null yx* finds where each duplicated line lives; the result is a list of lines of the form yx000005:23:some text
    • sort -n | sed 1d removes the first entry from the list above (that entry is the first occurrence of the line, which must be left alone)
    • IFS=: read -r file nr rest splits yx000005:23:some text into file=yx000005, nr=23, and the rest
    • sed -i -e ${nr}d "$file" removes line $nr from chunk $file
    • cat $(ls yx* | sort) reassembles the chunks; the names need to be sorted so the chunks come back in their original order, which the zero-padded suffixes from split -a 6 guarantee.
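
    As a quick standalone demo of two of the building blocks above (the chunk names and contents here are made up, and this is best run in a scratch directory since the script expects no leftover x*/yx* files):

    # Per-chunk dedup that keeps the original line order.
    printf 'red\nblue\nred\ngreen\n' > x000000
    awk '!x[$0]++' x000000 > yx000000        # yx000000 now holds: red, blue, green

    # Locating a duplicated line; /dev/null forces the file:line:text prefix
    # even when only one chunk matches.
    fgrep -x -n 'red' /dev/null yx000000     # prints: yx000000:1:red

    # Splitting such a match, exactly as the read in the loop does.
    echo 'yx000005:23:some text' | { IFS=: read -r file nr rest; echo "$file $nr $rest"; }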

    This is probably not very fast, but I'd say it should work. Increasing the number of lines in each chunk from 10000 can speed things up, at the expense of using more memory. The operation is O(N^2) in the number of duplicate lines across chunks; with luck, this wouldn't be too large.

    The above assumes GNU sed (for -i). It also assumes there are no files named x* or yx* in the current directory (that's the part that could use some cleanup, perhaps by moving the junk into a directory created by mktemp -d).
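
    For a concrete toy run, assuming the script above were saved as, say, dedup.sh (the script name, input file, and chunk size below are all made up for illustration):

    printf 'red\nblue\nred\ngreen\nblue\n' > colors.txt
    sh dedup.sh colors.txt 2      # 2-line chunks, small enough to exercise the cross-chunk logic
    cat uniq_colors.txt           # prints: red, blue, green (first occurrences, original order)

    A larger second argument (lines per chunk) means fewer chunks and fewer cross-chunk duplicates, at the cost of more awk memory, as noted above.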

    Edit: Second version, after feedback from @EtanReisner:

    #! /bin/sh
    usage ()
    {
        echo "Usage:  ${0##*/} <file> [<lines per chunk>]" >&2
        exit 1
    }
    
    
    if [ $# -lt 1 -o $# -gt 2 -o ! -f "$1" ]; then usage; fi
    if [ "$2" ]; then
        expr "$2" : '[1-9][0-9]*$' >/dev/null || usage
    fi
    
    tdir=$(mktemp -d -p "${TEMP:-.}" "${0##*/}_$$_XXXXXXXX") || exit 1
    dupes=$(mktemp -p "${TEMP:-.}" "${0##*/}_$$_XXXXXXXX") || exit 1
    
    trap 'rm -rf "$tdir" "$dupes"' EXIT HUP INT QUIT TERM
    
    LC_ALL=C
    export LC_ALL
    
    split -l ${2:-10000} -d -a 6 "$1" "${tdir}/x"
    
    ls -1 "$tdir" | while IFS= read -r x; do
        awk '!x[$0]++' "${tdir}/${x}" >"${tdir}/y${x}" && \
        rm -f "${tdir}/$x" || exit 1
    done
    
    find "$tdir" -type f -name 'yx*' | \
        xargs -n 1 cat | \
        sort | \
        uniq -d >"$dupes" || exit 1
    
    find "$tdir" -type f -name 'yx*' -exec fgrep -x -n -f "$dupes" /dev/null {} + | \
        sed 's!^[^:]*/!!' | \
        sort -t: -n -k 1.3,1 -k 2,2 | \
    perl -e '
            while (<STDIN>) {
                chomp;
                m/^(yx\d+):(\d+):(.*)$/o;
                if ($dupes{$3}++)
                    { push @{$del{$1}}, int($2) }
                else
                    { $del{$1} ||= [] }    # keep any deletions already queued for this chunk
            }
            undef %dupes;
    
            chdir $ARGV[0];
    
            for $fn (sort <"yx*">) {
                open $fh, "<", $fn
                    or die qq(open $fn: $!);
                $line = $idx = 0;
                while(<$fh>) {
                    $line++;
                    if ($idx < @{$del{$fn}} and $line == $del{$fn}->[$idx])
                        { $idx++ }
                    else
                        { print }
                }
                close $fh
                    or die qq(close $fn: $!);
                unlink $fn
                    or die qq(remove $fn: $!);
            }
        ' "$tdir" >uniq_"$1" || exit 1
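
    The core of that Perl pass is to stream each chunk once and skip a pre-collected, ascending list of line numbers. A rough standalone sketch of the same idea, using awk on invented file names and line numbers (purely illustrative, not part of the script):

    printf 'a\nb\nc\nd\ne\nf\ng\nh\n' > chunk    # stand-in for one yx chunk
    printf '3\n7\n' > linenos                    # line numbers to drop (example values)
    # First pass (NR==FNR) loads the numbers to delete; the second pass
    # prints only the chunk lines whose line number is not in that set.
    awk 'NR==FNR { del[$0]; next } !(FNR in del)' linenos chunk > chunk.clean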
    
