`uniq` without sorting an immense text file?


我在风中等你 2020-12-18 07:01

I have a stupidly large text file (i.e. 40 gigabytes as of today) that I would like to filter for unique lines without sorting the file.

The file ha

6 answers

悲哀的现实 (OP)

2020-12-18 07:29

I'd do it like this:

#! /bin/sh
usage ()
{
    echo "Usage:  ${0##*/} <file> [<lines>]" >&2
    exit 1
}


if [ $# -lt 1 -o $# -gt 2 -o ! -f "$1" ]; then usage; fi
if [ "$2" ]; then
    expr "$2" : '[1-9][0-9]*$' >/dev/null || usage
fi

LC_ALL=C
export LC_ALL

split -l ${2:-10000} -d -a 6 "$1"

for x in x*; do
    awk '!x[$0]++' "$x" >"y${x}" && rm -f "$x"
done

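# Collect the lines that still occur in more than one chunk, then delete
# every occurrence after the first. The glob yx* expands in chunk order
# because the suffixes are zero-padded and LC_ALL=C is in effect.
cat yx* | sort | uniq -d | \
    while IFS= read -r line; do
        grep -F -x -n -e "$line" /dev/null yx* | sort -n | sed 1d | \
            while IFS=: read -r file nr rest; do
                sed -i -e "${nr}d" "$file"
            done
    done

cat yx* >uniq_"$1" && rm -f yx*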

(proof of concept; needs more polishing before being used in production).

What's going on here:

split cuts the file into chunks of 10000 lines (configurable); the chunks are named x000000, x000001, ...
awk '!x[$0]++' removes duplicates within each chunk without changing the line order; the resulting files are yx000000, yx000001, ... (since awk can't portably do changes in place)
The chunks are then concatenated and piped through sort | uniq -d, which lists every line that occurs in more than one chunk; because each chunk was already deduplicated, such a line can appear at most once in each chunk
A fixed-string, whole-line grep (options -F -x -n; fgrep is the older name for grep -F) finds where each duplicated line lives; the result is a list of lines like yx000005:23:some text. The /dev/null argument forces grep to print the file name even when there is only one chunk
sort -n | sed 1d removes the first entry from the list above (this is the first occurrence of the line, and it should be left alone)
IFS=: read -r file nr rest splits yx000005:23:some text into file=yx000005, nr=23, and the rest
sed -i -e ${nr}d "$file" removes line $nr from chunk $file
The final cat reassembles the chunks; they must be concatenated in chunk order, so that the output keeps the original line order.
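
To make the steps above concrete, here is a small sketch that runs them on two tiny hand-made chunks. The function names dedup_chunk and cross_dupes, the demo directory, and the sample lines are made up for illustration; they are not part of the script above.

```shell
#!/bin/sh
# Sketch only: two toy "chunks" instead of the output of split.
LC_ALL=C
export LC_ALL

# dedup_chunk FILE: print FILE without repeated lines,
# keeping the first occurrence of each line in its original position.
dedup_chunk () {
    awk '!seen[$0]++' "$1"
}

# cross_dupes FILE...: print the lines that occur in more than one
# of the given (already deduplicated) files.
cross_dupes () {
    cat "$@" | sort | uniq -d
}

demo=$(mktemp -d) || exit 1
trap 'rm -rf "$demo"' EXIT

printf '%s\n' b a b c >"$demo/x000000"
printf '%s\n' c d a d >"$demo/x000001"

# Per-chunk pass: x000000 -> yx000000, x000001 -> yx000001.
for f in "$demo"/x*; do
    dedup_chunk "$f" >"$demo/y${f##*/}"
done

# Lines still present in both chunks.
cross_dupes "$demo"/yx*

# Where the line "a" lives, as file:line:text with the directory stripped.
grep -F -x -n -e a /dev/null "$demo"/yx* | sed 's!.*/!!'
```

The last command reports one entry per chunk for the line a. Dropping the first entry and deleting the rest is what the sort -n | sed 1d and sed -i steps do in the full script.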

This is probably not very fast, but I'd say it should work. Increasing the number of lines in each chunk from 10000 can speed things up, at the expense of using more memory. The operation is O(N^2) in the number of duplicate lines across chunks; with luck, this wouldn't be too large.

The above assumes GNU sed (for -i). It also assumes there are no files named x* or yx* in the current directory (that's the part that could use some cleanup, perhaps by moving the junk into a directory created by mktemp -d).

Edit: Second version, after feedback from @EtanReisner:

#! /bin/sh
usage ()
{
    echo "Usage:  ${0##*/} <file> [<lines>]" >&2
    exit 1
}


if [ $# -lt 1 -o $# -gt 2 -o ! -f "$1" ]; then usage; fi
if [ "$2" ]; then
    expr "$2" : '[1-9][0-9]*$' >/dev/null || usage
fi

tdir=$(mktemp -d -p "${TEMP:-.}" "${0##*/}_$$_XXXXXXXX") || exit 1
dupes=$(mktemp -p "${TEMP:-.}" "${0##*/}_$$_XXXXXXXX") || exit 1

trap 'rm -rf "$tdir" "$dupes"' EXIT HUP INT QUIT TERM

LC_ALL=C
export LC_ALL

split -l ${2:-10000} -d -a 6 "$1" "${tdir}/x"

ls -1 "$tdir" | while IFS= read -r x; do
    awk '!x[$0]++' "${tdir}/${x}" >"${tdir}/y${x}" && \
    rm -f "${tdir}/$x" || exit 1
done

find "$tdir" -type f -name 'yx*' | \
    xargs -n 1 cat | \
    sort | \
    uniq -d >"$dupes" || exit 1

find "$tdir" -type f -name 'yx*' -exec grep -F -x -n -f "$dupes" /dev/null {} + | \
    sed 's!.*/!!' | \
    sort -t: -n -k 1.3,1 -k 2,2 | \
    perl -e '
        while(<STDIN>) {
            chomp;
            m/^(yx\d+):(\d+):(.*)$/o;
            if ($dupes{$3}++)
                { push @{$del{$1}}, int($2) }
            else
                # first occurrence: keep deletions already queued for this chunk
                { $del{$1} ||= [] }
        }
        undef %dupes;

        chdir $ARGV[0];

        for $fn (sort <"yx*">) {
            open $fh, "<", $fn
                or die qq(open $fn: $!);
            $line = $idx = 0;
            while(<$fh>) {
                $line++;
                if ($idx < @{$del{$fn}} and $line == $del{$fn}->[$idx])
                    { $idx++ }
                else
                    { print }
            }
            close $fh
                or die qq(close $fn: $!);
            unlink $fn
                or die qq(remove $fn: $!);
        }
    ' "$tdir" >uniq_"$1" || exit 1
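
Before trusting either version on the 40 GB file, it seems worth checking the output on a sample small enough for a plain in-memory awk pass. The sketch below is only that: check_uniq and the sample data are hypothetical names for illustration, and the reference pass keeps every distinct line in memory, so it is not meant for the full file.

```shell
#!/bin/sh
# Sketch only: compare a candidate result against an in-memory reference.
LC_ALL=C
export LC_ALL

# check_uniq ORIGINAL CANDIDATE: succeed if CANDIDATE is ORIGINAL with
# all later repeats removed and the original line order preserved.
check_uniq () {
    awk '!seen[$0]++' "$1" | cmp -s - "$2"
}

sample=$(mktemp) || exit 1
good=$(mktemp) || exit 1
trap 'rm -f "$sample" "$good"' EXIT

printf '%s\n' b a b c c d a d >"$sample"   # input with repeats
printf '%s\n' b a c d >"$good"             # what a correct run should produce

check_uniq "$sample" "$good" && echo "ok: order-preserving unique"
```

In practice the first argument would be a slice of the real input (for example the output of head) and the second the uniq_ file written by the script for that slice.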
