I have a stupidly large text file (i.e. 40 gigabytes as of today) that I would like to filter for unique lines without sorting the file.
The file ha
I'd do it like this:
#! /bin/sh
usage ()
{
echo "Usage: ${0##*/} []" >&2
exit 1
}
if [ $# -lt 1 -o $# -gt 2 -o ! -f "$1" ]; then usage; fi
if [ "$2" ]; then
expr "$2" : '[1-9][0-9]*$' >/dev/null || usage
fi
LC_ALL=C
export LC_ALL
split -l ${2:-10000} -d -a 6 "$1"
for x in x*; do
awk '!x[$0]++' "$x" >"y${x}" && rm -f "$x"
done
cat $(sort -n yx*) | sort | uniq -d | \
while IFS= read -r line; do
fgrep -x -n "$line" /dev/null yx* | sort -n | sed 1d | \
while IFS=: read -r file nr rest; do
sed -i -d ${nr}d "$file"
done
done
cat $(sort -n yx*) >uniq_"$1" && rm -f yx*
(proof of concept; needs more polishing before being used in production).
What's going on here:
split splits the file in chunks of 10000 lines (configurable); the chunks are named x000000, x000001, ...awk removes duplicates from each chunk, without messing with the line order; the resulting files are yx000000, yx000001, ... (since awk can't portably do changes in place)cat $(sort -n yx*) | sort | uniq -d reassembles the chunks and finds a list of duplicates; because of the way the chunks were constructed, each duplicated line can appear at most once in each chunkfgrep -x -n "$line" /dev/null yx* finds where each duplicated line lives; the result is a list of lines yx000005:23:some textsort -n | sed 1d removes the first chunk from the list above (this is the first occurrence of the line, and it should be left alone)IFS=: read -r file nr rest splits yx000005:23:some text into file=yx000005, nr=23, and the restsed -i -e ${nr}d "$file" removes line $nr from chunk $filecat $(sort -n yx*) reassembles the chunks; they need to be sorted, to make sure they come in the right order.This is probably not very fast, but I'd say it should work. Increasing the number of lines in each chunk from 10000 can speed things up, at the expense of using more memory. The operation is O(N^2) in the number of duplicate lines across chunks; with luck, this wouldn't be too large.
The above assumes GNU sed (for -i). It also assumes there are no files named x* or yx* in the current directory (that's the part that could use some cleanup, perhaps by moving the junk into a directory created by mktemp -d).
Edit: Second version, after feedback from @EtanReisner:
#! /bin/sh
usage ()
{
echo "Usage: ${0##*/} []" >&2
exit 1
}
if [ $# -lt 1 -o $# -gt 2 -o ! -f "$1" ]; then usage; fi
if [ "$2" ]; then
expr "$2" : '[1-9][0-9]*$' >/dev/null || usage
fi
tdir=$(mktemp -d -p "${TEMP:-.}" "${0##*/}_$$_XXXXXXXX") || exit 1
dupes=$(mktemp -p "${TEMP:-.}" "${0##*/}_$$_XXXXXXXX") || exit 1
trap 'rm -rf "$tdir" "$dupes"' EXIT HUP INT QUIT TERM
LC_ALL=C
export LC_ALL
split -l ${2:-10000} -d -a 6 "$1" "${tdir}/x"
ls -1 "$tdir" | while IFS= read -r x; do
awk '!x[$0]++' "${tdir}/${x}" >"${tdir}/y${x}" && \
rm -f "${tdir}/$x" || exit 1
done
find "$tdir" -type f -name 'yx*' | \
xargs -n 1 cat | \
sort | \
uniq -d >"$dupes" || exit 1
find "$tdir" -type f -name 'yx*' -exec fgrep -x -n -f "$dupes" /dev/null {} + | \
sed 's!.*/!!' | \
sort -t: -n -k 1.3,1 -k 2,2 | \
perl '
while() {
chomp;
m/^(yx\d+):(\d+):(.*)$/o;
if ($dupes{$3}++)
{ push @{$del{$1}}, int($2) }
else
{ $del{$1} = [] }
}
undef %dupes;
chdir $ARGV[0];
for $fn (sort <"yx*">) {
open $fh, "<", $fn
or die qq(open $fn: $!);
$line = $idx = 0;
while(<$fh>) {
$line++;
if ($idx < @{$del{$fn}} and $line == $del{$fn}->[$idx])
{ $idx++ }
else
{ print }
}
close $fh
or die qq(close $fn: $!);
unlink $fn
or die qq(remove $fn: $!);
}
' "$tdir" >uniq_"$1" || exit 1