I have a stupidly large text file (i.e. 40 gigabytes as of today) that I would like to filter for unique lines without sorting the file.
The file ha
I don't have your data (or anything like it) handy, so I can't test this, but here's a proof of concept for you:
$ t='one\ntwo\nthree\none\nfour\nfive\n'
$ printf "$t" | nl -w14 -nrz -s, | sort -t, -k2 -u | sort -n | cut -d, -f2-
one
two
three
four
five
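For clarity, here's what the sample looks like after just the nl stage -- the zero-padded, comma-separated line numbers are what the later sorts and the final cut operate on (output shown as it should appear; same toy data as above):
$ printf "$t" | nl -w14 -nrz -s,
00000000000001,one
00000000000002,two
00000000000003,three
00000000000004,one
00000000000005,four
00000000000006,five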
Our raw data includes one duplicated line. The stages of the pipeline function as follows:

- nl adds line numbers. It's a standard, low-impact unix tool.
- sort, the first time 'round, sorts on the SECOND field -- what would have been the beginning of the line before nl. Adjust this as required for your data.
- sort, the second time, puts things back in the order defined by the nl command.
- cut merely strips off the line numbers. There are multiple ways to do this, but some of them depend on your OS. This one's portable, and works for my example.

Now... For obscenely large files, the sort command will need some additional options. In particular, --buffer-size and --temporary-directory. Read man sort for details about this.
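As a rough sketch of how that might look on a file your size -- the file name, buffer size, and temp directory below are placeholders you'd adjust for your own system:
$ nl -w14 -nrz -s, bigfile.txt \
    | sort -t, -k2 -u --buffer-size=8G --temporary-directory=/mnt/scratch \
    | sort -n --buffer-size=8G --temporary-directory=/mnt/scratch \
    | cut -d, -f2- > bigfile.unique.txt
Make sure whatever you point --temporary-directory at has plenty of free space; sort's intermediate files can approach the size of the input.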
I can't say I expect this to be fast, and I suspect you'll be using a ginormous amount of disk IO, but I don't see why it wouldn't at least work.