I have a stupidly large text file (i.e. 40 gigabytes as of today) that I would like to filter for unique lines without sorting the file.
The file ha
I don't have your data (or anything like it) handy, so I can't test this, but here's a proof of concept for you:
$ t='one\ntwo\nthree\none\nfour\nfive\n'
$ printf "$t" | nl -w14 -nrz -s, | sort -t, -k2 -u | sort -n | cut -d, -f2-
one
two
three
four
five
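For clarity, here's what the sample looks like after just the nl stage -- the zero-padded, comma-separated line numbers are what the later sorts and the final cut operate on (output shown as it should appear; same toy data as above):
$ printf "$t" | nl -w14 -nrz -s,
00000000000001,one
00000000000002,two
00000000000003,three
00000000000004,one
00000000000005,four
00000000000006,five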
Our raw data includes one duplicated line. The stages of the pipeline function as follows:

- nl adds line numbers. It's a standard, low-impact unix tool.
- sort, the first time 'round, sorts on the SECOND field -- what would have been the beginning of the line before nl. Adjust this as required for your data.
- sort, the second time, puts things back in the order defined by the nl command.
- cut merely strips off the line numbers. There are multiple ways to do this, but some of them depend on your OS. This one's portable, and works for my example.

Now... For obscenely large files, the sort command will need some additional options. In particular, --buffer-size and --temporary-directory. Read man sort for details about this.
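As a rough sketch of how that might look on a file your size -- the file name, buffer size, and temp directory below are placeholders you'd adjust for your own system:
$ nl -w14 -nrz -s, bigfile.txt \
    | sort -t, -k2 -u --buffer-size=8G --temporary-directory=/mnt/scratch \
    | sort -n --buffer-size=8G --temporary-directory=/mnt/scratch \
    | cut -d, -f2- > bigfile.unique.txt
Make sure whatever you point --temporary-directory at has plenty of free space; sort's intermediate files can approach the size of the input.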
I can't say I expect this to be fast, and I suspect you'll be using a ginormous amount of disk IO, but I don't see why it wouldn't at least work.