I have a stupidly large text file (i.e. 40 gigabytes as of today) that I would like to filter for unique lines without sorting the file.
Assuming you can sort the file in the first place (i.e. that you can get sort file to work), then I think something like this might work (it depends on whether a large awk script is better than a large awk array in terms of memory usage, etc.).
sort file | uniq -dc | awk '{gsub("\"", "\\\"", $0); print "$0==\""substr($0, index($0, $1) + length($1) + 1)"\"{x["NR"]++; if (x["NR"]>1){next}}"} END{print 7}' > dedupe.awk
awk -f dedupe.awk file
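For comparison, the in-memory approach this trades against is the classic one-liner:

awk '!x[$0]++' file

That stores every unique line in a single awk array, so its memory use grows with the number of unique lines, which is likely too much for a 40 gigabyte file. The two-pass version above only embeds the duplicated lines in the generated script, which should be a much smaller set for most data.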
The dedupe.awk generator, run on a test input file like:
line 1
line 2
line 3
line 2
line 2
line 3
line 4
line 5
line 6
creates an awk script of:
$0=="line 2"{x[1]++; if (x[1]>1){next}}
$0=="line 3"{x[2]++; if (x[2]>1){next}}
7
and, when run as awk -f dedupe.awk file, outputs:
line 1
line 2
line 3
line 4
line 5
line 6
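For reference, the intermediate sort file | uniq -dc output for that input is (exact count padding varies by uniq implementation):

      3 line 2
      2 line 3

i.e. only the duplicated lines, each prefixed by its occurrence count; the index/substr expression in the generator strips that prefix back off. The trailing 7 is just an always-true pattern with no action, which in awk means "print the current line" (any truthy constant, such as the more conventional 1, would work). Annotated (comments added here for explanation; the generator emits only the bare script):

$0=="line 2"{x[1]++; if (x[1]>1){next}}  # known duplicate: skip all but its first occurrence
$0=="line 3"{x[2]++; if (x[2]>1){next}}
7                                        # truthy pattern: print every line that got this far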
If the size of the awk script itself is a problem (probably unlikely), you could cut it down by using a flag variable instead of next, something like:
sort file | uniq -dc | awk 'BEGIN{print "{f=1}"} {gsub("\"", "\\\"", $0); print "$0==\""substr($0, index($0, $1) + length($1) + 1)"\"{x["NR"]++;f=(x["NR"]<=1)}"} END{print "f"}'
which cuts seven characters off each line (six if you remove the space from the original too) and generates:
{f=1}
$0=="line 2"{x[1]++;f=(x[1]<=1)}
$0=="line 3"{x[2]++;f=(x[2]<=1)}
f
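Here the leading {f=1} rule runs for every input line and optimistically marks it for printing, each duplicate check resets f to whether this is the line's first occurrence, and the bare f at the end is a pattern that prints the line only if the flag survived. Annotated (comments added here for explanation):

{f=1}                              # default: assume we keep this line
$0=="line 2"{x[1]++;f=(x[1]<=1)}   # duplicate: keep only its first occurrence
$0=="line 3"{x[2]++;f=(x[2]<=1)}
f                                  # truthy only when the flag survived: print the line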
This version will probably run slower, though, because it no longer short-circuits with next: every duplicate check has to be evaluated against every input line.
If the runtime of the awk script is too great, it might even be possible to improve it by sorting the duplicate checks by match count, so the most frequently occurring duplicates are tested first (though whether that matters is going to be fairly data dependent).
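A minimal sketch of that idea (untested): insert a reverse numeric sort between uniq and the generator so the most frequent duplicates end up first in dedupe.awk, letting the next short-circuit fire as early as possible on average:

sort file | uniq -dc | sort -rn | awk '{gsub("\"", "\\\"", $0); print "$0==\""substr($0, index($0, $1) + length($1) + 1)"\"{x["NR"]++; if (x["NR"]>1){next}}"} END{print 7}' > dedupe.awk

Note this can only help the first (next-based) variant; the flag version evaluates every check for every line regardless of rule order.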