Advanced `uniq` with “unique part regex”

不想你离开。 submitted on 2019-12-01 23:50:00

Question


uniq is a tool that enables one to filter the lines of a file so that only unique lines are shown. uniq has some support for specifying when two lines are "equivalent", but the options are limited.

I'm looking for a tool (or an extension of uniq) that accepts a regex. If the captured group is the same for two lines, then the two lines are considered "equivalent". Only the first matching line is returned for each equivalence class.

Example:

file.dat:

foo!bar!baz
!baz!quix
!bar!foobar
ID!baz!

Using grep -P '(!\w+!)' -o, one can extract the "unique parts":

!bar!
!baz!
!bar!
!baz!

This means that the first line is considered to be "equivalent" with the third and the second with the fourth. Thus only the first and the second are printed (the third and fourth are ignored).

Then uniq '(!\w+!)' < file.dat should return:

foo!bar!baz
!baz!quix

Answer 1:


This does not use uniq, but with GNU awk (gawk) you can get the results you want:

awk -v re='![[:alnum:]]+!' 'match($0, re, a) && !(a[0] in p) {p[a[0]]; print}' file
foo!bar!baz
!baz!quix
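  • The regex is passed in through a command-line variable: -v re=...
  • match($0, re, a) matches the regex against each line and stores the matched text in the array a; a[0] holds the whole match. The three-argument form of match is a gawk extension.
  • A line is printed only when the match succeeds and the matched text is not yet a key of the associative array p; the matched text is then recorded in p.
  • The effect is uniq with regex support.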
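
The three-argument match used above is specific to gawk. As a rough sketch of the same idea for other awk implementations, the two-argument match together with the RSTART and RLENGTH variables should do; the function name uniq_re is made up for illustration and is not an existing tool.

```shell
# uniq_re: hypothetical helper, not part of uniq or awk.
# Usage: uniq_re 'ERE' [FILE...]   (reads stdin when no file is given)
uniq_re() {
    re=$1
    shift
    # Note: awk interprets backslash escapes in values passed with -v.
    awk -v re="$re" '
        # Two-argument match() sets RSTART and RLENGTH on success.
        match($0, re) {
            key = substr($0, RSTART, RLENGTH)   # text matched by the regex
            if (!(key in seen)) {               # first line with this key
                seen[key]                       # remember the key
                print
            }
        }
    ' "$@"
}
```

With the file.dat from the question, uniq_re '![[:alnum:]]+!' file.dat should print the first two lines only. Lines that do not match the regex are dropped, as in the gawk version.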



Answer 2:


Here's a simple Perl script that will do the work:

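#!/usr/bin/env perl
use strict;
use warnings;

# The regex is the first argument; it should contain a capture group,
# because $1 is used as the key below.
my $re = qr($ARGV[0]);

my %matches;    # captured text that has already been seen
while(<STDIN>) {
    next if $_ !~ $re;          # skip lines that do not match
    print if !$matches{$1};     # first line of this equivalence class
    $matches{$1} = 1;
}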

Usage:

$ ./uniq.pl '(!\w+!)' < file.dat
foo!bar!baz
!baz!quix

Here, I've used $1 to match on the first extracted group, but you can replace it with $& to use the whole pattern match.
This script will filter out lines that don't match the regex, but you can adjust it if you need a different behavior.




Answer 3:


You can do this with just grep and sort:

DATAFILE=file.dat

for match in $(grep -P '(!\w+!)' -o "$DATAFILE" | sort -u); do 
  grep -m1 "$match" "$DATAFILE";
done

Outputs:

foo!bar!baz
!baz!quix
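
Note that sort -u orders the keys alphabetically, so the loop above prints lines in key order rather than file order. A variation that keeps first-appearance order might look like the sketch below. The function name first_per_key is hypothetical, and it uses grep -E with an explicit character class instead of grep -P. The grep options -o and -m are extensions found in GNU and BSD grep. Because the second grep looks for the key anywhere in a line, the result can differ from the awk and Perl answers when a line contains several keys.

```shell
# first_per_key: hypothetical helper.
# $1: extended regex for the "unique part"; $2: input file
first_per_key() {
    # Extract every match, one per line.
    grep -oE -- "$1" "$2" |
        # Drop repeated keys but keep the order in which they first appear.
        awk '!seen[$0]++' |
        while IFS= read -r key; do
            # Print the first line that contains the key as a literal string.
            grep -m1 -F -- "$key" "$2"
        done
}
```

Called as first_per_key '![[:alnum:]_]+!' file.dat, this should again print the first two lines of file.dat.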


Source: https://stackoverflow.com/questions/26633425/advanced-uniq-with-unique-part-regex
