split file on Nth occurrence of delimiter

Question

Is there a one-liner to split a text file into pieces / chunks after every Nth occurrence of a delimiter?

Example (the delimiter below is "+"):

entry 1
some more
+
entry 2
some more
even more
+
entry 3
some more
+
entry 4
some more
+
...

There are several million entries, so splitting on every occurrence of delimiter "+" is a bad idea. I want to split on, say, every 50,000th instance of delimiter "+".

Unix commands "split" and "csplit" just don't seem to do this...

Answer 1:

Using awk, you could do this:

awk '/^\+$/ { delim++ } { file = sprintf("chunk%s.txt", int(delim / 50000)); print >> file; }' < input.txt
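Note that the command above increments the counter before the line is written, so every 50,000th "+" becomes the first line of the next chunk rather than the last line of the current one. If you would rather have the delimiter close its own chunk, a minimal sketch is to write first and count afterwards. The variable n, the tiny sample input, and the temporary directory are assumptions made for this demo; use n=50000 and your real file in practice.

```shell
# Demo with a tiny input and n=2 (use n=50000 for the real file).
workdir=$(mktemp -d)
printf '%s\n' 'entry 1' + 'entry 2' + 'entry 3' + > "$workdir/input.txt"

(
    cd "$workdir" && awk -v n=2 '
        {
            file = sprintf("chunk%d.txt", int(delim / n))  # chunk for the current entry
            print > file                                    # write the line first ...
        }
        /^\+$/ {
            # ... then count the delimiter, so it stays at the end of its own chunk
            if (++delim % n == 0) close(file)               # chunk finished: release the handle
        }
    ' input.txt
)

cat "$workdir/chunk0.txt"   # entries 1 and 2, each followed by "+"
cat "$workdir/chunk1.txt"   # entry 3 followed by "+"
```

Closing each finished chunk is optional for a few dozen output files, but it keeps the number of open file handles constant on awk implementations with a low limit.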

Update:

To drop the delimiter at each split point (every 50,000th one), try this:

awk '/^\+$/ { if(++delim % 50000 == 0) { next } } { file = sprintf("chunk%s.txt", int(delim / 50000)); print > file; }' < input.txt

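The next keyword causes awk to stop processing rules for the current record and advance to the next one (the next line). I also changed >> to >, since if you run it more than once you probably don't want to append to the old chunk files. Within a single run, awk truncates a file only the first time it is opened with >; later prints to the same file are appended.

Answer 2: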
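The effect of > inside awk is easy to check on a throwaway file. In the sketch below, the file name out.txt and the awk variable out are made up for the demo: a stale line left over from an "earlier run" is discarded, yet both lines printed during the current run survive.

```shell
tmp=$(mktemp -d)
printf 'stale line\n' > "$tmp/out.txt"   # leftover from a previous run

# Two records, both printed with ">" to the same file in one awk run.
printf 'a\nb\n' | awk -v out="$tmp/out.txt" '{ print > out }'

cat "$tmp/out.txt"   # the stale line is gone; "a" and "b" are both present
```

With >> instead of >, the stale line would still be at the top of the file, which is why rerunning the first command keeps growing the old chunks.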

It isn't very hard to do in Perl if you can't find a suitable alternative (and it will perform pretty well):

#!/usr/bin/env perl
use strict;
use warnings;

# Configuration items - could be set by argument handling
my $prefix = "rs.";     # File prefix
my $number = 1;         # First file number
my $width  = 4;         # Number of digits to use in file name
my $rx     = qr/^\+$/;  # Match regex
my $limit  = 3;         # 50,000 in real case
my $quiet  = 0;         # Set to 1 to suppress file names

sub next_file
{
    my $name = sprintf("%s%.*d", $prefix, $width, $number++);
    open my $fh, '>', $name or die "Failed to open $name for writing: $!";
    print "$name\n" unless $quiet;
    return $fh;
}

my $fh = next_file;  # Output file handle
my $counter = 0;     # Match counter
while (<>)
{
    print $fh $_;
    $counter++ if (m/$rx/);
    if ($counter >= $limit)
    {
        close $fh;
        $fh = next_file;
        $counter = 0;
    }
}
close $fh;

That's far from being a one-liner; I'm not sure whether that's a merit or not. The items that should be configured are grouped together and could be set via command-line options, for example. You could end up with an empty file at the end (when the last line of the input is the delimiter that completes a chunk); you could spot that and remove it if necessary. To do so you'd need a second counter: the existing one is a 'match counter', but you'd also need a line counter for the current file, and if the line counter was zero at the end you'd remove the last file. You'd also need to keep the file name to be able to remove it. Fiddly, but not difficult.

Given the input (basically two copies of your sample data), the output from repsplit.pl (repeat split) was as shown:
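If changing the script feels like too much trouble, one alternative (a different technique from the line counter described above) is to clean up after the run from the shell. This is only a sketch: it assumes the chunks use the rs. prefix from the script, and the two files it creates are stand-ins for real output.

```shell
outdir=$(mktemp -d)
printf 'entry 1\n+\n' > "$outdir/rs.0001"   # stand-in for a real chunk
: > "$outdir/rs.0002"                       # stand-in for the empty trailing file

for f in "$outdir"/rs.*; do
    # -s is true only for files with a size greater than zero
    [ -s "$f" ] || rm -- "$f"
done

ls "$outdir"   # only rs.0001 is left
```

At most one chunk can be empty (the last one), so the loop does a little more work than strictly needed, but it avoids having to know which file name was generated last.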

$ perl repsplit.pl data
rs.0001
rs.0002
rs.0003
$ cat data
entry 1
some more
+
entry 2
some more
even more
+
entry 3
some more
+
entry 4
some more
+
entry 1
some more
+
entry 2
some more
even more
+
entry 3
some more
+
entry 4
some more
+
$ cat rs.0001
entry 1
some more
+
entry 2
some more
even more
+
entry 3
some more
+
$ cat rs.0002
entry 4
some more
+
entry 1
some more
+
entry 2
some more
even more
+
$ cat rs.0003
entry 3
some more
+
entry 4
some more
+
$
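For a file with millions of entries you can't eyeball the result like this, so a quick sanity check may help. The sketch below assumes the chunks are named rs.* and that the delimiters were kept in the output; the three small files are stand-ins created only for the demo.

```shell
chk=$(mktemp -d)
# Stand-ins for the input and for two chunks produced by a splitter.
printf '%s\n' a + b + c + > "$chk/data"
printf '%s\n' a + b +     > "$chk/rs.0001"
printf '%s\n' c +         > "$chk/rs.0002"

# 1. The chunks, concatenated in name order, should reproduce the input byte for byte.
cat "$chk"/rs.* | cmp -s - "$chk/data" && echo "chunks match input"

# 2. Each chunk should hold at most `limit` delimiter lines (2 in this toy case).
for f in "$chk"/rs.*; do
    printf '%s %s\n' "${f##*/}" "$(grep -c -x -F '+' "$f")"
done
```

The first check only works when no delimiter lines are dropped; for the awk variant that removes the delimiter at each split point, the per-chunk count is the more useful test.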

Answer 3:

Using Perl with + as the input record separator, in a concise "one-liner":

If you'd like to write each chunk to newprefix.part.$c, as stated in your comment:

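$ limit=50000 perl -053 -Mautodie -ne '
    BEGIN { $/ .= "\n" }                  # -053 sets $/ to "+"; extend it to "+\n"
    if (($. - 1) % $ENV{limit} == 0) {    # first entry of the next chunk
        close $fh if $fh;
        open $fh, ">", "newprefix.part." . ++$c;
    }
    print $fh $_;                         # the entry, including its trailing "+" line
' file.txt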

$ ls -l newprefix.part.*

Doc

man ascii
perldoc perlrun
perldoc perlvar

Source: https://stackoverflow.com/questions/15559979/split-file-on-nth-occurrence-of-delimiter

Tags: file, unix, split, chunking