Question
I am currently working with some files that I parse with a Scala app. The problem is that the files are too large, so they always end up throwing a heap-space exception (and I've tried with the maximum heap size I can set, still no use).
Now, the files look like this:
This is
one paragraph
for Scala
to parse
This is
another paragraph
for Scala
to parse
Yet another
paragraph
And so on. Basically I would like to take all these files and split them into 10 or 20 pieces each, but I have to be sure a paragraph is not split in half in the results. Is there any way of doing this?
Thank you!
Answer 1:
Here's an awk script that will break up input files into batch_size blocks (each output file ends with one leftover record-separating blank line). Put this into a file and make it executable:
#!/usr/bin/awk -f
BEGIN {RS=""; ORS="\n\n"; last_f=""; batch_size=20}
# perform setup whenever the filename changes
FILENAME!=last_f {r_per_f=calc_r_per_f(); fnum=0; incr_out(); last_f=FILENAME}
# write a record to the current output file
{print $0 > out}
# after a full batch, change the file name
(FNR%r_per_f)==0 {incr_out()}
# function to roll the file name
function incr_out() {close(out); fnum++; out=FILENAME"_"fnum".out"}
# function to get the number of records per file
function calc_r_per_f(    cmd, rcnt, r) {
    cmd=sprintf("grep \"^$\" %s | wc -l", FILENAME)
    cmd | getline rcnt
    close(cmd)
    # never return 0, to avoid a modulo-by-zero on small files
    r=int(rcnt/batch_size)
    return (r > 0 ? r : 1)
}
You would change the batch_size element in the BEGIN block to adjust the number of output files per input file, and the output file name itself can be altered by changing the out= assignment in incr_out().
If you put it into a file called awko, you would run it like awko data1 data2 and get files like data2_7.out, for example. Of course the output names are more horrible than that if your input file names have extensions etc.
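For instance, a minimal run, assuming the script above was saved as awko in the current directory (data1 and data2 are placeholder input names):
chmod +x awko
./awko data1 data2
# each input is split into about batch_size (here 20) pieces
ls data1_*.out data2_*.out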
Answer 2:
csplit file.txt /^$/ {*}
csplit splits a file wherever the specified pattern matches.
/^$/ matches empty lines.
{*} repeats the previous pattern indefinitely.
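Note that this creates one piece per paragraph rather than a fixed number of pieces. If you have GNU csplit, a couple of extra flags give friendlier names and drop empty pieces (a minimal sketch; the "part" prefix is arbitrary):
# -f sets the output file prefix, -z removes pieces that end up empty
# (e.g. from consecutive blank lines or a leading blank line)
csplit -z -f part file.txt '/^$/' '{*}'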
Answer 3:
To split every 3 paragraphs:
awk 'BEGIN{nParMax=3;npar=0;nFile=0}
/^$/{npar++;if(npar==nParMax){nFile++;npar=0;next}}
{print $0 > ("foo." nFile)}' foo.orig
To split every 10 lines, breaking only at a blank line so no paragraph is cut in half:
awk 'BEGIN{nLineMax=10;nline=0;nFile=0}
/^$/{if(nline>=nLineMax){nFile++;nline=0;next}}
{nline++;print $0 > ("foo." nFile)}' foo.orig
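The same logic with the threshold passed in from the shell, so the script itself doesn't need editing (an assumed variant; the value 5 is arbitrary):
awk -v nParMax=5 'BEGIN{npar=0;nFile=0}
/^$/{npar++;if(npar==nParMax){nFile++;npar=0;next}}
{print $0 > ("foo." nFile)}' foo.orig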
Answer 4:
You can use the "split" command, but since you want to split on paragraphs, you can use this kind of script:
awk -v RS="\n\n" 'BEGIN {n=1}{print $0 > ("file" n++ ".txt")}' yourfile.txt
That puts each paragraph in its own file, named "file1.txt", "file2.txt", and so on...
To increment "n" only every "N" paragraphs instead, you can do:
awk -v RS="\n\n" -v ORS="\n\n" 'BEGIN{n=1; i=0; nbp=100}{if (++i > nbp) {i=1; n++} print $0 > ("file" n ".txt")}' yourfile.txt
Just change the "nbp" value to set the number of paragraphs per file.
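To end up with roughly 10 files, as the question asks, you can count the paragraphs first and derive "nbp" from that (a sketch under the same RS convention; "total" is just a shell variable introduced here):
# count the paragraphs, then take the ceiling of total/10
total=$(awk -v RS="\n\n" 'END{print NR}' yourfile.txt)
nbp=$(( (total + 9) / 10 ))
awk -v RS="\n\n" -v ORS="\n\n" -v nbp="$nbp" 'BEGIN{n=1}{if (++i > nbp) {i=1; n++} print $0 > ("file" n ".txt")}' yourfile.txt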
Answer 5:
To split a file of X paragraphs into n (10 below) files, where X is some number greater than or equal to n, would be:
awk -v RS= -v ORS='\n\n' -v n=10 '
# first pass: just count the paragraphs
NR==FNR { totParas=NR; next }
# second pass: open the next output file whenever the current one is full,
# re-sizing the batches so the remaining paragraphs fill the remaining files
FNR==1 || parasDone==parasPerFile {
    close(out)
    out = FILENAME "_out" (++c)
    parasLeft = totParas - (FNR - 1)
    filesLeft = n - c + 1
    parasPerFile = int(parasLeft/filesLeft) + (parasLeft%filesLeft ? 1 : 0)
    parasDone = 0
}
{ parasDone++; print > out }
' file file
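To check that no paragraph was broken across the pieces, you can compare paragraph counts before and after (a minimal check, assuming the file_out* names produced above):
# paragraph count of the original
awk -v RS= 'END{print NR}' file
# paragraph count across all pieces; the two numbers should match
cat file_out* | awk -v RS= 'END{print NR}'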
Source: https://stackoverflow.com/questions/22674245/bash-split-a-file-in-linux-in-10-pieces-only-by-blank-lines