Question
I am currently working with some files that I parse with a Scala app. The problem is that the files are too large, so they always end up throwing a heap-space exception (and I've tried with the maximum heap size I can set, still no use).
Now, the files look like this:
This is
one paragraph
for Scala
to parse
This is
another paragraph
for Scala
to parse
Yet another
paragraph
And so on. Basically I would like to take all these files and split them into 10 or 20 pieces each, but I have to be sure a paragraph is not split in half in the results. Is there any way of doing this?
Thank you!
Answer 1:
Here's an awk script that will break up input files into batch_size blocks (each output file ends with one leftover record-separating blank line). Put this into a file and make it executable:
#!/usr/bin/awk -f
BEGIN {RS=""; ORS="\n\n"; last_f=""; batch_size=20}
# perform setup whenever the filename changes
FILENAME!=last_f {r_per_f=calc_r_per_f(); fnum=0; incr_out(); last_f=FILENAME}
# write a record to the current output file
{print $0 > out}
# after a full batch, change the file name
(FNR%r_per_f)==0 {incr_out()}
# function to roll the file name
function incr_out() {close(out); fnum++; out=FILENAME"_"fnum".out"}
# function to get the number of records per file
function calc_r_per_f(    cmd, rcnt, r) {
    cmd=sprintf("grep \"^$\" %s | wc -l", FILENAME)
    cmd | getline rcnt
    close(cmd)
    # never return 0, to avoid a modulo-by-zero on small files
    r=int(rcnt/batch_size)
    return (r > 0 ? r : 1)
}
You would change the batch_size element in the BEGIN block to adjust the number of output files per input file, and the output file name itself can be altered by changing the out= assignment in incr_out().
If you put it into a file called awko, you would run it like awko data1 data2 and get files like data2_7.out, for example. Of course the output names are more horrible than that if your input file names have extensions etc.
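For instance, a minimal run, assuming the script above was saved as awko in the current directory (data1 and data2 are placeholder input names):
chmod +x awko
./awko data1 data2
# each input is split into about batch_size (here 20) pieces
ls data1_*.out data2_*.out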
Answer 2:
csplit file.txt /^$/ {*}
csplit splits a file wherever the specified pattern matches.
/^$/ matches empty lines.
{*} repeats the previous pattern indefinitely.
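Note that this creates one piece per paragraph rather than a fixed number of pieces. If you have GNU csplit, a couple of extra flags give friendlier names and drop empty pieces (a minimal sketch; the "part" prefix is arbitrary):
# -f sets the output file prefix, -z removes pieces that end up empty
# (e.g. from consecutive blank lines or a leading blank line)
csplit -z -f part file.txt '/^$/' '{*}'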
Answer 3:
To split every 3 paragraphs:
awk 'BEGIN{nParMax=3;npar=0;nFile=0}
/^$/{npar++;if(npar==nParMax){nFile++;npar=0;next}}
{print $0 > ("foo." nFile)}' foo.orig
To split every 10 lines, breaking only at a blank line so no paragraph is cut in half:
awk 'BEGIN{nLineMax=10;nline=0;nFile=0}
/^$/{if(nline>=nLineMax){nFile++;nline=0;next}}
{nline++;print $0 > ("foo." nFile)}' foo.orig
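The same logic with the threshold passed in from the shell, so the script itself doesn't need editing (an assumed variant; the value 5 is arbitrary):
awk -v nParMax=5 'BEGIN{npar=0;nFile=0}
/^$/{npar++;if(npar==nParMax){nFile++;npar=0;next}}
{print $0 > ("foo." nFile)}' foo.orig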
Answer 4:
You can use the "split" command, but since you want to split on paragraphs, you can use this kind of script:
awk -v RS="\n\n" 'BEGIN {n=1}{print $0 > ("file" n++ ".txt")}' yourfile.txt
That puts each paragraph in its own file, named "file1.txt", "file2.txt", and so on...
To increment "n" only every "N" paragraphs instead, you can do:
awk -v RS="\n\n" -v ORS="\n\n" 'BEGIN{n=1; i=0; nbp=100}{if (++i > nbp) {i=1; n++} print $0 > ("file" n ".txt")}' yourfile.txt
Just change the "nbp" value to set the number of paragraphs per file.
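To end up with roughly 10 files, as the question asks, you can count the paragraphs first and derive "nbp" from that (a sketch under the same RS convention; "total" is just a shell variable introduced here):
# count the paragraphs, then take the ceiling of total/10
total=$(awk -v RS="\n\n" 'END{print NR}' yourfile.txt)
nbp=$(( (total + 9) / 10 ))
awk -v RS="\n\n" -v ORS="\n\n" -v nbp="$nbp" 'BEGIN{n=1}{if (++i > nbp) {i=1; n++} print $0 > ("file" n ".txt")}' yourfile.txt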
Answer 5:
To split a file of X paragraphs into n (10 below) files, where X is some number greater than or equal to n, would be:
awk -v RS= -v ORS='\n\n' -v n=10 '
# first pass: just count the paragraphs
NR==FNR { totParas=NR; next }
# second pass: open the next output file whenever the current one is full,
# re-sizing the batches so the remaining paragraphs fill the remaining files
FNR==1 || parasDone==parasPerFile {
    close(out)
    out = FILENAME "_out" (++c)
    parasLeft = totParas - (FNR - 1)
    filesLeft = n - c + 1
    parasPerFile = int(parasLeft/filesLeft) + (parasLeft%filesLeft ? 1 : 0)
    parasDone = 0
}
{ parasDone++; print > out }
' file file
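To check that no paragraph was broken across the pieces, you can compare paragraph counts before and after (a minimal check, assuming the file_out* names produced above):
# paragraph count of the original
awk -v RS= 'END{print NR}' file
# paragraph count across all pieces; the two numbers should match
cat file_out* | awk -v RS= 'END{print NR}'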
Source: https://stackoverflow.com/questions/22674245/bash-split-a-file-in-linux-in-10-pieces-only-by-blank-lines