Split file by vector of line numbers

爱⌒轻易说出口 submitted on 2021-02-16 20:10:35

Question


I have a large file, about 10GB. I have a vector of line numbers which I would like to use to split the file. Ideally I would like to accomplish this using command-line utilities. For example:

File:

1 2 3 
4 5 6
7 8 9 
10 11 12 
13 14 15
16 17 18

Vector of line numbers:

2 5

Desired output:

File 1:

1 2 3 

File 2:

4 5 6
7 8 9 
10 11 12 

File 3:

13 14 15
16 17 18

Answer 1:


This might work for you:

csplit -z file 2 5

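or, if you want to split at regular expressions instead (the patterns are matched against line contents, not line numbers):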

csplit -z file /2/ /5/

With the default values, the output files will be named xxnn where nn starts at 00 and is incremented by 1.

N.B. The -z option (--elide-empty-files) suppresses empty output files.
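
As a quick check of the line-number form, the sketch below rebuilds the sample file from the question in a scratch directory and splits it at lines 2 and 5. The scratch directory, the `part.` prefix (set with `-f`) and the `-s` flag that silences the byte counts are my additions, not part of the answer; `-z` assumes GNU csplit.

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)" || exit 1

# Recreate the six-line sample file from the question.
printf '%s\n' '1 2 3' '4 5 6' '7 8 9' '10 11 12' '13 14 15' '16 17 18' > file

vec="2 5"   # line numbers at which a new piece should start

# $vec is left unquoted on purpose so each number becomes its own argument.
csplit -s -z -f part. file $vec

# Show every piece, each preceded by a header naming the file.
head -n 20 part.*
```

This should leave three pieces named part.00, part.01 and part.02, holding lines 1, 2 to 4, and 5 to 6 respectively.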




Answer 2:


Using awk:

$ awk -v v="2 5" '       # space-separated vector of indexes
BEGIN {
    n=split(v,t)         # reshape vector to a hash
    for(i=1;i<=n;i++)
        a[t[i]]
    i=1                  # filename index
}
{
    if(NR in a) {        # current record number is in the vector
        close("file" i)  # close previous file
        i++              # increase filename index
    }
    print > ("file" i)   # output to file
}' file

Sample output:

$ cat file2
4 5 6
7 8 9 
10 11 12 



Answer 3:


Very slightly different from James's and kvantour's solutions: passing the vector to awk as a "file"

vec="2 5"

awk '
    NR == FNR {nr[$1]; next}
    FNR == 1 {filenum = 1; f = FILENAME "." filenum}
    FNR in nr {
        close(f)
        f = FILENAME "." ++filenum
    }
    {print > f}
' <(printf "%s\n" $vec) file
$ ls -l file file.*
-rw-r--r-- 1 glenn glenn 48 Jul 17 10:02 file
-rw-r--r-- 1 glenn glenn  7 Jul 17 10:09 file.1
-rw-r--r-- 1 glenn glenn 23 Jul 17 10:09 file.2
-rw-r--r-- 1 glenn glenn 18 Jul 17 10:09 file.3
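
For a 10GB input it can be worth confirming that no line was lost or duplicated. Because the pieces are contiguous, concatenating them in order should reproduce the input byte for byte. A sketch on the sample data, using the same awk program but with the vector fed on standard input (`-`) rather than through process substitution, so it does not depend on bash; the scratch directory and the final `cmp` check are my additions:

```shell
# Work in a scratch directory and recreate the sample file.
cd "$(mktemp -d)" || exit 1
printf '%s\n' '1 2 3' '4 5 6' '7 8 9' '10 11 12' '13 14 15' '16 17 18' > file
vec="2 5"

# First input ("-", the vector) fills nr[]; second input is the data file.
printf '%s\n' $vec | awk '
    NR == FNR {nr[$1]; next}
    FNR == 1 {filenum = 1; f = FILENAME "." filenum}
    FNR in nr {
        close(f)
        f = FILENAME "." ++filenum
    }
    {print > f}
' - file

# cmp exits 0 and prints nothing when both inputs are identical.
cat file.1 file.2 file.3 | cmp - file && echo "pieces match the original"
```

With ten or more pieces, list the files explicitly or in numeric order, since a plain `file.*` glob would sort file.10 before file.2.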



Answer 4:


Here is a little awk that does the trick for you:

awk -v v="2 5" 'BEGIN{v=" 1 "v" "}
                index(v," "FNR" ") { close(f); f=FILENAME "." (++i) }
                { print > f }' file

This will create files of the form: file.1, file.2, file.3, ...




Answer 5:


Ok, I've gone totally mental this morning, and I came up with a Sed program (with functions, loops, and all) to generate a Sed script to make what you want.

Usage:

  • put the script in a file (e.g. make.sed) and chmod +x it;
  • then use it as the script for this Sed command: sed "$(./make.sed <<< '1 4')" inputfile¹

Note that ./make.sed <<< '1 4' generates the following sed script:

1,1{w file.1
be};1,4{w file.2
be};1,${w file.3
be};:e

¹ Unfortunately I misread the question, so my script takes the line number of the last line of each block that you want to write to a file; your 2 5 therefore has to be changed to 1 4 before being fed to my script.
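
The conversion the footnote describes amounts to subtracting 1 from each start line. A small sketch using shell arithmetic; the variable names `starts` and `ends` are illustrative, and it assumes every entry is greater than 1:

```shell
starts="2 5"   # first line of each new block, as given in the question

ends=""
for n in $starts; do
    # The block before line n ends on line n-1.
    # ${ends:+ } adds a separating space only once ends is non-empty.
    ends="$ends${ends:+ }$((n - 1))"
done

echo "$ends"   # prints: 1 4
```

The result can then be passed to the generator in place of the literal, e.g. sed "$(./make.sed <<< "$ends")" inputfile.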

#!/usr/bin/env -S sed -Ef

###########################################################
# Main
# make a template sed script, in which we only have to increase
# the number of each numbered output file, each of which is marked
# with a trailing \x0
b makeSkeletonAndMarkNumbers
:skeletonMade

# try putting a stencil on the rightmost digit of the first marked number on
# the line and loop, otherwise exit
b stencilLeastDigitOfNextMarkedNumber
:didStencilLeastDigitOfNextMarkedNumber?
t nextNumberStenciled
b exit

# continue processing next number by adding 1
:nextNumberStenciled
b numberAdd1
:numberAdded1

# try putting a stencil on the rightmost digit of the next marked number on
# the line and loop, otherwise we're done with the first marked number, we can
# clean its marker, and we can loop
b stencilNextNumber
:didStencilNextNumber?
t nextNumberStenciled
b removeStencilAndFirstMarker
:removeStencilAndFirstMarkerDone
b stencilLeastDigitOfNextMarkedNumber

###########################################################
# puts a \n on each side of the first digit marked on the right by \x0
:stencilLeastDigitOfNextMarkedNumber
tr
:r
s/([0-9])\x0;/\n\1\n\x0;/1
b didStencilLeastDigitOfNextMarkedNumber?

###########################################################
# makes desired sed script skeleton from space-separated numbers
:makeSkeletonAndMarkNumbers
s/$/ $/
s/([0-9]+|\$) +?/1,\1{w file.0\x0;be};/g
s/$/:e/
b skeletonMade

###########################################################
# moves the stencil to the next number followed by \x0
:stencilNextNumber
trr
:rr
s/\n(.)\n([^\x0]*\x0[^\x0]+)([0-9])\x0/\1\2\n\3\n\x0/
b didStencilNextNumber?

###########################################################
# +1 with carry to last digit on the line enclosed in between two \n characters
:numberAdd1
#i\
#\nbefore the addition:
#l
:digitPlus1
h
s/.*\n([0-9])\n.*/\1/
y/0123456789/1234567890/
G
s/(.)\n(.*)\n.\n/\2\n\1\n/
trrr
:rrr
/[0-9]\n0\n/s/(.)\n0\n/\n\1\n0/
t digitPlus1
# the following line can be problematic for lines starting with number
/[^0-9]\n0\n/s/(.)\n0\n/\n\1\n10/
b numberAdded1

###########################################################
# remove stencil and first marker on line
:removeStencilAndFirstMarker
s/\n(.)\n/\1/
s/\x0//
b removeStencilAndFirstMarkerDone

###########################################################
:exit
# a bit of post-processing: the `w` command has to be followed by the
# filename and then a newline, so we change the appropriate `;`s to `\n`.
s/(\{[^;]+);/\1\n/g


Source: https://stackoverflow.com/questions/62953828/split-file-by-vector-of-line-numbers
