How to split a file and keep the first line in each of the pieces?

Asked by 上瘾入骨i on 2020-12-07 18:28

Given: One big text-data file (e.g. CSV format) with a 'special' first line (e.g., field names).

Wanted: An equivalent of the coreutils split command, but with the additional requirement that the header line from the original file appear at the beginning of each of the resulting pieces.
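
For concreteness, a tiny hypothetical illustration of the desired behavior (data.csv and the piece names here are made up):

    $ cat data.csv
    id,name
    1,alice
    2,bob
    3,carol
    $ # after splitting into 2-record pieces, every piece keeps the header:
    $ head -n 1 piece_aa piece_ab
    ==> piece_aa <==
    id,name
    ==> piece_ab <==
    id,name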

12 Answers
  • 2020-12-07 18:47

    This one-liner will split the big CSV into pieces of 999 records each, with the header at the top of every piece (999 records + 1 header = 1000 lines per file):

    cat bigFile.csv | parallel --header : --pipe -N999 'cat >file_{#}.csv'
    

    Based on Ole Tange's answer. (Regarding Ole's answer: you can't use a line count with pipepart.)
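
    A quick way to verify the result (assuming the pieces were written as file_1.csv, file_2.csv, ... as above): every piece should start with the same header, and every complete piece should hold exactly 1000 lines.

    head -n 1 file_*.csv   # first line of each piece; all should match
    wc -l file_*.csv       # each full piece: 999 records + 1 header = 1000 lines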

  • 2020-12-07 18:48

    This is a more robust version of Denis Williamson's script. The script creates a lot of temporary files, and it would be a shame if they were left lying around after an incomplete run. So let's add signal trapping (see http://tldp.org/LDP/Bash-Beginners-Guide/html/sect_12_02.html and then http://tldp.org/LDP/abs/html/debugging.html) and remove our temporary files; this is a best practice anyway.

    trap 'rm -f split_* tmp_file; exit 13' SIGINT SIGTERM SIGQUIT
    tail -n +2 file.txt | split -l 4 - split_
    for file in split_*
    do
        head -n 1 file.txt > tmp_file   # header first
        cat "$file" >> tmp_file         # then this piece's rows
        mv -f tmp_file "$file"          # swap into place
    done
    

    Replace '13' with whatever return code you want. Oh, and you should probably be using mktemp anyway (as some have already suggested), so go ahead and remove 'tmp_file' from the rm in the trap line; a sketch of that variant follows. See the signal man page for more signals to catch.
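
    A minimal sketch of that mktemp variant, keeping the same file.txt input and chunk size of 4 (the temp file path is whatever mktemp hands back, so the trap no longer needs a hard-coded name):

    tmp_file=$(mktemp) || exit 1
    trap 'rm -f split_* "$tmp_file"; exit 13' SIGINT SIGTERM SIGQUIT
    tail -n +2 file.txt | split -l 4 - split_
    for file in split_*
    do
        head -n 1 file.txt > "$tmp_file"   # header first
        cat "$file" >> "$tmp_file"         # then this piece's rows
        mv -f "$tmp_file" "$file"          # swap into place
    done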

  • 2020-12-07 18:52

    Below is a four-liner that can be used to split bigfile.csv into multiple smaller files while preserving the CSV header. It uses only standard utilities (head, split, find, grep, xargs, and sed) that ship with most *nix systems, and it should also work on Windows if you install mingw-w64 / git-bash.

    
    csvheader=$(head -1 bigfile.csv)
    split -d -l 10000 bigfile.csv smallfile_
    find . | grep smallfile_ | xargs sed -i "1s/^/$csvheader\n/"
    sed -i '1d' smallfile_00
    
    

    Line by line explanation:

    1. Capture the header into a variable named csvheader.
    2. Split bigfile.csv into a number of smaller files with the prefix smallfile_.
    3. Find all the small files and insert the csvheader into the FIRST line of each, using xargs and sed -i. Note that the sed expression must be in "double quotes" so the shell expands the variable.
    4. The first file, smallfile_00, will now have redundant headers on lines 1 and 2 (one from the original data, one from the sed insert in step 3); sed -i '1d' removes the redundant one. (A quick sanity check is sketched after this list.)
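
    A quick hypothetical sanity check that every piece ended up with exactly one copy of the header (this assumes $csvheader is still set from step 1):

    for f in smallfile_*; do head -n 1 "$f"; done | sort -u   # expect a single unique line
    grep -Fxc "$csvheader" smallfile_00                       # expect 1 (header appears once)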
  • 2020-12-07 18:59

    You can use [mg]awk:

    awk 'NR==1{
            header=$0; 
            count=1; 
            print header > "x_" count; 
            next 
         } 
    
         !( (NR-1) % 100){
            count++; 
            print header > "x_" count;
         } 
         {
            print $0 > "x_" count
         }' file
    

    100 is the number of data lines in each slice (each output file also gets the header). It doesn't require temp files and can be put on a single line; one such form is sketched below.
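
    For reference, one possible single-line form, with the slice size lifted into an awk variable via -v (a standard awk option); the logic and the x_ output names are the same as above:

    awk -v n=100 'NR==1{h=$0; c=1; print h > "x_" c; next} !((NR-1)%n){c++; print h > "x_" c} {print > "x_" c}' file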

  • 2020-12-07 18:59

    I'm a novice when it comes to Bash-fu, but I was able to concoct this two-command monstrosity. I'm sure there are more elegant solutions.

    $> tail -n +2 file.txt | split -l 4
    $> for file in xa*; do head -n 1 file.txt > tmp; cat "$file" >> tmp; mv -f tmp "$file"; done
    

    This assumes your input file is file.txt, that you're not using the prefix argument to split, and that you're working in a directory that doesn't contain any other files matching split's default xa* output names. Also, replace the '4' with your desired number of lines per piece; a variant with an explicit prefix is sketched below.
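
    If you'd rather use an explicit prefix instead of split's default xa* names, a minimal variation under the same assumptions (part_ is an arbitrary choice):

    $> tail -n +2 file.txt | split -l 4 - part_
    $> for file in part_*; do head -n 1 file.txt > tmp; cat "$file" >> tmp; mv -f tmp "$file"; done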

  • 2020-12-07 18:59

    I liked marco's awk version, and adapted from it a simplified one-liner where you can easily tune the split ratio as finely as you want:

    awk 'NR==1{print $0 > FILENAME ".split1";  print $0 > FILENAME ".split2";} NR>1{if (NR % 10 > 5) print $0 >> FILENAME ".split1"; else print $0 >> FILENAME ".split2"}' file
    
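    As written, NR % 10 > 5 is true for remainders 6 through 9, so roughly 40% of the data rows land in .split1 and 60% in .split2. A hypothetical variant that makes the percentage an explicit awk parameter (p is a made-up name):

    awk -v p=40 '
        NR==1 { f1 = FILENAME ".split1"; f2 = FILENAME ".split2"
                print > f1; print > f2; next }      # header into both pieces
        { print >> (NR % 100 < p ? f1 : f2) }' file # ~p% of the data rows go to .split1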