How to split a file and keep the first line in each of the pieces?

-上瘾入骨i 2020-12-07 18:28

Given: One big text-data file (e.g. CSV format) with a 'special' first line (e.g., field names).

Wanted: An equivalent of the cor

12 answers
  • 2020-12-07 18:59

    I really liked Rob and Dennis' versions, so much so that I wanted to improve them.

    Here's my version:

    in_file=$1
    # Send every line except the first to split, in 100,000-line chunks
    awk 'NR != 1' "$in_file" | split -d -a 5 -l 100000 - "${in_file}_"
    for file in "${in_file}_"*
    do
        tmp_file=$(mktemp "$in_file.XXXXXX") # Create a safer temp file
        head -n 1 "$in_file" | cat - "$file" > "$tmp_file" # Prepend the header from the main file to the piece
        mv -f "$tmp_file" "$file" # Replace the header-less piece with the header-containing one
    done
    

    Differences:

    1. in_file is the file argument you want to split maintaining headers
    2. Use awk instead of tail to skip the header (claimed to perform better; measure on your own data before relying on this)
    3. split into 100,000 line files instead of 4
    4. Split file name will be input file name appended with an underscore and numbers (up to 99999 - from the "-d -a 5" split argument)
    5. Use mktemp to safely handle temporary files
    6. Use single head | cat line instead of two lines
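
For comparison, the same result can be produced in a single pass with awk alone, with no temporary files and no second read of the input. This is a different technique from the script above, not a rewrite of it. The file name data.csv, the chunk_ prefix and the 3-line chunk size are made-up values for the demonstration; run it in an empty scratch directory.

```shell
# Demo input: a header plus seven data rows (made-up values).
printf '%s\n' 'id,name' '1,a' '2,b' '3,c' '4,d' '5,e' '6,f' '7,g' > data.csv

# Single pass: remember the header, then start a new chunk every `lines` rows
# and write the header at the top of each one.
awk -v lines=3 -v prefix='chunk_' '
    NR == 1 { header = $0; next }          # keep the first line, do not emit it yet
    (NR - 2) % lines == 0 {                # first data row of a new chunk
        if (out != "") close(out)          # avoid running out of open files
        out = sprintf("%s%05d", prefix, n++)
        print header > out
    }
    { print > out }
' data.csv

# Show how many lines ended up in each piece.
wc -l chunk_*
```

The trade-off is that awk does the work split would otherwise do, which may or may not be faster on a large file.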
  • 2020-12-07 19:00

    This is robhruska's script cleaned up a bit:

    tail -n +2 file.txt | split -l 4 - split_
    for file in split_*
    do
        head -n 1 file.txt > tmp_file
        cat "$file" >> tmp_file
        mv -f tmp_file "$file"
    done
    

    I removed wc, cut, ls and echo in the places where they're unnecessary. I changed some of the filenames to make them a little more meaningful. I broke it out onto multiple lines only to make it easier to read.

    If you want to get fancy, you could use mktemp or tempfile to create a temporary filename instead of using a hard coded one.

    Edit

    Using GNU split it's possible to do this:

    split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }; export -f split_filter; tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_
    

    Broken out for readability:

    split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }
    export -f split_filter
    tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_
    

    When --filter is specified, split runs the command through the shell for each output file and sets the variable FILE, in the command's environment, to the filename. The command here is a function, so it must be exported with export -f, which is a bash feature.

    A filter script or function could do any manipulation it wanted to the output contents or even the filename. An example of the latter might be to output to a fixed filename in a variable directory: > "$FILE/data.dat" for example.
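
The "fixed filename in a variable directory" idea could look roughly like the sketch below. It assumes GNU split and passes the filter inline instead of as an exported function, so it does not depend on bash. HEADER_SRC is a hypothetical variable name, and the sample data and 2-line chunk size are made up; run it in an empty scratch directory.

```shell
# Demo input: a header plus five data rows (made-up values).
printf '%s\n' 'id,name' '1,a' '2,b' '3,c' '4,d' '5,e' > file.txt

# HEADER_SRC is a hypothetical variable name; exporting it lets the filter
# find the original file without hard-coding its name.
export HEADER_SRC=file.txt

# For every piece, split sets FILE (split_aa, split_ab, ...). Here FILE is
# used as a directory name and the data always lands in data.dat inside it.
tail -n +2 "$HEADER_SRC" |
    split --lines=2 \
        --filter='mkdir -p "$FILE" && { head -n 1 "$HEADER_SRC"; cat; } > "$FILE/data.dat"' \
        - split_

# List the files that were written.
ls split_*/data.dat
```

The single quotes around the filter matter: they keep the calling shell from expanding $FILE, so the shell that split starts sees the variable instead.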

  • 2020-12-07 19:04

    You could use the new --filter functionality in GNU coreutils split >= 8.13 (2011):

    tail -n +2 FILE.in | split -l 50 - --filter='sh -c "{ head -n1 FILE.in; cat; } > $FILE"'
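
A small variation on the command above, assuming GNU split 8.16 or later for --additional-suffix: because split already hands the filter to a shell, the inner sh -c wrapper can be dropped, and quoting "$FILE" keeps unusual file names intact. FILE.in is the name used in the answer; the part_ prefix, the .csv suffix and the 2-line chunk size are made-up values. Run it in an empty scratch directory.

```shell
# Demo input: a header plus three data rows (made-up values).
printf '%s\n' 'id,name' '1,a' '2,b' '3,c' > FILE.in

# -d gives numeric suffixes (00, 01, ...); --additional-suffix appends .csv,
# so FILE becomes part_00.csv, part_01.csv, ...
tail -n +2 FILE.in |
    split -l 2 -d --additional-suffix=.csv \
        --filter='{ head -n 1 FILE.in; cat; } > "$FILE"' - part_

# Print each piece with its name.
head part_*.csv
```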
    
  • 2020-12-07 19:05

    Use GNU Parallel:

    parallel -a bigfile.csv --header : --pipepart 'cat > {#}'
    

    If you need to run a command on each of the parts, then GNU Parallel can help do that, too:

    parallel -a bigfile.csv --header : --pipepart my_program_reading_from_stdin
    parallel -a bigfile.csv --header : --pipepart --fifo my_program_reading_from_fifo {}
    parallel -a bigfile.csv --header : --pipepart --cat my_program_reading_from_a_file {}
    

    If you want to split into 2 parts per CPU core (e.g. 24 cores = 48 equal sized parts):

    parallel --block -2 -a bigfile.csv --header : --pipepart my_program_reading_from_stdin
    

    If you want to split into 10 MB blocks:

    parallel --block 10M -a bigfile.csv --header : --pipepart my_program_reading_from_stdin
    
  • 2020-12-07 19:06

    A simple but maybe not as elegant way: Cut off the header beforehand, split the file, and then rejoin the header on each file with cat, or with whatever file is reading it in. So something like:

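    1. head -n 1 file.txt > header.txt
    2. tail -n +2 file.txt | split -l 1000 - f_    (pick the line count you need; 1000 is only an example)
    3. cat header.txt f_aa    (repeat for each piece, redirecting or piping to whatever reads it)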
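
The three steps above can be sketched end to end as follows, using only standard tools. The sample data, the body_ prefix, the with_header_ output names and the 2-line chunk size are made-up values; run it in an empty scratch directory.

```shell
# Demo input: a header plus five data rows (made-up values).
printf '%s\n' 'id,name' '1,a' '2,b' '3,c' '4,d' '5,e' > file.txt

head -n 1 file.txt > header.txt                # 1. save the header
tail -n +2 file.txt | split -l 2 - body_       # 2. split only the data rows

# 3. put the header back in front of every piece
for piece in body_*; do
    cat header.txt "$piece" > "with_header_$piece.txt"
    rm -- "$piece"                             # drop the header-less piece
done

# Show the first line of every result.
head -n 1 with_header_body_*.txt
```

If the consumer can read from a pipe, step 3 can skip the extra files and feed `cat header.txt "$piece"` straight into it.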
  • 2020-12-07 19:09

    Inspired by @Arkady's comment on a one-liner.

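    • The MYFILE variable is only there to reduce boilerplate
    • split doesn't print the names of the files it creates, but the --additional-suffix option makes them easy to predict
    • Intermediate files are removed via rm $part (assumes no other files with the same suffix exist)
    • Caveat: -n4 cuts by byte count, so a line can be split across two pieces, and the first piece already contains the original header, so it ends up with the header twice; -n l/4 keeps lines whole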

    MYFILE=mycsv.csv && for part in $(split -n4 --additional-suffix=foo $MYFILE; ls *foo); do cat <(head -n1 $MYFILE) $part > $MYFILE.$part; rm $part; done

    Evidence:

    -rw-rw-r--  1 ec2-user ec2-user  32040108 Jun  1 23:18 mycsv.csv.xaafoo
    -rw-rw-r--  1 ec2-user ec2-user  32040108 Jun  1 23:18 mycsv.csv.xabfoo
    -rw-rw-r--  1 ec2-user ec2-user  32040108 Jun  1 23:18 mycsv.csv.xacfoo
    -rw-rw-r--  1 ec2-user ec2-user  32040110 Jun  1 23:18 mycsv.csv.xadfoo
    

    And of course, run head -2 *foo to confirm the header was added.
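
A line-safe variant of the same idea might look like the sketch below. It assumes GNU split, strips the header before splitting so that no piece receives it twice, and uses -n l/4 so that pieces are cut only at line ends. The quarter_ prefix and the sample rows are made-up values; run it in an empty scratch directory.

```shell
MYFILE=mycsv.csv
# Demo input: a header plus eight data rows (made-up values).
printf '%s\n' 'id,name' '1,a' '2,b' '3,c' '4,d' '5,e' '6,f' '7,g' '8,h' > "$MYFILE"

# Header-less copy of the data; split -n works from the size of a regular file.
body=$(mktemp)
tail -n +2 "$MYFILE" > "$body"

# l/4 = four pieces of roughly equal size, cut only at line ends.
split -n l/4 --additional-suffix=foo "$body" quarter_

# Prepend the header to each piece, then remove the intermediate file.
for part in quarter_*foo; do
    { head -n 1 "$MYFILE"; cat "$part"; } > "$MYFILE.$part"
    rm -- "$part"
done
rm -f -- "$body"

# Show the first two lines of every result.
head -n 2 "$MYFILE".quarter_*foo
```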
