How to split a file and keep the first line in each of the pieces?

-上瘾入骨i

Given: One big text-data file (e.g. CSV format) with a 'special' first line (e.g., field names).

Wanted: An equivalent of the cor

12 answers

忘掉有多难

2020-12-07 18:59
I really liked Rob and Dennis' versions, so much so that I wanted to improve them.

Here's my version:
```
in_file=$1
# Get all lines except the first and split them into 100,000-line chunks
awk 'NR != 1' "$in_file" | split -d -a 5 -l 100000 - "${in_file}_"
for file in "${in_file}_"*
do
    tmp_file=$(mktemp "$in_file.XXXXXX") # Create a safer temp file
    head -n 1 "$in_file" | cat - "$file" > "$tmp_file" # Prepend the header from the main file to the chunk
    mv -f "$tmp_file" "$file" # Replace the headerless chunk with the header-containing file
done
```
Differences:
1. in_file is the file argument you want to split while keeping the header
2. Uses awk instead of tail due to awk having better performance
3. Splits into 100,000-line files instead of 4-line files
4. Each split file is named after the input file, followed by an underscore and a number (up to 99999, from the "-d -a 5" split arguments)
5. Uses mktemp to safely handle temporary files
6. Uses a single head | cat line instead of two lines
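To illustrate point 4, here is a small sketch of the names that `split -d -a 5` produces. The sample file `data.csv`, its contents, and the 4-line chunk size are made up for the demo, and `-d` assumes a split that supports numeric suffixes, such as GNU split.

```shell
# Build a tiny sample file in a scratch directory (hypothetical data).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'id,name\n' > data.csv
seq 1 10 | sed 's/$/,row/' >> data.csv   # 10 data rows: "1,row" ... "10,row"

in_file=data.csv
# Skip the header, then split into 4-line chunks with 5-digit numeric suffixes.
awk 'NR != 1' "$in_file" | split -d -a 5 -l 4 - "${in_file}_"

# Lists data.csv_00000, data.csv_00001 and data.csv_00002
ls "${in_file}_"*
```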
不知归路

2020-12-07 19:00
This is robhruska's script cleaned up a bit:
```
tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
    head -n 1 file.txt > tmp_file
    cat "$file" >> tmp_file
    mv -f tmp_file "$file"
done
```
I removed wc, cut, ls and echo in the places where they're unnecessary. I changed some of the filenames to make them a little more meaningful. I broke it out onto multiple lines only to make it easier to read.

If you want to get fancy, you could use mktemp or tempfile to create a temporary filename instead of using a hard coded one.
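A minimal sketch of that mktemp variant, assuming a small made-up `file.txt` so the example can run on its own. The `tmp.XXXXXX` template is an arbitrary choice; it only needs to avoid matching the `split_*` glob.

```shell
# Build a tiny sample file in a scratch directory (hypothetical data).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'id,name\n' > file.txt
seq 1 10 | sed 's/$/,row/' >> file.txt   # 10 data rows

tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
    tmp_file=$(mktemp ./tmp.XXXXXX)      # unique temp file in the same directory
    head -n 1 file.txt > "$tmp_file"     # header first
    cat "$file" >> "$tmp_file"           # then the chunk contents
    mv -f "$tmp_file" "$file"            # replace the headerless chunk
done
```

One thing to keep in mind: mktemp typically creates files readable and writable by the owner only, so the resulting pieces inherit those permissions unless you chmod them afterwards.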

Edit

Using GNU split it's possible to do this:
```
split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }; export -f split_filter; tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_
```
Broken out for readability:
```
split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }
export -f split_filter
tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_
```
When --filter is specified, split runs the command (a function in this case, which must be exported) for each output file and sets the variable FILE, in the command's environment, to the filename.

A filter script or function could do any manipulation it wanted to the output contents or even the filename. An example of the latter might be to output to a fixed filename in a variable directory: > "$FILE/data.dat" for example.
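As an illustration of manipulating the output contents, here is a rough sketch in which the filter compresses each piece after adding the header. The choice of gzip, the `.gz` naming, and the sample `file.txt` are assumptions made for the demo; `--filter` itself requires GNU split.

```shell
# Build a tiny sample file in a scratch directory (hypothetical data).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'id,name\n' > file.txt
seq 1 10 | sed 's/$/,row/' >> file.txt   # 10 data rows

# split runs the filter through the shell once per piece, with FILE set
# in its environment. The single quotes stop the calling shell from
# expanding $FILE too early.
tail -n +2 file.txt |
    split --lines=4 --filter='{ head -n 1 file.txt; cat; } | gzip > "$FILE.gz"' - split_
```

With a filter in place, split does not create `split_aa` and friends itself; only the `.gz` files written by the filter appear.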
无人共我

2020-12-07 19:04
You could use the new --filter functionality in GNU coreutils split >= 8.13 (2011):
```
tail -n +2 FILE.in | split -l 50 - --filter='sh -c "{ head -n1 FILE.in; cat; } > $FILE"'
```

无人共我

2020-12-07 19:05

Use GNU Parallel:

```
parallel -a bigfile.csv --header : --pipepart 'cat > {#}'
```

If you need to run a command on each of the parts, then GNU Parallel can help do that, too:

```
parallel -a bigfile.csv --header : --pipepart my_program_reading_from_stdin
parallel -a bigfile.csv --header : --pipepart --fifo my_program_reading_from_fifo {}
parallel -a bigfile.csv --header : --pipepart --cat my_program_reading_from_a_file {}
```

If you want to split into 2 parts per CPU core (e.g. 24 cores = 48 equal sized parts):

```
parallel --block -2 -a bigfile.csv --header : --pipepart my_program_reading_from_stdin
```

If you want to split into 10 MB blocks:

```
parallel --block 10M -a bigfile.csv --header : --pipepart my_program_reading_from_stdin
```

终归单人心

2020-12-07 19:06
A simple but maybe not as elegant way: Cut off the header beforehand, split the file, and then rejoin the header on each file with cat, or with whatever file is reading it in. So something like:
1. head -n1 file.txt > header.txt
2. tail -n +2 file.txt | split -l N - f (N being the number of lines per piece)
3. cat header.txt faa (and likewise for each remaining piece)
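The three steps above can be sketched as a short script. The sample data, the chunk size of 4, and the `body_` and `with_header_` names are placeholders chosen for the demo, not part of the original answer.

```shell
# Build a tiny sample file in a scratch directory (hypothetical data).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'id,name\n' > file.txt
seq 1 10 | sed 's/$/,row/' >> file.txt          # 10 data rows

head -n 1 file.txt > header.txt                 # 1. save the header
tail -n +2 file.txt | split -l 4 - body_        # 2. split everything after the header
for piece in body_*                             # 3. rejoin header and piece
do
    cat header.txt "$piece" > "with_header_$piece"
done
```

If the consumer can read from a pipe, step 3 does not need to write files at all: `cat header.txt body_aa | some_program` feeds the header and the piece in one stream.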
北荒

2020-12-07 19:09
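Inspired by @Arkady's comment on a one-liner.
- The MYFILE variable is there simply to reduce boilerplate
- split doesn't print the names of the files it creates, but the --additional-suffix option lets us easily control what to expect
- Intermediate files are removed via rm $part (this assumes no other files with the same suffix)
- Caveat: split -n4 divides the file by byte count, so a row can be cut across two pieces, and the first piece already contains the original header; split -n l/4 keeps rows whole
```
MYFILE=mycsv.csv && for part in $(split -n4 --additional-suffix=foo "$MYFILE"; ls *foo); do cat <(head -n1 "$MYFILE") "$part" > "$MYFILE.$part"; rm "$part"; done
```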

Evidence:
```
-rw-rw-r--  1 ec2-user ec2-user  32040108 Jun  1 23:18 mycsv.csv.xaafoo
-rw-rw-r--  1 ec2-user ec2-user  32040108 Jun  1 23:18 mycsv.csv.xabfoo
-rw-rw-r--  1 ec2-user ec2-user  32040108 Jun  1 23:18 mycsv.csv.xacfoo
-rw-rw-r--  1 ec2-user ec2-user  32040110 Jun  1 23:18 mycsv.csv.xadfoo
```
and of course head -2 *foo to see the header is added.
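For comparison, here is a variant sketch that removes the header before splitting and splits on line boundaries, so that no row is cut and the header appears exactly once per piece. The sample data and the `body.tmp` name are made up for the demo, and `split -n l/4` and `--additional-suffix` assume GNU coreutils.

```shell
# Build a tiny sample file in a scratch directory (hypothetical data).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'id,name\n' > mycsv.csv
seq 1 10 | sed 's/$/,row/' >> mycsv.csv            # 10 data rows

MYFILE=mycsv.csv
tail -n +2 "$MYFILE" > body.tmp                    # data rows only, no header
split -n l/4 --additional-suffix=foo body.tmp      # 4 pieces, rows kept whole
for part in x??foo
do
    { head -n 1 "$MYFILE"; cat "$part"; } > "$MYFILE.$part"   # header + piece
    rm "$part"
done
rm body.tmp
```

The pieces come out as `mycsv.csv.xaafoo` through `mycsv.csv.xadfoo`, as in the listing above, though their sizes will differ slightly because the cut points follow line ends.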