Given: One big text-data file (e.g. CSV format) with a 'special' first line (e.g., field names).
Wanted: An equivalent of the coreutils split -l command, but with the additional requirement that the header line from the original file appears at the beginning of each of the resulting pieces.
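For concreteness, suppose file.txt looks like this (the file and field names here are made up):

name,age
alice,30
bob,25
carol,41

Splitting into pieces of two data lines each should then produce:

piece_1:
name,age
alice,30
bob,25

piece_2:
name,age
carol,41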
This one-liner will split the big CSV into pieces of 999 records, with the header at the top of each one (so 999 records + 1 header = 1000 rows).
cat bigFile.csv | parallel --header : --pipe -N999 'cat >file_{#}.csv'
Based on Ole Tange's answer. (Regarding Ole's answer: you can't use a line count with --pipepart.)
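Since --pipepart splits by byte ranges rather than counting lines, a variant that trades the exact 999-record count for speed might look like this (a sketch: the 10M block size is an arbitrary choice, and this assumes a GNU parallel recent enough for --header to work with --pipepart):

parallel --header : --pipepart -a bigFile.csv --block 10M 'cat >file_{#}.csv'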
This is a more robust version of Denis Williamson's script. The script creates a lot of temporary files, and it would be a shame if they were left lying around after an incomplete run. So let's add signal trapping (see http://tldp.org/LDP/Bash-Beginners-Guide/html/sect_12_02.html and then http://tldp.org/LDP/abs/html/debugging.html) and remove our temporary files; this is a best practice anyway.
trap 'rm split_* tmp_file ; exit 13' SIGINT SIGTERM SIGQUIT
# strip the header, then split the rest into 4-line pieces named split_aa, split_ab, ...
tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
    # rebuild each piece as header + its data lines
    head -n 1 file.txt > tmp_file
    cat "$file" >> tmp_file
    mv -f tmp_file "$file"
done
Replace '13' with whatever return code you want. Oh, and you should probably be using mktemp anyway (as some have already suggested), so go ahead and remove 'tmp_file' from the rm in the trap line. See the signal man page for more signals to catch.
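Here is what the mktemp version might look like (a sketch; it keeps the temporary file's name in the trap via its variable instead of dropping it entirely):

tmp_file=$(mktemp) || exit 1
trap 'rm -f split_* "$tmp_file" ; exit 13' SIGINT SIGTERM SIGQUIT
tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
    head -n 1 file.txt > "$tmp_file"
    cat "$file" >> "$tmp_file"
    mv -f "$tmp_file" "$file"
done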
Below is a four-liner that can be used to split bigfile.csv into multiple smaller files while preserving the CSV header. It uses only standard Unix commands (head, split, find, grep, xargs, and sed), so it should work on most *nix systems. It should also work on Windows if you install MinGW-w64 / Git Bash.
csvheader=`head -1 bigfile.csv`
split -d -l10000 bigfile.csv smallfile_
find .|grep smallfile_ | xargs sed -i "1s/^/$csvheader\n/"
sed -i '1d' smallfile_00
Line by line explanation:

1. Capture the header (the first line of bigfile.csv) in the variable csvheader.
2. Split the big file into pieces of 10,000 lines each, named smallfile_00, smallfile_01, ... (-d gives numeric suffixes).
3. Find every smallfile_ piece and use sed to insert the header as its new first line.
4. The first piece (smallfile_00) already began with the original header, so it now contains it twice; delete the duplicated first line.
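One caveat: find .|grep smallfile_ matches that substring anywhere in a path. If that worries you, a tighter but otherwise equivalent form of step 3 (assuming GNU find, for -maxdepth) is:

find . -maxdepth 1 -name 'smallfile_*' -exec sed -i "1s/^/$csvheader\n/" {} +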
You can use [mg]awk:
awk 'NR==1{
    # remember the header; start slice 1 with it
    header=$0;
    count=1;
    print header > "x_" count;
    next
}
!( (NR-1) % 100){
    # every 100th data line starts a new slice, headed by the header
    count++;
    print header > "x_" count;
}
{
    print $0 > "x_" count
}' file
Here, 100 is the number of lines in each slice. This approach doesn't require temporary files and can be put on a single line.
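If you'd rather not edit the script to change the slice size, a small variant (a sketch; the variable name n is my choice) passes it in with awk -v, and parenthesizes the redirection targets, which some awks insist on:

awk -v n=100 'NR==1{header=$0; count=1; print header > ("x_" count); next}
!((NR-1) % n){count++; print header > ("x_" count)}
{print $0 > ("x_" count)}' file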
I'm a novice when it comes to Bash-fu, but I was able to concoct this two-command monstrosity. I'm sure there are more elegant solutions.
$> tail -n +2 file.txt | split -l 4
$> for file in xa*; do echo "`head -1 file.txt`" > tmp; cat "$file" >> tmp; mv -f tmp "$file"; done
This assumes your input file is file.txt, that you're not using the prefix argument to split, and that you're working in a directory that doesn't have any other files matching split's default xa* output names. Also, replace the '4' with your desired split line size.
I liked marco's awk version and adapted it into a simplified one-liner where you can easily make the split fraction as granular as you want:
awk 'NR==1{print $0 > FILENAME ".split1"; print $0 > FILENAME ".split2";} NR>1{if (NR % 10 > 5) print $0 >> FILENAME ".split1"; else print $0 >> FILENAME ".split2"}' file
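For example, changing the modulus test gives other fractions: with NR % 10 == 0, roughly one line in ten lands in .split2, for a 90/10 split (a sketch; I've parenthesized the redirection targets here, which some awks require):

awk 'NR==1{print $0 > (FILENAME ".split1"); print $0 > (FILENAME ".split2")} NR>1{if (NR % 10 == 0) print $0 >> (FILENAME ".split2"); else print $0 >> (FILENAME ".split1")}' file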