Question
Given a file containing a structure arranged like the following (with fields separated by SP or HT)
4 5 6 2 9 8 4 8
m d 6 7 9 5 4 g
t 7 4 2 4 2 5 3
h 5 6 2 5 s 3 4
r 5 7 1 2 2 4 1
4 1 9 0 5 6 d f
x c a 2 3 4 5 9
0 0 3 2 1 4 q w
Which AWK program do I need to get the following output?
4 5
m d
t 7
h 5
r 5
4 1
x c
0 0
6 2
6 7
4 2
6 2
7 1
9 0
a 2
3 2
9 8
9 5
4 2
5 s
2 2
5 6
3 4
1 4
4 8
4 g
5 3
3 4
4 1
d f
5 9
q w
Thanks in advance for any and all help.
Postscript
Please bear in mind,
My input file is much larger than the one depicted in this question.
My computer science skills are seriously limited.
This task has been imposed on me.
Answer 1:
awk -v n=4 '
function join(start, end, result, i) {
    for (i=start; i<=end; i++)
        result = result $i (i==end ? ORS : FS)
    return result
}
{
    c=0
    for (i=1; i<NF; i+=n) {
        c++
        col[c] = col[c] join(i, i+n-1)
    }
}
END {
    for (i=1; i<=c; i++)
        printf "%s", col[i]    # the value already ends with a newline
}
' file
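The group width is controlled by n. To reproduce the two-column groups shown in the question, run the same program with n=2; for example, with the program body saved as columnize.awk (presumably the same file used in the benchmarks below):

$ awk -v n=2 -f columnize.awk file
4 5
m d
t 7
...
q w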
The awk info page has a short primer on awk, so read that too.
Benchmarking
Create an input file with roughly a million columns (2**20 = 1,048,576) and 8 rows (as specified by the OP):
#!perl
my $cols = 2**20;    # 1,048,576
my $rows = 8;
my @alphabet = ( 'a'..'z', 0..9 );
my $size = scalar @alphabet;
for ($r = 1; $r <= $rows; $r++) {
    for ($c = 1; $c <= $cols; $c++) {
        my $idx = int rand $size;
        printf "%s ", $alphabet[$idx];
    }
    printf "\n";
}

$ perl createfile.pl > input.file
$ wc input.file
       8  8388608 16777224 input.file

Time the various implementations. I use the fish shell, so the timing output is different from bash's.
my awk:
$ time awk -f columnize.awk -v n=4 input.file > output.file

________________________________________________________
Executed in    3.62 secs    fish           external
   usr time    3.49 secs    0.24 millis    3.49 secs
   sys time    0.11 secs    1.96 millis    0.11 secs

$ wc output.file
 2097152  8388608 16777216 output.file

Timur's perl:
$ time perl -lan columnize.pl input.file > output.file

________________________________________________________
Executed in    3.25 secs    fish           external
   usr time    2.97 secs    0.16 millis    2.97 secs
   sys time    0.27 secs    2.87 millis    0.27 secs

Ravinder's awk:
$ time awk -f columnize.ravinder input.file > output.file

________________________________________________________
Executed in    4.01 secs    fish           external
   usr time    3.84 secs    0.18 millis    3.84 secs
   sys time    0.15 secs    3.75 millis    0.14 secs

kvantour's awk, first version:
$ time awk -f columnize.kvantour -v n=4 input.file > output.file

________________________________________________________
Executed in    3.84 secs      fish            external
   usr time    3.71 secs    166.00 micros    3.71 secs
   sys time    0.11 secs   1326.00 micros    0.11 secs

kvantour's second awk version (Ctrl-C interrupted after a few minutes):
$ time awk -f columnize.kvantour2 -v n=4 input.file > output.file
^C
________________________________________________________
Executed in  260.80 secs    fish           external
   usr time  257.39 secs    0.13 millis  257.39 secs
   sys time    1.68 secs    2.72 millis    1.67 secs

$ wc output.file
    9728   38912   77824 output.file

The $0=a[j] line is pretty expensive, as it has to parse the string into fields each time.

dawg's python:
$ timeout 60s fish -c 'time python3 columnize.py input.file 4 > output.file'
[... 60 seconds later ...]
$ wc output.file
    2049    8196   16392 output.file
Another interesting data point: using different awk implementations. I'm on a Mac with GNU awk and mawk installed via Homebrew.

With many columns, few rows:
$ time gawk -f columnize.awk -v n=4 input.file > output.file

________________________________________________________
Executed in    3.78 secs      fish            external
   usr time    3.62 secs    174.00 micros    3.62 secs
   sys time    0.13 secs   1259.00 micros    0.13 secs

$ time /usr/bin/awk -f columnize.awk -v n=4 input.file > output.file

________________________________________________________
Executed in   17.73 secs    fish           external
   usr time   14.95 secs    0.20 millis   14.95 secs
   sys time    2.72 secs    3.45 millis    2.71 secs

$ time mawk -f columnize.awk -v n=4 input.file > output.file

________________________________________________________
Executed in    2.01 secs       fish           external
   usr time 1892.31 millis    0.11 millis 1892.21 millis
   sys time   95.14 millis    2.17 millis   92.97 millis

With many rows, few columns, this test took over half an hour on a MacBook Pro (6-core Intel CPU, 16 GB RAM):
$ time mawk -f columnize.awk -v n=4 input.file > output.file

________________________________________________________
Executed in   32.30 mins    fish           external
   usr time   23.58 mins    0.15 millis   23.58 mins
   sys time    8.63 mins    2.52 millis    8.63 mins
Answer 2:
Use this Perl script:
perl -lane '
    push @rows, [@F];
    END {
        my $delim = "\t";
        my $cols_per_group = 2;
        my $col_start = 0;
        while ( 1 ) {
            for my $row ( @rows ) {
                print join $delim, @{$row}[ $col_start .. ($col_start + $cols_per_group - 1) ];
            }
            $col_start += $cols_per_group;
            last if ($col_start + $cols_per_group - 1) > $#F;
        }
    }
' in_file > out_file
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
This script reads the file into memory. This is okay for most modern computers and the file sizes in question.
Each line is split on whitespace (use -F'\t' for TAB as delimiter) into array @F. A reference to this array is stored for each line as an element of array @rows. After the whole file has been read, at the end of the script (in the END { ... } block), the contents of the file are printed in groups of columns, with $cols_per_group columns per group. Columns are delimited by $delim.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
Answer 3:
Could you please try the following, written and tested in GNU awk with only the samples shown.
awk '
{
    for (i=1; i<=NF; i+=2) {
        arr[i] = (arr[i] ? arr[i] ORS : "") $i OFS $(i+1)
    }
}
END {
    for (i=1; i<=NF; i+=2) {
        print arr[i]
    }
}' Input_file
Answer 4:
Since we all love awk, here is another one:
awk -v n=2 '{for(i=1;i<=NF;++i) { j=int((i-1)/n); a[j] = a[j] $i (i%n==0 || i==NF ?ORS:OFS) }}
END{for(j=0;j<=int(NF/n);j++) printf "%s", a[j]}'
This will output exactly what is requested by the OP.
How does it work?
Awk performs actions per record/line it reads. For each record, it processes all the fields and appends them to a set of strings stored in an array a. It does this in such a way that a[0] contains the first n columns, a[1] the second set of n columns, and so on. The relation between field number and string index is given by int((i-1)/n).
When creating the strings, we keep track of whether we need to append a field separator OFS or a new line (the record separator ORS). We decide this based on the modulus of the field number and the number of columns per group (i.e. n). Note that we always use ORS when we process the last field.
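As a small illustration (not part of the original answer), the field-to-group mapping int((i-1)/n) can be printed directly for n=2 and an 8-field record:

$ awk 'BEGIN {
    n = 2
    for (i = 1; i <= 8; i++)
        printf "field %d -> a[%d]\n", i, int((i - 1) / n)
}'
field 1 -> a[0]
field 2 -> a[0]
field 3 -> a[1]
field 4 -> a[1]
field 5 -> a[2]
field 6 -> a[2]
field 7 -> a[3]
field 8 -> a[3]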
An alternative approach:
Thanks to a comment by dawg, a flaw in the above code was found: the program scales really badly when moving to large files. The real reason for this is not 100% known, but I assume it comes from constantly rewriting memory with operations such as a[j] = a[j] $i (i%n==0 || i==NF ? ORS : OFS). This can be avoided by buffering the entire file and doing all operations at the end:
awk -v n=2 '{ a[NR] = $0 }
END {
    for (i=1; i<=NF; i+=n)
        for (j=1; j<=NR; ++j) {
            $0 = a[j]
            for (k=0; k<n && (i+k<=NF); ++k)
                printf "%s%s", $(i+k), ((i+k==NF || ((i+k) % n == 0)) ? ORS : OFS)
        }
}' file
Note: the latter is only efficient for a small number of columns. This is because of the constant re-splitting done with $0=a[j]; the splitting takes much more time when there is a large number of fields. The complexity of this approach is O(NF^2 * NR).
A final alternative approach: while the first solution is fast for a large number of columns and a small number of rows, and the second is fast for a small number of columns and a large number of rows, below you find a final version that is not as fast, but is stable and gives the same timing for a file and its transpose.
awk -v n=2 '{
    for (i=1; i<=NF; i+=n) {
        s = ""
        for (k=0; k<n && (i+k<=NF); ++k)
            s = s $(i+k) ((i+k==NF || ((i+k) % n == 0)) ? ORS : OFS)
        a[i,NR] = s
    }
}
END { for (i=1; i<=NF; i+=n) for (j=1; j<=NR; ++j) printf "%s", a[i,j] }' file
Answer 5:
My old answer is below and no longer applicable...
You can use this awk for a file that could be millions of rows or millions of columns. The basic scheme is to suck all the values into a single array, then use indexing arithmetic and nested loops at the end to get the correct order (the index arithmetic is illustrated after the sample run below):
$ cat col.awk
{
    for (i=1; i<=NF; i++) {
        vals[++numVals] = $i
    }
}
END {
    for (col_offset=0; col_offset + cols <= NF; col_offset += cols) {
        for (i=1; i<=numVals; i+=NF) {
            for (j=0; j<cols; j++) {
                printf "%s%s", vals[i+j+col_offset], (j<cols-1 ? FS : ORS)
            }
        }
    }
}
$ awk -f col.awk -v cols=2 file
4 5
m d
t 7
h 5
...
3 4
4 1
d f
5 9
q w
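To make the index arithmetic concrete (a sketch, not part of the original answer): for a rectangular file, field f of row r ends up at vals[(r-1)*NF + f], so vals[i+j+col_offset] picks field col_offset+j+1 of the row whose values start at index i. For the sample file (NF=8, cols=2) and row 3:

$ awk 'BEGIN {
    nf = 8; cols = 2; r = 3
    i = 1 + (r - 1) * nf    # row 3 occupies vals[17]..vals[24]
    for (col_offset = 0; col_offset + cols <= nf; col_offset += cols)
        for (j = 0; j < cols; j++)
            printf "group %d: field %d of row %d -> vals[%d]\n",
                   col_offset / cols + 1, col_offset + j + 1, r, i + j + col_offset
}'
group 1: field 1 of row 3 -> vals[17]
group 1: field 2 of row 3 -> vals[18]
group 2: field 3 of row 3 -> vals[19]
...
group 4: field 8 of row 3 -> vals[24]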
My old answer was based on the dramatic slowdown seen in most of these awks with a large number of rows.
See this question for more discussion regarding the slowdown.
The comments in the original answer below are no longer applicable.
OLD ANSWER
Only here for consistency...
The awk solutions here are all good for small files. What they all have in common is that the file either needs to fit in RAM, or the OS virtual memory has to be an acceptable fallback if it does not. But with a larger file, since the run time of the awk increases exponentially, you can get a very bad result. With a 12 MB version of your file, an in-memory awk becomes unusably slow.
This is the case if there are millions of rows, not millions of columns.
The only alternative to an in-memory solution is reading the file multiple times or managing temp files yourself. (Or use a scripting language that manages VM internally, such as Perl or Python... Timur Shatland's Perl is fast even on huge files.)
awk does not have an easy mechanism to loop over a file multiple times until a process is done. You would need to use the shell to do that and invoke awk multiple times.
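For completeness, here is a minimal sketch of that multi-pass shell-plus-awk idea (not from the original answer; it assumes bash and a rectangular input file named input.file). Each pass streams the whole file and prints one group of n columns, so awk never holds more than one record in memory:

n=2
ncols=$(awk '{ print NF; exit }' input.file)        # column count, taken from the first line
for ((start = 1; start <= ncols; start += n)); do   # one pass per group of n columns
    awk -v s="$start" -v n="$n" '{
        for (i = s; i < s + n && i <= NF; i++)
            printf "%s%s", $i, (i == s + n - 1 || i == NF ? ORS : OFS)
    }' input.file
done

The trade-off is extra I/O: the file is read ncols/n times. That is essentially the approach the following Python script takes.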
Here is a Python script that reads the file line by line and prints cols columns at a time until all the original columns have been printed:
$ cat sys.py
import sys

filename = sys.argv[1]
cols = int(sys.argv[2])
offset = 0
delimiter = " "

# get the column count from the first line
with open(filename, "r") as f:
    max_cols = len(f.readline().split())

# re-read the file once per group of 'cols' columns
while offset < max_cols:
    with open(filename, "r") as f:
        for line in f:
            col_li = line.rstrip().split()
            l = len(col_li)
            max_cols = l if l > max_cols else max_cols
            print(delimiter.join(col_li[offset:offset + cols]))
    offset += cols
It is counterintuitive, but it is often significantly faster and more efficient to read a file multiple times than it is to gulp the entire thing -- if that gulp then results in a bad result with larger data.
So how does this perform compared to one of the awks in this post? Let's time it.
Given your example, the in memory awk will likely be faster:
$ cat file
4 5 6 2 9 8 4 8
m d 6 7 9 5 4 g
t 7 4 2 4 2 5 3
h 5 6 2 5 s 3 4
r 5 7 1 2 2 4 1
4 1 9 0 5 6 d f
x c a 2 3 4 5 9
0 0 3 2 1 4 q w
$ time python pys.py file 2 >file2
real 0m0.027s
user 0m0.009s
sys 0m0.016s
$ time awk -v n=2 '{for(i=1;i<=NF;++i) {
j=int((i-1)/n); a[j] = a[j] $i (i%n==0 || i==NF ?ORS:OFS) }}
END{for(j=0;j<=int(NF/n);j++) printf "%s", a[j]}' file >file3
real 0m0.009s
user 0m0.003s
sys 0m0.003s
And that is true. BUT, let's make the file 1000x bigger with this Python script:
txt='''\
4 5 6 2 9 8 4 8
m d 6 7 9 5 4 g
t 7 4 2 4 2 5 3
h 5 6 2 5 s 3 4
r 5 7 1 2 2 4 1
4 1 9 0 5 6 d f
x c a 2 3 4 5 9
0 0 3 2 1 4 q w
'''
with open('/tmp/file', 'w') as f:
    f.write(txt*1000)    # change the 1000 to the multiple desired
# the file will have 8000 lines and about 125 KB
Rerun those timings, same way, and you get:
#python
real 0m0.061s
user 0m0.044s
sys 0m0.015s
# awk
real 0m0.050s
user 0m0.043s
sys 0m0.004s
About the same time... Now make the file BIGGER by multiplying the original by 100,000 to get 800,000 lines and 12MB and run the timing again:
# python
real 0m3.475s
user 0m3.434s
sys 0m0.038s
#awk
real 22m45.118s
user 16m40.221s
sys 6m4.652s
With a 12 MB file, the in-memory method becomes essentially unusable, since the VM system on this computer is subject to massive disk swapping to manage that particular type of memory allocation. It is likely O(n**2) or worse. This computer is a 2019 Mac Pro with a 16-core Xeon and 192 GB of memory, so it is not the hardware...
Source: https://stackoverflow.com/questions/65598986/which-awk-program-can-do-this-manipulation