Which AWK program can do this manipulation?


Question


Given a file whose contents are arranged like the following (with fields separated by spaces or tabs, i.e. SP or HT):

4 5 6 2 9 8 4 8
m d 6 7 9 5 4 g
t 7 4 2 4 2 5 3
h 5 6 2 5 s 3 4
r 5 7 1 2 2 4 1
4 1 9 0 5 6 d f
x c a 2 3 4 5 9
0 0 3 2 1 4 q w

Which AWK program do I need to get the following output?

  4 5
  m d
  t 7
  h 5
  r 5
  4 1
  x c
  0 0
  6 2
  6 7
  4 2
  6 2
  7 1
  9 0
  a 2
  3 2
  9 8
  9 5
  4 2
  5 s
  2 2
  5 6
  3 4
  1 4
  4 8
  4 g
  5 3
  3 4
  4 1
  d f
  5 9
  q w

Thanks in advance for any and all help.

Postscript

Please bear in mind,

  1. My input file is much larger than the one depicted in this question.

  2. My computer science skills are seriously limited.

  3. This task has been imposed on me.


Answer 1:


awk -v n=4 '
    function join(start, end,    result, i) {
        for (i=start; i<=end; i++)
            result = result $i (i==end ? ORS : FS)
        return result
    }
    {
        c=0
        for (i=1; i<NF; i+=n) {
            c++
            col[c] = col[c] join(i, i+n-1)
        }
    }
    END {
        for (i=1; i<=c; i++)
            printf "%s", col[i]  # the value already ends with newline
    }
' file
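
To get the two-column output shown in the question, run the same program with n=2 instead of n=4. For example (assuming the program body has been saved as columnize.awk, the file name used in the benchmarks below):

    $ awk -v n=2 -f columnize.awk file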

The awk info page has a short primer on awk, so read that too.


Benchmarking

  1. create an input file with 1,000,000 columns and 8 rows (as specified by OP)

    #!perl
    my $cols = 2**20; # 1,048,576
    my $rows = 8;
    my @alphabet=( 'a'..'z', 0..9 );
    my $size = scalar @alphabet;
    
    for ($r=1; $r <= $rows; $r++) {
        for ($c = 1; $c <= $cols; $c++) {
            my $idx = int rand $size;
            printf "%s ", $alphabet[$idx];
        }
        printf "\n";
    }
    
    $ perl createfile.pl > input.file
    $ wc input.file
           8  8388608 16777224 input.file
    
  2. time various implementations: I use the fish shell, so the timing output is different from bash's

    • my awk

      $ time awk -f columnize.awk -v n=4 input.file > output.file
      
      ________________________________________________________
      Executed in    3.62 secs   fish           external
         usr time    3.49 secs    0.24 millis    3.49 secs
         sys time    0.11 secs    1.96 millis    0.11 secs
      
      $ wc output.file
       2097152  8388608 16777216 output.file
      
    • Timur's perl:

      $ time perl -lan columnize.pl input.file > output.file
      
      ________________________________________________________
      Executed in    3.25 secs   fish           external
         usr time    2.97 secs    0.16 millis    2.97 secs
         sys time    0.27 secs    2.87 millis    0.27 secs
      
    • Ravinder's awk

      $ time awk -f columnize.ravinder input.file > output.file
      
      ________________________________________________________
      Executed in    4.01 secs   fish           external
         usr time    3.84 secs    0.18 millis    3.84 secs
         sys time    0.15 secs    3.75 millis    0.14 secs
      
    • kvantour's awk, first version

      $ time awk -f columnize.kvantour -v n=4 input.file > output.file
      
      ________________________________________________________
      Executed in    3.84 secs   fish           external
         usr time    3.71 secs  166.00 micros    3.71 secs
         sys time    0.11 secs  1326.00 micros    0.11 secs
      
    • kvantour's second awk version: interrupted with Ctrl-C after a few minutes

      $ time awk -f columnize.kvantour2 -v n=4 input.file > output.file
      ^C
      ________________________________________________________
      Executed in  260.80 secs   fish           external
         usr time  257.39 secs    0.13 millis  257.39 secs
         sys time    1.68 secs    2.72 millis    1.67 secs
      
      $ wc output.file
       9728 38912 77824 output.file
      

      The $0=a[j] line is pretty expensive, as it has to parse the string into fields each time.

    • dawg's python

      $ timeout 60s fish -c 'time python3 columnize.py input.file 4 > output.file'
      [... 60 seconds later ...]
      $ wc output.file
       2049  8196 16392 output.file
      
  3. another interesting data point: using different awk implementations. I'm on a Mac with GNU awk and mawk installed via homebrew

    • with many columns, few rows

      $ time gawk -f columnize.awk -v n=4 input.file > output.file
      
      ________________________________________________________
      Executed in    3.78 secs   fish           external
         usr time    3.62 secs  174.00 micros    3.62 secs
         sys time    0.13 secs  1259.00 micros    0.13 secs
      
      $ time /usr/bin/awk -f columnize.awk -v n=4 input.file > output.file
      
      ________________________________________________________
      Executed in   17.73 secs   fish           external
         usr time   14.95 secs    0.20 millis   14.95 secs
         sys time    2.72 secs    3.45 millis    2.71 secs
      
      $ time mawk -f columnize.awk -v n=4 input.file > output.file
      
      ________________________________________________________
      Executed in    2.01 secs   fish           external
         usr time  1892.31 millis    0.11 millis  1892.21 millis
         sys time   95.14 millis    2.17 millis   92.97 millis
      
    • with many rows and few columns; this test took over half an hour on a MacBook Pro (6-core Intel CPU, 16 GB RAM)

      $ time mawk -f columnize.awk -v n=4 input.file > output.file
      
      ________________________________________________________
      Executed in   32.30 mins   fish           external
         usr time   23.58 mins    0.15 millis   23.58 mins
         sys time    8.63 mins    2.52 millis    8.63 mins
      



Answer 2:


Use this Perl script:

perl -lane '
push @rows, [@F];
END {
    my $delim = "\t";
    my $cols_per_group = 2;
    my $col_start = 0;
    while ( 1 ) {
         for my $row ( @rows ) {
             print join $delim, @{$row}[ $col_start .. ($col_start + $cols_per_group - 1) ];
         }
         $col_start += $cols_per_group;
         last if ($col_start + $cols_per_group - 1) > $#F;
    } 
}
' in_file > out_file

The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.

This script reads the file into memory. This is okay for most modern computers and the file sizes in question.

Each line is split on whitespace (use -F'\t' to split on TABs instead) into the array @F. A reference to this array is stored for each line as an element of the array @rows. After the file has been read, at the end of the script (in the END { ... } block), the contents of the file are printed in groups of columns, with $cols_per_group columns per group. Columns are delimited by $delim.

SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches




Answer 3:


Could you please try the following, written and tested in GNU awk with only the samples shown.

awk '
{
  # append the current row values $i and $(i+1) to the string collected for this column pair
  for(i=1;i<=NF;i+=2){
    arr[i]=(arr[i]?arr[i] ORS :"")$i OFS $(i+1)
  }
}
END{
  # print the collected pairs, one block of rows per pair of columns
  for(i=1;i<=NF;i+=2){
    print arr[i]
  }
}' Input_file



Answer 4:


Since we all love awk, here is another one:

awk -v n=2 '{for(i=1;i<=NF;++i) { j=int((i-1)/n); a[j] = a[j] $i (i%n==0 || i==NF ?ORS:OFS) }}
            END{for(j=0;j<=int(NF/n);j++) printf "%s", a[j]}'

This will output exactly what is requested by the OP.

How does it work?

Awk performs its actions once per record/line it reads. For each record, it walks over all the fields and appends them to a set of strings stored in an array a. It does this in such a way that a[0] collects the first n columns, a[1] the second group of n columns, and so on. The relation between field number and string index is given by int((i-1)/n).

When building the strings, we keep track of whether we need to append a field separator OFS or a newline (the record separator ORS). We decide this based on the field number modulo the number of columns per group (i.e. n). Note that we always use ORS when processing the last field.
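
As a quick illustration of that mapping (my own example, not part of the original answer), the following shows, for n=2, which string each field of the first sample row is appended to:

$ echo '4 5 6 2 9 8 4 8' | awk -v n=2 '{ for (i=1; i<=NF; i++) print $i, "-> a[" int((i-1)/n) "]" }'
4 -> a[0]
5 -> a[0]
6 -> a[1]
2 -> a[1]
9 -> a[2]
8 -> a[2]
4 -> a[3]
8 -> a[3]

Fields 1 and 2 therefore accumulate in a[0], fields 3 and 4 in a[1], and so on, which is exactly the order in which the END loop prints them.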

An alternative approach: thanks to a comment from Dawg, a flaw in the above code was found. He pointed out that the program scales really badly when moving to large files. The real reason for this is not 100% known, but I assume it comes from constantly having to rewrite memory with operations such as a[j] = a[j] $i (i%n==0 || i==NF ? ORS : OFS). This can be avoided by buffering the entire file and doing all the work at the end:

awk -v n=2 '{a[NR]=$0}
            END{ for(i=1;i<=NF;i+=n)
                   for(j=1;j<=NR;++j) {
                      $0=a[j]
                      for(k=0;k<n&&(i+k<=NF);++k)
                         printf "%s%s", $(i+k), ((i+k==NF || ((i+k) % n == 0)) ? ORS : OFS) 
                   }
            }' file

Note: the latter is only efficient for a small number of columns. This is because of the constant re-splitting done with $0=a[j]; the split takes much more time when there are many fields. The complexity of this approach is O(NF^2 * NR), so for the wide benchmark file above (NF around 10^6, NR = 8) that amounts to on the order of 10^13 field operations, which is why that run had to be interrupted.

A final alternative approach: while the first solution is fast for a large number of columns and a small number of rows, the second is fast for a small number of columns and a large number of rows. Below is a final version that is not as fast, but is stable and gives the same timing for a file and for its transpose.

awk -v n=2 '{ for(i=1;i<=NF;i+=n) {
                s=""
                for(k=0;k<n&&(i+k<=NF);++k) 
                   s=s $(i+k) ((i+k==NF || ((i+k) % n == 0)) ? ORS : OFS);
                a[i,NR]=s 
             }
            }
            END{for(i=1;i<=NF;i+=n)for(j=1;j<=NR;++j) printf "%s",a[i,j]}' file



Answer 5:


My old answer is below and no longer applicable...

You can use this awk for a file that could be millions of rows or millions of columns. The basic scheme is to suck all the values into a single array then use indexing arithmetic and nested loops at the end to get the correct order:

$ cat col.awk
{
    # collect every field of the file into one flat array
    for (i=1; i<=NF; i++) {
        vals[++numVals] = $i
    }
}
END {
    # walk the flat array, printing "cols" columns per output line
    for (col_offset=0; col_offset+cols <= NF; col_offset+=cols) {
        for (i=1; i<=numVals; i+=NF) {
            for (j=0; j<cols; j++) {
                printf "%s%s", vals[i+j+col_offset], (j<cols-1 ? FS : ORS)
            }
        }
    }
}

$ awk -f col.awk -v cols=2 file
4 5
m d
t 7
h 5
...
3 4
4 1
d f
5 9
q w

My old answer is based on the dramatic slowdown seen in most of these awks with a large number of rows.

See this question for more discussion regarding the slowdown.

The comments in the original answer below are no longer applicable.


OLD ANSWER

Only here for consistency...

The awk solutions here are all good for small files. What they all have in common is that the file either needs to fit in RAM, or the OS's virtual memory has to be an acceptable fallback when it does not. But with a larger file, since the runtime of the awk grows much faster than linearly, you can get a very bad result. With a 12 MB version of your file, an in-memory awk becomes unusably slow.

This is the case if there are millions of rows not millions of columns.

The only alternative to an in-memory solution is reading the file multiple times or managing temp files yourself. (Or use a scripting language that manages VM internally, such as Perl or Python... Timur Shatland's Perl is fast even on huge files.)

awk does not have an easy mechanism to loop over a file multiple times until a process is done. You would need to use the shell to do that and invoke awk multiple times.
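
For illustration only (this is a sketch of mine, not code from any of the answers), such a shell-driven multi-pass run could look like the following, assuming bash, seq, an even number of columns, and rows of equal length:

ncols=$(awk '{ print NF; exit }' file)   # column count, taken from the first row
for start in $(seq 1 2 "$ncols"); do     # one full pass over the file per pair of columns
    awk -v s="$start" '{ print $s, $(s+1) }' file
done

Each pass re-reads the whole file, so memory use stays flat, at the cost of reading the file ncols/2 times -- the same trade-off the Python script below makes.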

Here is a Python script that reads the file line by line and prints cols columns at a time, until all of the original columns have been printed:

$ cat pys.py
import sys
filename=sys.argv[1]
cols=int(sys.argv[2])
offset=0
delimiter=" "

with open(filename, "r") as f:
    max_cols=len(f.readline().split())

while offset<max_cols:
    with open(filename, "r") as f:  
        for line in f:
            col_li=line.rstrip().split()
            l=len(col_li)
            max_cols = l if l > max_cols else max_cols   # track the widest row seen
            print(delimiter.join(col_li[offset:offset+cols]))

        offset+=cols

It is counterintuitive, but it is often significantly faster and more efficient to read a file multiple times than it is to gulp the entire thing -- if that gulp then results in a bad result with larger data.

So how does this perform compared to one of the awks in this post? Let's time it.

Given your example, the in memory awk will likely be faster:

$ cat file
4 5 6 2 9 8 4 8
m d 6 7 9 5 4 g
t 7 4 2 4 2 5 3
h 5 6 2 5 s 3 4
r 5 7 1 2 2 4 1
4 1 9 0 5 6 d f
x c a 2 3 4 5 9
0 0 3 2 1 4 q w

$ time python pys.py file 2 >file2
real    0m0.027s
user    0m0.009s
sys 0m0.016s

$ time awk -v n=2 '{for(i=1;i<=NF;++i) { 
  j=int((i-1)/n); a[j] = a[j] $i (i%n==0 || i==NF ?ORS:OFS) }}
  END{for(j=0;j<=int(NF/n);j++) printf "%s", a[j]}' file >file3
real    0m0.009s
user    0m0.003s
sys 0m0.003s

And that is true. BUT, let's make the file 1000x bigger with this Python script:

txt='''\
4 5 6 2 9 8 4 8
m d 6 7 9 5 4 g
t 7 4 2 4 2 5 3
h 5 6 2 5 s 3 4
r 5 7 1 2 2 4 1
4 1 9 0 5 6 d f
x c a 2 3 4 5 9
0 0 3 2 1 4 q w
'''

with open('/tmp/file', 'w') as f:
    f.write(txt*1000)              # change the 1000 to the multiple desired

# file will have 8000 lines and about 125KB

Rerun those timings, same way, and you get:

#python
real    0m0.061s
user    0m0.044s
sys 0m0.015s

# awk
real    0m0.050s
user    0m0.043s
sys 0m0.004s

About the same time... Now make the file BIGGER by multiplying the original by 100,000 to get 800,000 lines and 12MB and run the timing again:

# python
real    0m3.475s
user    0m3.434s
sys 0m0.038s

#awk
real    22m45.118s
user    16m40.221s
sys 6m4.652s

With a 12 MB file, the in-memory method becomes essentially unusable, since the VM system on this computer is subject to massive disk swapping to manage that particular type of memory allocation. It is likely O(n^2) or worse. This computer is a 2019 Mac Pro with a 16-core Xeon and 192 GB of memory, so it is not the hardware...



Source: https://stackoverflow.com/questions/65598986/which-awk-program-can-do-this-manipulation
