How do I get gawk to transpose my data into a csv file

Submitted by 倾然丶 夕夏残阳落幕 on 2021-01-07 02:58:28

Question


I have a bunch of input text files that look like this:

   measured       5.7       0.0000    0.0000    0.0125    0.0161    0.0203    0.0230    0.0233    0.0236    0.0241
                            0.0243    0.0239    0.0235    0.0226    0.0207    0.0184    0.0147    0.0000    0.0000


   measured       7.4       0.0000    0.0000    0.0160    0.0207    0.0260    0.0295    0.0298    0.0302    0.0308
                            0.0311    0.0306    0.0300    0.0289    0.0264    0.0235    0.0187    0.0000    0.0000

Each file has a couple of lines like that.

I want to take all of these files, cut out 'measured' and the first number (e.g. 5.7 and 7.4), and put the rest in a CSV file so the values are sorted into columns like this:

My gawk command is

BEGIN { OFS = "\n" }
/measured/ { c=2; $1=$2=""; $0=$0 }
c && c-- { $1=$1; print }

Which I run as part of a for loop on Windows:

for %f in (*) do (gawk -f column.txt %f) >> finaloutput\burnup.csv

And that just produces a long column of numbers like this:

How do I get gawk to transpose the data into separate columns instead of one big long column?


Answer 1:


$1=$2="";

this was the problem. You threw away exactly what you wanted (the 5.7) by having "measured" and "5.7" both go blank. The rest of the work became a futile exercise of rebuilding fields without the data you need.

c=2 only sets how many lines you want to print; it doesn't save any of the values before the ="".
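To illustrate the diagnosis, here's a minimal sketch (my own, not the answerer's code) of saving the value into a variable before blanking the fields; it works with any POSIX awk:

```shell
echo 'measured 5.7 0.0125 0.0161' |
awk '/measured/ {
    val = $2             # remember the 5.7 before touching the record
    $1 = $2 = ""         # blank out "measured" and the number
    $0 = $0; $1 = $1     # re-split, then rebuild to squeeze out the blanks
    print val, $0        # the value is still available in val
}'
# prints: 5.7 0.0125 0.0161
```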

Also, any particular reason why you need a loop for those files? Just cat (or pv, my preferred method) them over to gawk. (I think mawk2-beta might run this at least 60-70% faster than gawk 5.1.0. AWK is the insane tool that has a much easier time than Python or R when it comes to shoving gigs at it at a time.)

Perhaps also consider using GNU grep (not the gd BSD one like on my Mac - the syntax and escaping required just to match ABC \n XYZ on that thing is surreal)?

I think this might be one of those great use cases for the --text flag (so it won't go screaming about irregular bytes here and there), the -h flag (when you're sourcing from multiple files), and the -o flag to capture only the portion you need. But I'm not 100% sure whether a concurrent grep like that might exhibit any inconsistencies.
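As a rough illustration of those flags (the sample file and pattern here are mine, and the pattern assumes the wanted numbers all look like 0.xxxx, which conveniently skips "measured" and the 5.7):

```shell
printf '   measured  5.7  0.0125  0.0161\n' > sample.txt
# --text : treat the file as text even if it contains a few odd bytes
# -h     : suppress filename prefixes when reading multiple files
# -oE    : print only the matched portions, one match per line
grep -h --text -oE '0\.[0-9]{4}' sample.txt
# prints 0.0125 and 0.0161, each on its own line
```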

I'm a super novice in Python and Perl, so I can't help much there.




Answer 2:


Try this new solution. Tested on mawk2-beta; it should work elsewhere. It handles the transpose automatically, without multi-dimensional arrays, because the regex inside RS already flattens the whole thing into one column.

Setting TOTCOL should only occur once, which is handled by setting FS to newline: whenever a record spans the end of a line, NF is at least 2, if not more.

Furthermore, it doesn't need to assume 9 or 11 columns at all; that's auto-computed. So unless the input file has more than a billion columns, this trick won't be an issue.

I backed NR up by 2 initially so the % wouldn't wrap back to 0 one spot too early.

mawk2 'BEGIN { TOTCOL = 1E9; NR -= 2; FS = ORS; OFS = "";
               RS = "([\n]*[ \t]+measured[ \t]+[^ \t]+)?[ \t]+";
       }
       (NR  < TOTCOL && NF > 1) { TOTCOL = 2 * length(outS) }
       (NR == TOTCOL)           { OFS = "\t" }
       { outS[NR % TOTCOL] = outS[NR % TOTCOL] OFS $1 }
       END { for (trspd in outS) print outS[trspd] }'



Answer 3:


Finally fixed all the issues. This will work across gawk / mawk-1.3 / mawk2. The reason is that mawk-1 must initialize the array first, or it'll whack out.

gawk/mawk/mawk2 'BEGIN { TOTCOL = 1E8; FS = "[\n]+"; NR -= 2; OFS = "";
       RS = "((^|[\n]+)[ \t]+measured[ \t]+[^ \t]+)?[ \t]+";
       outS[""]++;
       delete outS[""];       # forces outS to exist as an array (for mawk-1)
     }
     (NR  < TOTCOL && NF > 1) { TOTCOL = 2 * length(outS) }
     (NR == TOTCOL)           { delete outS[-1]; OFS = "\t" }
     { outS[NR % TOTCOL] = outS[NR % TOTCOL] OFS $1 }
     END { trspd = 0
           nx = length(outS)
           do {
               print outS[trspd]
           } while (++trspd < nx) }'



Answer 4:


Here's one:

$ awk -v RS="" '{                       # read empty line separated blocks
    for(i=3;i<=NF;i++)                  # loop from 3rd field to the end
        a[i]=a[i] sprintf("%10s",$i+0)  # append field i to array element i
}
END {                                   # in the end
    for(i=3;i in a;i++)                 # output
        print a[i]
}' file

Output:

     0         0
     0         0
0.0125     0.016
0.0161    0.0207
0.0203     0.026
 0.023    0.0295
0.0233    0.0298
0.0236    0.0302
0.0241    0.0308
0.0243    0.0311
0.0239    0.0306
0.0235      0.03
0.0226    0.0289
0.0207    0.0264
0.0184    0.0235
0.0147    0.0187
     0         0
     0         0

It's dumb (since I'm lazy) in the sense that sprintf uses a static 10-character width for the right-justified output.
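One possible tweak (my own sketch, untested against the OP's full data, and assuming all blocks have the same number of fields): track the widest cell while reading and compute the pad width in END instead of hard-coding 10:

```shell
printf 'measured 5.7 0.0125 0.016\n\nmeasured 7.4 0.0160 0.0207\n' |
awk -v RS="" '{
    blk++                                   # one block per empty-line-separated record
    for (i = 3; i <= NF; i++) {
        v[i, blk] = $i + 0                  # store the raw value
        if (length($i + 0) > w)
            w = length($i + 0)              # remember the widest cell
    }
    rows = NF                               # assumes equal-length blocks
}
END {
    for (i = 3; i <= rows; i++) {
        line = ""
        for (b = 1; b <= blk; b++)
            line = line sprintf("%" (w + 2) "s", v[i, b])
        print line
    }
}'
```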




Answer 5:


Perhaps this will do what you want?

awk 'NR%2{printf "%s ",$0;next;}1' input_text_file.csv | awk '{for (i=1; i<=NF; i++) {a[NR,i] = $i} } NF>p { p = NF } END {for(j=1; j<=p; j++) {str=a[1,j]; for(i=2; i<=NR; i++){str=str" "a[i,j];} print str}}'
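For readability, here is the same two-stage pipeline reformatted with comments and run on a tiny sample input of my own (not the OP's data); the logic is unchanged:

```shell
printf '0.1 0.2\n0.3 0.4\n' |
awk 'NR % 2 { printf "%s ", $0; next } 1' |    # join each pair of lines into one row
awk '{
    for (i = 1; i <= NF; i++)
        a[NR, i] = $i                          # store the full matrix
}
NF > p { p = NF }                              # track the widest row
END {
    for (j = 1; j <= p; j++) {                 # classic row/column transpose
        str = a[1, j]
        for (i = 2; i <= NR; i++)
            str = str " " a[i, j]
        print str
    }
}'
# prints 0.1, 0.2, 0.3, 0.4 on separate lines
```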


Source: https://stackoverflow.com/questions/65364966/how-do-i-get-gawk-to-transpose-my-data-into-a-csv-file
