UPDATE
I suspect that the input and desired output data I initially put in wasn't exactly the same as what I have with respect to whitespace. I'
You don't really want to load the input data into memory, because it's so large. Instead, a streaming approach will be faster, and for this awk is well suited:
    #!/usr/bin/awk -f
    BEGIN {
        FS = "\t"
        OFS = FS
    }
    NR == 1 {
        # collect sample names
        for (i = 1; i <= NF; i++) {
            sample[i] = $i
        }
    }
    NR == 2 {
        # first four columns are always the same
        cols[1] = 1
        cols[2] = 3
        cols[3] = 4
        cols[4] = 5
        # count columns explicitly; length(array) is not supported by every awk
        ncols = 4
        printf "%s %s %s %s ", sample[1], $3, $4, $5
        # dynamic columns (in practice: 2,6,10,...)
        for (i = 1; i <= NF; i++) {
            if ($i == "Beta_value") {
                cols[++ncols] = i
                printf "%s ", sample[i]
            }
        }
        printf "\n"
    }
    NR >= 3 {
        # print the selected columns from each data row
        for (i = 1; i <= ncols; i++) {
            printf "%s ", $cols[i]
        }
        printf "\n"
    }
This gives your desired output. If you want more speed, consider using awk only to print the column numbers (which requires reading just the two header rows), then cut to extract them. This should be faster because no interpreted code needs to run for each data row. Two caveats: cut always writes fields in input order, whatever order you list them in, and tab is already its default delimiter (in most shells a quoted '\t' reaches cut as the two characters \ and t, which it rejects as a delimiter). For the sample data in the question, the cut command you need to print all the data rows is something like this:
    cut -f 1,3,4,5,2,6
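The two-step idea could be wired together roughly as follows. This is only a sketch under assumptions: the function names `beta_cols` and `extract_rows` are made up for illustration, the file is assumed to be tab-separated with exactly two header rows, and the header line still has to be produced separately (for example with the awk script above). Because cut writes fields in input order, the data columns come out in file order rather than in the order of the list.

```shell
#!/bin/sh
# Hypothetical helper: print a comma-separated field list for cut(1),
# built from the second header row of a tab-separated file.
beta_cols() {
    awk '
        BEGIN { FS = "\t" }
        NR == 2 {
            list = "1,3,4,5"              # fixed columns, as in the awk script
            for (i = 1; i <= NF; i++)
                if ($i == "Beta_value")
                    list = list "," i     # dynamic columns
            print list
            exit                          # nothing after the header rows is read
        }
    ' "$1"
}

# Hypothetical helper: skip the two header rows and let cut do the per-row work.
extract_rows() {
    tail -n +3 "$1" | cut -f "$(beta_cols "$1")"
}
```

With these defined, `extract_rows input.tsv` (where `input.tsv` is a placeholder name) would print the selected columns of every data row. Whether this actually beats the pure awk version on your data is worth timing rather than assuming.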