How to selectively remove columns and rows with bash or python

独厮守ぢ 2021-01-16 15:10

UPDATE

I suspect that the input and desired output data I initially posted weren't exactly the same as what I have with respect to whitespace. I'

1 answer

情歌与酒 (original poster)

2021-01-16 15:45

You don't really want to load the input data into memory, because it's so large. Instead, a streaming approach will be faster, and for this awk is well suited:

#!/usr/bin/awk -f

BEGIN {
    FS = "\t";
    OFS = FS;
}

NR == 1 {
    # collect sample names
    for (i=1; i <= NF; i++) {
        sample[i] = $i
    }
}

NR == 2 {
    # first four columns are always the same
    cols[1] = 1
    cols[2] = 3
    cols[3] = 4
    cols[4] = 5
    printf "%s %s %s %s ", sample[1], $3, $4, $5

    # dynamic columns (in practice: 2,6,10,...)
    for (i=1; i <= NF; i++) {
        if ($i == "Beta_value") {
            cols[length(cols)+1] = i
            printf "%s ", sample[i]
        }
    }
    printf "\n"
}

NR >= 3 {
    # print cols from data row
    for (i=1; i <= length(cols); i++) {
        printf "%s ", $cols[i]
    }
    printf "\n"
}

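This gives your desired output. If you want more speed, consider using awk only to print the column numbers (which requires reading just the two header rows), then cut to print the data. This will be faster because no interpreted code needs to run for each data row. Two caveats: cut always writes fields in input order, regardless of the order given to -f, so the Beta_value columns stay in their original positions instead of following the first four; and tab is already cut's default delimiter, so no -d option is needed (a single-quoted '\t' is a backslash followed by t, not a tab, and GNU cut rejects it because the delimiter must be a single character). For the sample data in the question, the cut command that selects the columns is something like this:

cut -f 1,3,4,5,2,6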
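The two-step idea above can be sketched as a small shell function. This is only a sketch, assuming tab-separated input with two header rows as in the question; the function name select_columns is hypothetical. It relies on tab being cut's default delimiter, and on cut writing the selected fields in input order whatever the order of the list.

```shell
#!/bin/sh
# select_columns FILE  (hypothetical helper name, not from the original answer)
# Prints the selected columns of every data row of a tab-separated FILE.
select_columns() {
    input=$1

    # Step 1: awk reads only the two header rows and builds the field list:
    # the fixed columns 1,3,4,5 plus every column labelled "Beta_value".
    fields=$(awk -F '\t' '
        NR == 2 {
            list = "1,3,4,5"
            for (i = 1; i <= NF; i++)
                if ($i == "Beta_value")
                    list = list "," i
            print list
            exit    # nothing after the second header row is needed
        }' "$input")

    # Step 2: skip the two header rows and let cut stream the data rows.
    # Tab is the default delimiter of cut, so no -d option is passed.
    tail -n +3 "$input" | cut -f "$fields"
}
```

Calling select_columns on the input file prints the data rows only; the header line would still have to be printed separately, for example by the awk step.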
