Hash key as array to obtain the minimum and maximum numbers if a columns values are equal

我怕爱的太早我们不能终老 提交于 2019-12-01 14:30:57

Perl script:

use strict;

# Not shown... Parse the data file, stuff into an array of arrays.

my @data = (
    [ 'miRNA17', 70, 105, 'dvex699824',  12, 233 ],
    [ 'miRNA17', 21,  60, 'dvex699824',  42,  20 ],
    [ 'miRNA17', 55,  89, 'dvex699824',   6,  40 ],
    [ 'miRNA18', 58,  85, 'dvex701176', 119,  92 ],
    [ 'miRNA17', 66, 105, 'dvex703815',  35,  75 ],
    [ 'miRNA17', 31,  71, 'dvex703815',  43,  83 ],
    [ 'miRNA17', 39,  79, 'dvex703815',  43,  83 ],
    [ 'miRNA2',  28,  56, 'dvex731981', 313, 286 ],
    [ 'miRNA17', 10,  70, 'dvex735428', 142, 203 ],
    [ 'miRNA17', 29,  91, 'dvex735428', 213, 152 ],
    [ 'miRNA17', 66, 105, 'dvex735668', 163, 125 ]
);

my %results;

foreach my $record (@data) {
    my ($mirna, $col2, $col3, $dvex, $col5, $col6) = @$record;
    $results{$mirna}{$dvex}{col2} = $col2; # don't care.
    $results{$mirna}{$dvex}{col3} = $col3; # don't care.
    $results{$mirna}{$dvex}{col5} = $col5
        if not $results{$mirna}{$dvex}{col5} or $results{$mirna}{$dvex}{col5} > $col5;
    $results{$mirna}{$dvex}{col6} = $col6
        if not $results{$mirna}{$dvex}{col6} or $results{$mirna}{$dvex}{col6} < $col6;      
}


foreach my $mirna (keys %results) {
    foreach my $dvex (sort keys %{$results{$mirna}}) {
        printf "%-8s  %5d  %5d  %-10s  %3d %3d\n",
            $mirna, $results{$mirna}{$dvex}{col2}, $results{$mirna}{$dvex}{col3},
            $dvex, $results{$mirna}{$dvex}{col5}, $results{$mirna}{$dvex}{col6};
    }
}

1;

Output:

miRNA2       28     56  dvex731981  313 286
miRNA17      55     89  dvex699824    6 233
miRNA17      39     79  dvex703815   35  83
miRNA17      29     91  dvex735428  142 203
miRNA17      66    105  dvex735668  163 125
miRNA18      58     85  dvex701176  119  92

It is much more efficient to process files sequentially, without loading everything into a big array first, whenever possible. Here is what such a solution could look like:

my @output_line = split / /, <IN_FILE>;
while (<IN_FILE>)
{
    my @current_line = split / /, $_;

    if ($current_line[0] ne $output_line[0])
    {
         printf OUT_FILE "%-8s  %5d  %5d  %-10s  %3d %3d\n", @output_line;
         @output_line = @current_line;
    }
    else
    {
         $output_line[1] = $current_line[1] if ($current_line[1] < $output_line[1]);
         $output_line[2] = $current_line[2] if ($current_line[2] > $output_line[2]);
         $output_line[4] = $current_line[4] if ($current_line[4] < $output_line[4]);
         $output_line[5] = $current_line[5] if ($current_line[5] > $output_line[5]);
    }
}

printf OUT_FILE "%-8s  %5d  %5d  %-10s  %3d %3d\n", @output_line;

Caveat: your question stated that the output lines should have the "same dvex####". However, your sample output didn't show this. Thus I ignored that requirement. However, you could easily introduce that requirement simply by putting another condition in the if statement.

Second Caveat: this approach also requires lines that will be grouped to be next to each other, as they were in your sample data.

This is a simple script that will produce the output you wanted, though it does more than your requirements state as it also checks min/max for columns 2 and 3.

I am using List::Util to get the min/max values, which is purely convenience. The module is core since v5.7.3, so it should not present a problem. Using Text::CSV is prudent, but might not be required, depending on your data. Assuming no non-tab whitespace in your columns, one can get away with using split, which will remove the module dependency.

use strict;
use warnings;
use Text::CSV;
use List::Util qw(min max);

my $csv = Text::CSV->new({
    sep_char => "\t",   
    eol => $/,      # required for $csv->print
    binary  => 1,});

my %data;
my @order;

# *DATA and *STDOUT represent file handles, and can be replaced with
# any other file handle as you require. DATA is used here for simplicity.
#
while (my $row = $csv->getline(*DATA)) {
    my ($mir, $dv) = @{$row}[0,3];

    my $field = "$mir/$dv";
    unless (defined $data{$field}) { # new fields are stored as-is
        push @order, $field;         # preserving original order of input
        $data{$field} = $row;
        next;
    }
    $data{$field}[1] = min($data{$field}[1], $row->[1]);
    $data{$field}[2] = max($data{$field}[2], $row->[2]);
    $data{$field}[4] = min($data{$field}[4], $row->[4]);
    $data{$field}[5] = max($data{$field}[5], $row->[5]);
}

for my $field (@order) {
    $csv->print(*STDOUT, $data{$field});
}


__DATA__
miRNA17 70  105 dvex699824  12  233
miRNA17 21  60  dvex699824  42  20
miRNA17 55  89  dvex699824  6   40
miRNA18 58  85  dvex701176  119 92
miRNA17 66  105 dvex703815  35  75
miRNA17 31  71  dvex703815  43  83
miRNA17 39  79  dvex703815  43  83
miRNA2  28  56  dvex731981  313 286
miRNA17 10  70  dvex735428  142 203
miRNA17 29  91  dvex735428  213 152
miRNA17 66  105 dvex735668  163 125

Output:

miRNA17 21      105     dvex699824      6       233
miRNA18 58      85      dvex701176      119     92
miRNA17 31      105     dvex703815      35      83
miRNA2  28      56      dvex731981      313     286
miRNA17 10      91      dvex735428      142     203
miRNA17 66      105     dvex735668      163     125

Note that the output does not match yours, because you have failed to distinguish between the dvex numbers for the last line of your sample input.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!