Question
I've got the following function inside a Perl script:
sub fileSize {
    my $file = shift;
    my $opt  = shift;
    open (FILE, $file) or die "Could not open file $file: $!";
    $/ = ">";
    my $junk = <FILE>;
    my $g_size = 0;
    while ( my $rec = <FILE> ) {
        chomp $rec;
        my ($name, @seqLines) = split /\n/, $rec;
        my $sec = join('', @seqLines);
        $g_size += length($sec);
        if ( $opt == 1 ) {
            open TMP, ">>", "tmp" or die "Could not open chr_sizes.log: $!\n";
            print TMP "$name\t", length($sec), "\n";
        }
    }
    if ( $opt == 0 ) {
        PrintLog( "file_size: $g_size", 0 );
    }
    else {
        print TMP "file_size: $g_size\n";
        close TMP;
    }
    $/ = "\n";
    close FILE;
}
Input file format:
>one
AAAAA
>two
BBB
>three
C
I have several input files with that format. The lines beginning with ">" are the same in every file, but the other lines can have different lengths. The output of the function with only one file is:
one 5
two 3
three 1
I want to execute the function in a loop, calling it like this for each file:
foreach my $file ( @refs ) {
    fileSize( $file, 1 );
}
When running the next iteration, let's say with this file:
>one
AAAAABB
>two
BBBVFVF
>three
CS
I'd like to obtain this output:
one 5 7
two 3 7
three 1 2
How can I modify the function or the script to get this? As can be seen, my function appends the text to the file.
Thanks!
Answer 1:
I've left out your options and the file IO operations and have concentrated on showing a way to do this with an array of arrays from the command line. I hope it helps. I'll leave wiring it up to your own script and subroutines mostly up to you :-)
Running this one-liner against your first data file:
perl -lne ' $name = s/>//r if /^>/ ;
push @strings , [$name, length $_] if !/^>/ ;
END { print "@{$_ } " for @strings }' datafile1.txt
gives this output:
one 5
two 3
three 1
Substituting the second version of the data file (i.e. where record one contains AAAAABB) gives the expected results as well.
one 7
two 7
three 2
In your script above, you save to an output file in this format. So, to append columns to each row in your output file, we can just munge each of your data files in the same way (with any luck this might mean things can be converted into a function that will work in a foreach loop). If we save the transformed data to be output into an array of arrays (AoA), then we can just push the length value we get for each data file string onto the corresponding anonymous array element and then print out the array. Voilà! Now let's hope it works ;-)
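As a bare-bones illustration of that append step (the names @log and @new are made up for this sketch: @log would hold the rows already in the output file, @new the [name, length] pairs parsed from the next data file):
for my $i ( 0 .. $#log ) {
    push @{ $log[$i] }, $new[$i][1];   # tack the newest length onto each row
}
print "@$_\n" for @log;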
You might want to install Data::Printer, which can be used from the command line as -MDDP, to visualize data structures.
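For example, a quick sanity check from the shell could look like this (assuming Data::Printer is installed; the sample array is made up):
perl -MDDP -E 'my @aoa = ( [ "one", 5 ], [ "two", 3 ] ); p @aoa'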
- First - run the above script and redirect the output to a file with > /tmp/output.txt
- Next - try this longish one-liner that uses DDP and p to show the structure of the array we create:
perl -MDDP -lne 'BEGIN{ local @ARGV=shift; @tmp = map { [split] } <>; p @tmp }
    $name = s/>//r if /^>/ ;
    push @out , [ $name, length $_ ] if !/^>/ ;
    END{ p @out ; }' /tmp/output.txt datafile2.txt
In the BEGIN block we local-ize @ARGV; shift off the first file (our version of your TMP file) - {local @ARGV=shift} is almost a Perl idiom for handling multiple input files; we then split it inside an anonymous array constructor ([]) and map { } that into the @tmp array, which we display with DDP's p() function.
Once we are out of the BEGIN block, the implicit while (<>){ ... } that we get with perl's -n command line switch takes over and reads in the remaining file from @ARGV; we process lines starting with > - stripping the leading character and assigning the string that follows to the $name variable; the while continues and we push $name and the length of any line that does not start with > (if !/^>/), wrapped as elements of an anonymous array [], onto the @out array, which we display with p() as well (in the END{} block so it doesn't print inside our implicit while() loop). Phew!!
See the AoA that results as a gist @Github.
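If the one-liner is hard to follow in that compressed form, the same logic can be written out longhand as a small script - this is only a sketch that spells out what -l/-n and the BEGIN block are doing, and it expects the saved log as the first argument with the data file(s) after it:
#!/usr/bin/env perl
use strict;
use warnings;
use DDP;

my $logfile = shift @ARGV;          # first argument: the saved log (our TMP file)
my @tmp;
{
    local @ARGV = ($logfile);       # point <> at the log file only
    @tmp = map { [split] } <>;      # one anonymous array per log row
    p @tmp;
}

my ($name, @out);
while (<>) {                        # <> now reads the remaining data file(s)
    chomp;
    if (/^>/) {
        ($name = $_) =~ s/^>//;     # record name without the leading ">"
    }
    else {
        push @out, [ $name, length ];   # [ name, sequence length ]
    }
}
p @out;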
- Finally - building on that, and now that we have munged things nicely, we can change a few things in our END{...} block (add a nested for loop to push things around) and put this all together to produce the output we want.
This one-liner:
perl -MDDP -lne 'BEGIN{ local @ARGV=shift; @tmp = map {[split]} <>; }
    $name = s/>//r if /^>/ ; push @out, [ $name, length $_ ] if !/^>/ ;
    END{ foreach $row (0..$#tmp) { push @{$tmp[$row]}, $out[$row][-1] } ;
    print "@$_" for @tmp }' output.txt datafile2.txt
produces:
one 5 7
two 3 7
three 1 2
We'll have to convert that into a script :-)
The script consists of three rather wordy subroutines that read the log file, parse the data file, and merge them; we run them in order. The first one checks to see whether there is an existing log; if not, it creates one and then does an exit to skip any further parsing/merging steps.
You should be able to wrap them in a loop of some kind that feeds files to the subroutines from an array instead of fetching them from STDIN (a rough sketch of such a loop follows after the script). One caution - I'm using IO::All because it's fun and easy!
use 5.14.0 ;
use IO::All ;

my @file = io(shift)->slurp ;          # slurp the data file named on the command line
my $log  = "output.txt" ;

&readlog ;
&parsedatafile ;
&mergetolog ;

####### subs #######

sub readlog {
    if ( ! -R $log ) {
        print "creating first log entry\n" ;
        my @newlog = &parsedatafile ;
        open( my $fh, '>', $log ) or die "I CAN HAZ WHA???? $!" ;
        print $fh "@$_ \n" for @newlog ;
        exit ;                          # nothing to merge on the first run
    }
    else {
        map { [split] } io($log)->slurp ;   # return the existing log as an AoA
    }
}

sub parsedatafile {
    my ( @out, $name ) ;
    for (@file) {                       # walk the slurped lines
        chomp ;
        $name = s/>//r if /^>/ ;
        push @out, [ $name, length $_ ] if !/^>/ ;
    }
    @out ;
}

sub mergetolog {
    my @tmp  = readlog ;
    my @data = parsedatafile ;
    foreach my $row ( 0 .. $#tmp ) {
        push @{ $tmp[$row] }, $data[$row][-1] ;   # append the new length column
    }
    open( my $fh, '>', $log ) or die "Foobar!!! $!" ;
    print $fh "@$_ \n" for @tmp ;
}
The subroutines do all the work here - you can likely find ways to shorten, combine, and improve them. Is this a useful approach for you?
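If you do want to drive this from an array of files rather than from the command line, a rough skeleton might look like the following - it is only a sketch, and it assumes the subroutines are reworked to take the slurped lines as an argument (and that the exit in readlog becomes a return, so it doesn't stop the loop):
use 5.14.0 ;
use IO::All ;

my $log  = "output.txt" ;
my @refs = @ARGV ;                    # or however your list of data files is built

for my $datafile ( @refs ) {
    my @file = io($datafile)->slurp ;
    # readlog, parsedatafile and mergetolog would be called here with
    # \@file (or @file) passed in, so that each pass appends one more
    # column of lengths to $log.
}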
I hope this explanation is clear and useful to someone - corrections and comments welcome. Probably the same thing could be done with in-place editing (i.e. with perl -pi -e '...'), which is left as an exercise to those that follow ...
Answer 2:
You need to open the output file itself: first in read mode, then in write mode.
I have written a script that does what you are asking. What really matters is the part that appends new data to old data; adapt that to your fileSize function.
So you have the output file, output.txt, of the form:
one 5
two 3
three 1
And an array of input files, input1.txt, input2.txt, etc., saved in the @inputfiles variable, of the form:
>one
AAAAA
>two
BBB
>three
C
>four
DAS
and
>one
AAAAABB
>two
BBBVFVF
>three
CS
Respectively.
After running the following Perl script,
# First read previous output file.
open OUT, '<', "output.txt" or die $!;
my @outlines;
while (my $line = <OUT>) {
    chomp $line;
    push @outlines, $line;
}
close OUT;
my $outsize = scalar @outlines;

# Suppose you have your array of input file names already prepared
my @inputfiles = ("input1.txt", "input2.txt");

foreach my $file (@inputfiles) {
    open IN, '<', $file or die $!;
    my $counter = 1; # Used to compare against output size
    while (my $line = <IN>) {
        chomp $line;
        $line =~ m/^>(.*)$/;
        my $name = $1;
        my $sequence = <IN>;
        chomp $sequence;
        my $seqsize = length($sequence);
        # Here is where I append a column to output data.
        if ($counter <= $outsize) {
            $outlines[$counter - 1] .= " $seqsize";
        } else {
            $outlines[$counter - 1] = "$name $seqsize";
        }
        $counter++;
    }
    close IN;
}

# Now rewrite the results to output.txt
open OUT, '>', "output.txt" or die $!;
foreach (@outlines) {
    print OUT "$_\n";
}
close OUT;
you generate this output:
one 5 5 7
two 3 3 7
three 1 1 2
four 3
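If you would rather keep everything inside the original fileSize function from the question, the same append-a-column idea might be folded in roughly like this (a sketch only, not a drop-in replacement: the $opt / PrintLog handling is omitted for brevity, and the log file name "tmp" is taken from the question):
sub fileSize {
    my $file    = shift;
    my $logfile = "tmp";

    # Read whatever columns are already in the log (if any).
    my @outlines;
    if ( -e $logfile ) {
        open my $old, '<', $logfile or die "Could not open $logfile: $!";
        chomp( @outlines = <$old> );
        close $old;
    }

    # Parse the ">"-delimited records, as in the original function.
    open my $in, '<', $file or die "Could not open file $file: $!";
    local $/ = ">";
    my $junk = <$in>;                       # discard the empty leading record
    my ( $i, $g_size ) = ( 0, 0 );
    while ( my $rec = <$in> ) {
        chomp $rec;
        my ( $name, @seqLines ) = split /\n/, $rec;
        my $len = length join '', @seqLines;
        $g_size += $len;
        if ( defined $outlines[$i] ) { $outlines[$i] .= " $len"; }        # append a column
        else                         { $outlines[$i]  = "$name $len"; }   # first run: new row
        $i++;
    }
    close $in;

    # Rewrite the log with the extra column in place.
    open my $out, '>', $logfile or die "Could not open $logfile: $!";
    print {$out} "$_\n" for @outlines;
    close $out;

    return $g_size;
}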
Source: https://stackoverflow.com/questions/26890652/append-a-new-column-to-file-in-perl