Perl: Programming Efficiency when computing correlation coefficients for a large set of data


Question


EDIT: Link should work now, sorry for the trouble

I have a text file that looks like this:

Name, Test 1, Test 2, Test 3, Test 4, Test 5
Bob, 86, 83, 86, 80, 23
Alice, 38, 90, 100, 53, 32
Jill, 49, 53, 63, 43, 23

I am writing a program that, given this text file, generates a Pearson's correlation coefficient table like the one below, where entry (x, y) is the correlation between person x and person y:

Name,Bob,Alice,Jill
Bob, 1, 0.567088412588577, 0.899798494392584
Alice, 0.567088412588577, 1, 0.812425393004088
Jill, 0.899798494392584, 0.812425393004088, 1

My program works, except that the data set I am feeding it has 82 columns and, more importantly, 54000 rows. When I run my program right now, it is incredibly slow and I get an out of memory error. Is there a way I can, first of all, remove any possibility of an out of memory error and, second, make the program run a little more efficiently? The code is here: code.

Thanks for your help,
Jack

Edit: In case anyone else is trying to do large-scale computation like this, convert your data into HDF5 format. This is what I ended up doing to solve the issue.


Answer 1:


You're going to have to do on the order of 54000^2 * 82 calculations and comparisons. Of course it's going to take a lot of time. Are you holding everything in memory? That's going to be pretty large too. It will be slower, but it might use less memory if you keep the users in a database and calculate one user against all the others, then go on to the next and do the same, instead of using one massive array or hash.
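For what it's worth, here is a minimal sketch of that one-user-at-a-time idea (the file name scores.csv and the pearson() helper are hypothetical, not from the original post): it re-reads the input once per person, so only two score vectors are ever held in memory at once.

use strict;
use warnings;
use List::Util qw(sum);

# Plain Pearson's r for two equal-length array references.
sub pearson {
    my ($x, $y) = @_;
    my $n  = @$x;
    my $mx = sum(@$x) / $n;
    my $my = sum(@$y) / $n;
    my ($cov, $vx, $vy) = (0, 0, 0);
    for my $i (0 .. $n - 1) {
        my ($dx, $dy) = ($x->[$i] - $mx, $y->[$i] - $my);
        $cov += $dx * $dy;
        $vx  += $dx ** 2;
        $vy  += $dy ** 2;
    }
    return $cov / sqrt($vx * $vy);
}

# Outer loop: one person at a time; inner loop: re-scan the file.
open my $outer_fh, '<', 'scores.csv' or die $!;
<$outer_fh>;                                   # skip the header line
while (my $outer = <$outer_fh>) {
    chomp $outer;
    my ($name_a, @a) = split /,\s*/, $outer;
    my @row = ($name_a);
    open my $inner_fh, '<', 'scores.csv' or die $!;
    <$inner_fh>;                               # skip the header line
    while (my $inner = <$inner_fh>) {
        chomp $inner;
        my ($name_b, @b) = split /,\s*/, $inner;
        push @row, pearson(\@a, \@b);
    }
    close $inner_fh;
    print join(', ', @row), "\n";              # one output row per person
}
close $outer_fh;

This trades speed for memory: the file is scanned 54000 times, which is exactly the "one user against all the others" pattern described above.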




Answer 2:


Have a look at Tie::File to avoid the high memory usage of holding your input and output files entirely in memory.
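For example, a minimal sketch (the file name is hypothetical): Tie::File maps the lines of a file onto an array and only reads lines from disk as they are accessed.

use strict;
use warnings;
use Tie::File;

# @rows behaves like an array of lines, but the lines stay on disk and are
# fetched (and written back) on demand instead of being slurped all at once.
tie my @rows, 'Tie::File', 'scores.csv' or die "Cannot tie file: $!";

my $header     = $rows[0];      # "Name, Test 1, Test 2, ..."
my $first_data = $rows[1];      # "Bob, 86, 83, 86, 80, 23"

# Assigning to an element rewrites that line of the file in place, e.g.:
# $rows[1] = "Bob, 86, 83, 86, 80, 24";

untie @rows;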




Answer 3:


Have you searched CPAN? My own search yielded another method, gsl_stats_correlation, for computing Pearson's correlation. This one is in Math::GSL::Statistics, a module that binds to the GNU Scientific Library.

gsl_stats_correlation($data1, $stride1, $data2, $stride2, $n) - This function efficiently computes the Pearson correlation coefficient between the array references $data1 and $data2, which must both be of the same length $n:

r = \frac{\operatorname{cov}(x, y)}{\hat\sigma_x \, \hat\sigma_y}
  = \frac{\frac{1}{n-1} \sum (x_i - \hat{x})(y_i - \hat{y})}
         {\sqrt{\frac{1}{n-1} \sum (x_i - \hat{x})^2} \, \sqrt{\frac{1}{n-1} \sum (y_i - \hat{y})^2}}
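A minimal sketch of how that call could be used (assuming Math::GSL is installed; the scores are the sample rows from the question):

use strict;
use warnings;
use Math::GSL::Statistics qw(gsl_stats_correlation);

# Bob's and Alice's test scores from the sample data above.
my $bob   = [86, 83, 86, 80, 23];
my $alice = [38, 90, 100, 53, 32];

# A stride of 1 means "use every element"; the last argument is the length.
my $r = gsl_stats_correlation($bob, 1, $alice, 1, scalar @$bob);
print "r = $r\n";   # roughly 0.567, matching the table in the question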




Answer 4:


You may want to look at PDL:

PDL ("Perl Data Language") gives standard Perl the ability to compactly store and speedily manipulate the large N-dimensional data arrays which are the bread and butter of scientific computing

.




Answer 5:


Essentially, Paul Tomblin has given you the answer: it's a lot of calculation, so it will take a long time. It's a lot of data, so it will take a lot of memory.

However, there may be one gotcha: If you use perl 5.10.0, your list assignments at the start of each method may be victims of a subtle performance bug in that version of perl (cf. perlmonks thread).

A couple of minor points:

The printout may actually slow down the program somewhat, depending on where it goes.

There is no need to reopen the output file for each line! Just do something like this:

open my $out, '>', 'file.txt' or die $!;
# Header row: column labels 1 .. n, matching the row numbering below.
print $out "Name, ", join(", ", 1 .. $#{$correlations[0]} + 1), "\n";
my $rowno = 1;
foreach my $row (@correlations) {
    print $out "$rowno, ", join(", ", @$row), "\n";
    $rowno++;
}
close $out;

Finally, while I do use Perl whenever I can, with a program and data set such as you describe it might be simplest to use C++ with its iostreams (which make the parsing easy enough) for this task.

Note that all of this is just minor optimization. There's no algorithmic gain.




Answer 6:


I don't know enough about what you are trying to do to give good advice about the implementation, but you might look at Statistics::LSNoHistory; it claims to have a pearson_r method that returns Pearson's r correlation coefficient.
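A minimal sketch of what that might look like (assumption: the xvalues/yvalues constructor arguments and the pearson_r method behave as the module's documentation describes; the scores are the sample rows from the question):

use strict;
use warnings;
use Statistics::LSNoHistory;

# Regression of Alice's scores against Bob's from the sample data.
my $reg = Statistics::LSNoHistory->new(
    xvalues => [86, 83, 86, 80, 23],
    yvalues => [38, 90, 100, 53, 32],
);
print $reg->pearson_r, "\n";    # Pearson's r for the two score lists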




Answer 7:


Further to the answer above about PDL, here is how to calculate the correlation table quite efficiently, even for very big datasets:

use PDL::Stats; # this useful module can be downloaded from CPAN
my $data = random(82, 5400); # your data should replace this
my $table = $data->corr_table(); # that's all, really

You might need to set $PDL::BIGPDL = 1; in the header of your script and make sure you run this on a machine with A LOT of memory. The computation itself is reasonably fast; an 82 x 5400 table took only a few seconds on my laptop.



Source: https://stackoverflow.com/questions/675021/perl-programming-efficiency-when-computing-correlation-coefficients-for-a-large
