How does List::Util 'shuffle' actually work?

允我心安 提交于 2021-02-18 12:33:04

问题


I am currently working on building a classifier using c5.0. I have a dataset of 8000 entries and each entry has its own i.d number (1-8000). When testing the performance of the classifier I had to make 5sets of 10:90 (training data: test data) splits. Of course any training cases cannot appear again in the test cases, and duplicates cannot occur in either set.

To solve the problem of picking examples at random for the training data, and making sure the same cannot be picked for the test data I have developed a horribly slow method;

  • fill a file with numbers from 1-8000 on separate lines.

  • randomly pick a line number (from a range of 1-8000) and use the contents of the line as the id number of the training example.

  • write all unpicked numbers to a new file

  • decrement the range of the random number generator by 1

  • redo

Then all unpicked numbers are used as test data. It works but its slow. To speed things up I could use List::Util 'shuffle' to just 'randomly' shuffle and array of these numbers. But how random is 'shuffle'? It is essential that the same level of accuracy is maintained. Sorry about the essay, but does anyone know how 'shuffle' actually works. Any help at all would be great


回答1:


Here is the shuffle algorithm used in List::Util::PP

sub shuffle (@) {
  my @a=\(@_);
  my $n;
  my $i=@_;
  map {
    $n = rand($i--);
    (${$a[$n]}, $a[$n] = $a[$i])[0];
  } @_;
}

Which looks like a Fisher-Yates shuffle.



来源:https://stackoverflow.com/questions/5168104/how-does-listutil-shuffle-actually-work

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!