Randomly selecting lines from files

后端 未结 7 755
轮回少年
轮回少年 2021-01-13 14:21

I have bunch of files and very file has a header of 5 lines. In the rest of the file, pair of line form an entry. I need to randomly select entry from these files. How can

7条回答
  •  半阙折子戏
    2021-01-13 15:11

    If the file is small enough, read the pairs of lines into memory and select randomly from that data structure. If the file is too large, Eugene Y provides the right answer: use reservoir sampling.

    Here's an intuitive explanation for the algorithm.

    Process the file line by line.
    pick = line, with probability 1/N, where N = line number
    

    In other words, on line 1, we will pick line 1 with 1/1 probability. On line 2, we will change the pick to line 2, with 1/2 probability. On line 3, we will change the pick to line 3, with 1/3 probability. Etc.

    For an intuitive proof, imagine a file with 3 lines:

            1            Pick line 1.
           / \
         .5  .5
         /     \
        2       1        Switch to line 2?
       / \     / \
     .67 .33 .33 .67
     /     \ /     \
    2       3       1    Switch to line 3?
    

    The probability for each outcome:

    Line 1: .5 * .67     = 1/3
    Line 2: .5 * .67     = 1/3
    Line 3: .5 * .33 * 2 = 1/3
    

    From there, the rest is induction. For example, suppose the file has 4 lines. We've already convinced ourselves that as of line 3, every line so far (1, 2, 3) will have an equal chance of being our current selection. When we advance to line 4, it will have a 1/4 chance of being picked -- exactly what it should be, thus reducing the probabilities on the previous 3 lines by exactly the right amount (1/3 * 3/4 = 1/4).

    Here's the Perl FAQ answer, adapted to your problem.

    use strict;
    use warnings;
    
    # Ignore 5 lines.
    <> for 1 .. 5;
    
    # Use reservoir sampling to select pairs from remaining lines.
    my (@picks, $n);
    until (eof){
        my @lines;
        $lines[$_] = <> for 0 .. 1;
    
        $n ++;
        @picks = @lines if rand($n) < 1;
    }
    
    print @picks;
    

提交回复
热议问题