Read a small random sample from a big CSV file into a Python data frame

暖寄归人 2020-11-27 02:37

The CSV file that I want to read does not fit into main memory. How can I read a few (~10K) random lines of it and do some simple statistics on the selected data frame?

13 answers
  •  余生分开走
    2020-11-27 03:23

    Here is an algorithm that doesn't require counting the number of lines in the file beforehand, so you only need to read the file once.

    Say you want m samples. The algorithm, known as reservoir sampling, first keeps the first m lines. When it sees the i-th line (i > m), it accepts that line with probability m/i, and an accepted line replaces one of the m currently selected lines, chosen uniformly at random.

    As a result, for any i > m, the m lines held at that point are a uniform random sample of the first i lines: each of those i lines is in the selection with probability m/i.
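
    The claim above can be checked with a short induction argument. This is a sketch for the inclusion probability of a single line; the symbols j and P_i are introduced here for illustration and are not from the original answer. P_i(j) denotes the probability that line j is in the selection after i lines have been read.

```latex
% Base case: after the first m lines, every one of them is kept.
P_m(j) = 1 = \frac{m}{m}, \qquad 1 \le j \le m.

% Inductive step: assume P_i(j) = m/i for all j \le i.
% Line i+1 is accepted with probability m/(i+1) and then evicts
% one particular slot with probability 1/m, so line j survives with
% probability 1 - \frac{m}{i+1}\cdot\frac{1}{m} = \frac{i}{i+1}.
P_{i+1}(j) = \frac{m}{i} \cdot \frac{i}{i+1} = \frac{m}{i+1}, \qquad 1 \le j \le i.

% The new line itself is included with probability
P_{i+1}(i+1) = \frac{m}{i+1}.
```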

    See code below:

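    import random
    
    n_samples = 10
    samples = []
    
    # "big.csv" is a placeholder path; replace it with your own file
    with open("big.csv") as f:
        for i, line in enumerate(f):
            if i < n_samples:
                # Fill the selection with the first n_samples lines
                samples.append(line)
            elif random.random() < n_samples / (i + 1):
                # Accept line i+1 with probability n_samples/(i+1)
                # and overwrite a uniformly chosen slot
                samples[random.randint(0, n_samples - 1)] = line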
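
To get from sampled lines to something a data frame can be built from, the header row has to be kept out of the sample. Below is a minimal sketch of one way to do that, using the same accept-then-replace rule as above. The function names `reservoir_sample` and `sample_csv_lines`, the `rng` parameter, and the in-memory demo data are all assumptions made for illustration; they are not part of the original answer.

```python
import io
import random


def reservoir_sample(lines, m, rng=random):
    """Return up to m items drawn uniformly at random from an iterable, in one pass."""
    reservoir = []
    for i, line in enumerate(lines):
        if i < m:
            # The first m items always go in.
            reservoir.append(line)
        elif rng.random() < m / (i + 1):
            # Item i+1 (1-based) is accepted with probability m/(i+1)
            # and overwrites a uniformly chosen slot.
            reservoir[rng.randrange(m)] = line
    return reservoir


def sample_csv_lines(f, m, rng=random):
    """Read the header row from file object f, then sample m of the remaining lines."""
    header = next(f)
    return header, reservoir_sample(f, m, rng)


# Demo on an in-memory stand-in for a large file (hypothetical data).
demo = io.StringIO("id,value\n" + "".join(f"{k},{k * k}\n" for k in range(1000)))
header, rows = sample_csv_lines(demo, 10, random.Random(0))

# Header plus sampled rows form a small, valid CSV text.
sampled_csv = header + "".join(rows)
```

If pandas is installed, `pandas.read_csv(io.StringIO(sampled_csv))` should then give a data frame of the sampled rows. Note that sampling by physical line assumes one record per line, so a CSV with quoted fields that contain newlines would need a record-aware reader instead.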
    
