Read random lines from huge CSV file in Python

独厮守ぢ 2020-12-05 02:27

I have this quite big CSV file (15 Gb) and I need to read about 1 million random lines from it. As far as I can see - and implement - the CSV utility in Python only allows t

11 answers
  •  忘掉有多难
    2020-12-05 03:06

    I have this quite big CSV file (15 Gb) and I need to read about 1 million random lines from it

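    Assuming you don't need exactly 1 million lines and know the number of lines in your CSV file beforehand, you can use a simple single-pass sampling scheme to retrieve your random subset (strictly speaking this is Bernoulli sampling rather than reservoir sampling, since the sample size is only approximately the desired one). Iterate through your data and select each line independently with probability desired_num_results / total_entries. That way you only need a single pass over your data.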

    This works well if you need to extract the random samples often but the actual dataset changes infrequently (since you'll only need to keep track of the number of entries each time the dataset changes).

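    import csv, random

    chances_selected = desired_num_results / total_entries  # per-line probability
    result = []
    with open(filename, newline="") as file:  # filename: path to the CSV (assumed)
        for line in csv.reader(file):
            if random.random() < chances_selected:
                result.append(line)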
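
The snippet above returns roughly, not exactly, the desired number of lines, and it needs the total line count up front. If an exact sample size matters, classic reservoir sampling (often called Algorithm R) may be an option. The following is only a minimal sketch: the function name `reservoir_sample`, its `rng` parameter, and the in-memory demo data are assumptions made for illustration, not part of the original answer.

```python
import csv
import io
import random


def reservoir_sample(rows, k, rng=random):
    """Return k rows chosen uniformly at random from an iterable of unknown length.

    If the iterable yields fewer than k rows, all of them are returned.
    """
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Row i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir


# Small in-memory demo (hypothetical data); for a real file one would pass
# csv.reader(open(path, newline="")) instead.
demo = io.StringIO("".join(f"{n},{n * n}\n" for n in range(1000)))
sample = reservoir_sample(csv.reader(demo), k=10, rng=random.Random(0))
print(len(sample))  # 10
```

This still reads the file once, but it keeps all k selected rows in memory, so for 1 million rows the memory needed depends on how wide each row is.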
    
