I have this quite big CSV file (15 Gb) and I need to read about 1 million random lines from it. As far as I can see - and implement - the CSV utility in Python only allows t
I have this quite big CSV file (15 Gb) and I need to read about 1 million random lines from it
Assuming you don't need exactly 1 million lines and know then number of lines in your CSV file beforehand, you can use reservoir sampling to retrieve your random subset. Simply iterate through your data and for each line determine the chances of the line being selected. That way you only need a single pass of your data.
This works well if you need to extract the random samples often but the actual dataset changes infrequently (since you'll only need to keep track of the number of entries each time the dataset changes).
chances_selected = desired_num_results / total_entries
for line in csv.reader(file):
if random() < chances_selected:
result.append(line)