I have a fairly large CSV file (15 GB), and I need to read about 1 million random lines from it.
As far as I can see - and implement - the CSV utility in Python only allows t
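import csv
import random

# "data.csv" is a placeholder path, not the real file name
with open("data.csv", newline="") as file:
    # pass 1, count the number of rows in the file
    rowcount = sum(1 for line in file)
    # pass 2, select random lines
    file.seek(0)
    remaining = 1000000
    for row in csv.reader(file):
        # keep this row with probability remaining / rowcount
        if random.randrange(rowcount) < remaining:
            print(row)
            remaining -= 1
        rowcount -= 1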
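
For what it's worth, the same two-pass idea can be wrapped in a small function so it can be tried on a tiny file before running it on the real one. This is only a sketch under a few assumptions: `sample_rows`, its parameters, and the in-memory `demo` data are names I made up for illustration, and the first pass counts rows with `csv.reader` rather than raw lines so that quoted fields containing newlines would not throw off the count.

```python
import csv
import io
import random


def sample_rows(f, k, rng=random):
    """Return k rows picked uniformly at random from the seekable text
    file f, in file order. Hypothetical helper: two passes, selection sampling.
    """
    # Pass 1: count CSV rows (not raw lines).
    rowcount = sum(1 for _ in csv.reader(f))
    remaining = min(k, rowcount)
    picked = []
    # Pass 2: keep each row with probability remaining / rowcount.
    f.seek(0)
    for row in csv.reader(f):
        if remaining == 0:
            break  # sample is complete, no need to read the rest
        if rng.randrange(rowcount) < remaining:
            picked.append(row)
            remaining -= 1
        rowcount -= 1
    return picked


# Small in-memory stand-in for the real 15 GB file.
demo = io.StringIO("".join("%d,row%d\n" % (i, i) for i in range(100)))
sample = sample_rows(demo, 10)
print(len(sample))
```

Because `remaining` never exceeds the number of rows still unread, the loop always ends with exactly `min(k, rowcount)` rows selected, and it still reads the file sequentially twice, which is the cost I would like to avoid.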