I have this quite big CSV file (15 Gb) and I need to read about 1 million random lines from it. As far as I can see - and implement - the CSV utility in Python only allows t
If you want to grab random lines many times (e.g., mini-batches for machine learning), and you don't mind scanning through the huge file once (without loading it into memory), then you can create a list of line indeces and use seek to quickly grab the lines (based off of Maria Zverina's answer).
# Overhead:
# Read the line locations into memory once. (If the lines are long,
# this should take substantially less memory than the file itself.)
fname = 'big_file'
s = [0]
linelocs = [s.append(s[0]+len(n)) or s.pop(0) for n in open(fname)]
f = open(fname) # Reopen the file.
# Each subsequent iteration uses only the code below:
# Grab a 1,000,000 line sample
# I sorted these because I assume the seeks are faster that way.
chosen = sorted(random.sample(linelocs, 1000000))
sampleLines = []
for offset in chosen:
f.seek(offset)
sampleLines.append(f.readline())
# Now we can randomize if need be.
random.shuffle(sampleLines)