Read random lines from huge CSV file in Python

独厮守ぢ 2020-12-05 02:27

I have this quite big CSV file (15 GB) and I need to read about 1 million random lines from it. As far as I can see - and implement - the CSV utility in Python only allows t

11 answers
  •  北海茫月
    2020-12-05 02:49

    If you want to grab random lines many times (e.g., mini-batches for machine learning), and you don't mind scanning through the huge file once (without loading it into memory), then you can build a list of the byte offsets at which each line starts and use seek to jump straight to the chosen lines (based on Maria Zverina's answer).

    import random

    # Overhead:
    # Read the line locations into memory once.  (If the lines are long,
    # this should take substantially less memory than the file itself.)
    fname = 'big_file'
    linelocs = []  # byte offset at which each line starts
    offset = 0
    # Binary mode, so len(line) counts bytes and matches what seek() expects.
    with open(fname, 'rb') as f:
        for line in f:
            linelocs.append(offset)
            offset += len(line)
    f = open(fname, 'rb')  # Reopen the file for random access.

    # Each subsequent iteration uses only the code below:
    # Grab a 1,000,000 line sample (the file needs at least that many lines).
    # I sorted these because I assume the seeks are faster that way.
    chosen = sorted(random.sample(linelocs, 1000000))
    sampleLines = []
    for offset in chosen:
        f.seek(offset)
        sampleLines.append(f.readline())  # bytes; decode before use as text
    # Now we can randomize if need be.
    random.shuffle(sampleLines)
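
    Since the goal is CSV rows rather than raw lines, the same idea can be wrapped in two small functions and the sampled lines handed to csv.reader. This is only a sketch under some assumptions: the names build_line_offsets and sample_rows are made up for illustration, the file is assumed to be UTF-8, and no quoted field is assumed to contain an embedded newline (otherwise a line offset could land in the middle of a record).

```python
import csv
import random


def build_line_offsets(path):
    """Scan the file once and return the byte offset where each line starts."""
    offsets = []
    position = 0
    with open(path, "rb") as f:  # binary mode: len(line) is a byte count
        for line in f:
            offsets.append(position)
            position += len(line)
    return offsets


def sample_rows(path, offsets, k, encoding="utf-8", rng=random):
    """Return k randomly chosen rows, each parsed into a list of fields."""
    chosen = sorted(rng.sample(offsets, k))  # sorted so seeks move forward
    lines = []
    with open(path, "rb") as f:
        for offset in chosen:
            f.seek(offset)
            lines.append(f.readline().decode(encoding))
    rng.shuffle(lines)  # undo the ordering introduced by sorting
    # csv.reader accepts any iterable of strings, one per line.
    return list(csv.reader(lines))
```

    If the file has a header row, passing build_line_offsets(fname)[1:] to sample_rows keeps the header out of the sample. The offsets list is built once and can be reused for every mini-batch.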
    
