Read random lines from huge CSV file in Python

独厮守ぢ 2020-12-05 02:27

I have a fairly large CSV file (15 GB) and I need to read about 1 million random lines from it. As far as I can see - and implement - the CSV utility in Python only allows t

11 Answers
  •  谎友^ (OP)
    2020-12-05 03:03

    import os
    import random
    
    path = 'really_big_file'
    filesize = os.path.getsize(path)    # size of the really big file, in bytes
    offset = random.randrange(filesize)
    
    f = open(path, 'rb')                # binary mode, so any byte offset is a valid seek target
    f.seek(offset)                      # go to random position
    f.readline()                        # discard - bound to be partial line
    random_line = f.readline()          # bingo! (bytes; decode if you need str)
    
    # extra to handle last/first line edge cases
    if len(random_line) == 0:           # we have hit the end
        f.seek(0)
        random_line = f.readline()      # so we'll grab the first line instead
    f.close()
    

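    As @AndreBoos pointed out, this approach will lead to biased selection: a line is returned whenever the random offset lands inside the line before it, so lines that follow long lines are picked more often. If you know the minimum and maximum line lengths, you can remove this bias by doing the following: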

    Let's assume (in this case) we have min=3 and max=15

    1) Find the length (Lp) of the previous line.

    If Lp = 3, the line is the most biased against, so we should keep it 100% of the time. If Lp = 15, the line is the most favored: it is 5 times more likely to be selected, so we should keep it only 20% of the time.

    We accomplish this by randomly keeping the line X% of the time where:

    X = min / Lp

    If we don't keep the line, we do another random pick until our dice roll comes good. :-)
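
The keep-or-retry rule above could be sketched as follows. This is only an illustration under stated assumptions, not the answerer's own code: the names `line_start` and `random_line_unbiased` are hypothetical, line lengths are counted in bytes including the trailing newline, every line is assumed to be between `min_len` and `max_len` bytes long, and the file is assumed to end with a newline. The maximum length is used here only to bound how far back to look for the start of the line the offset landed in.

```python
import os
import random


def line_start(f, pos, max_len):
    """Return the offset of the first byte of the line that contains byte `pos`."""
    lo = max(0, pos - max_len)
    f.seek(lo)
    chunk = f.read(pos - lo)
    newline = chunk.rfind(b"\n")
    # No newline in the window means the line begins at `lo` (start of file).
    return lo if newline == -1 else lo + newline + 1


def random_line_unbiased(path, min_len, max_len, rng=random):
    """Pick one line, giving every line the same chance of being returned.

    Assumes each line (newline included) is min_len..max_len bytes long
    and that the file ends with a newline.
    """
    filesize = os.path.getsize(path)
    with open(path, "rb") as f:
        while True:
            offset = rng.randrange(filesize)
            start = line_start(f, offset, max_len)
            f.seek(offset)
            f.readline()                    # skip the rest of the line we landed in
            prev_len = f.tell() - start     # Lp: full length of that line
            line = f.readline()
            if not line:                    # landed in the last line: wrap around
                f.seek(0)
                line = f.readline()
            if rng.random() < min_len / prev_len:   # keep with probability min / Lp
                return line
```

Calling this repeatedly samples with replacement, so collecting about 1 million distinct lines would additionally need de-duplication, for example by remembering the offsets of lines already returned.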
