Read a small random sample from a big CSV file into a Python data frame

暖寄归人 2020-11-27 02:37

The CSV file that I want to read does not fit into main memory. How can I read a few (~10K) random lines of it and do some simple statistics on the selected data frame?

13 answers
  •  粉色の甜心
    2020-11-27 03:21

    No pandas!

    import random
    from os import fstat
    from sys import exit
    
    # Binary mode: Python 3 text-mode files reject nonzero relative seeks
    f = open('/usr/share/dict/words', 'rb')
    
    # Number of lines to be read
    lines_to_read = 100
    
    # Minimum and maximum bytes that will be randomly skipped
    min_bytes_to_skip = 10000
    max_bytes_to_skip = 1000000
    
    def is_EOF():
        return f.tell() >= fstat(f.fileno()).st_size
    
    # To accumulate the read lines
    sampled_lines = []
    
    for n in range(lines_to_read):
        bytes_to_skip = random.randint(min_bytes_to_skip, max_bytes_to_skip)
        f.seek(bytes_to_skip, 1)
        # After skipping "bytes_to_skip" bytes, we may land in the middle of a line
        # Skip the rest of the current line
        f.readline()
        if not is_EOF():
            # Decoding assumes UTF-8; change it if the file uses another encoding
            sampled_lines.append(f.readline().decode('utf-8'))
        else:
            # Go to the beginning of the file ...
            f.seek(0, 0)
            # ... skip bytes again, and discard the partial line
            f.seek(bytes_to_skip, 1)
            f.readline()
            # If it has reached the EOF again
            if is_EOF():
                print("You have skipped more bytes than your file has")
                print("Reduce the values of:")
                print("   min_bytes_to_skip")
                print("   max_bytes_to_skip")
                exit(1)
            else:
                sampled_lines.append(f.readline().decode('utf-8'))
    
    f.close()
    print(sampled_lines)
    

    You'll end up with a sampled_lines list. What kind of statistics do you mean?
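
If "simple statistics" means something like the mean or median of a numeric column, here is a minimal sketch that stays pandas-free by parsing the sampled lines with the standard `csv` module. The helper name `column_stats`, the three-column layout and the example rows are hypothetical and used only for illustration. It also assumes that no field contains an embedded newline, because line-based sampling would split such a record.

```python
import csv
import statistics


def column_stats(lines, col):
    """Return (mean, median) of one numeric column from raw CSV lines.

    `lines` is a list of strings such as the sampled_lines list above;
    `col` is the zero-based index of the column to summarise.
    """
    # csv.reader accepts any iterable of strings, so a list works directly.
    rows = csv.reader(lines)
    # Skip blank lines, which csv.reader yields as empty rows.
    values = [float(row[col]) for row in rows if row]
    return statistics.mean(values), statistics.median(values)


# Hypothetical sample: name, age, weight
example = ["alice,30,55.5\n", "bob,25,70.0\n", "carol,35,62.5\n"]
mean_age, median_age = column_stats(example, 1)
print(mean_age, median_age)
```

Keep in mind that skipping random byte counts does not give a uniform sample: a line that follows a long line is more likely to be picked, so the statistics are only a rough estimate.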
