Read a small random sample from a big CSV file into a Python data frame

暖寄归人 2020-11-27 02:37

The CSV file that I want to read does not fit into main memory. How can I read a few (~10K) random lines of it and do some simple statistics on the selected data frame?

13 answers
  •  生来不讨喜
    2020-11-27 03:24

    The following code keeps the header row and reads a random sample of the remaining lines:

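    import pandas as pd
    import numpy as np
    
    filename = 'hugedatafile.csv'
    nlinesfile = 10000000       # number of data lines in the file (header not counted)
    nlinesrandomsample = 10000  # number of lines to keep
    # Pick the line numbers to skip; line 0 (the header) is never skipped
    lines2skip = np.random.choice(np.arange(1, nlinesfile + 1),
                                  nlinesfile - nlinesrandomsample, replace=False)
    df = pd.read_csv(filename, skiprows=lines2skip)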
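
The snippet above needs the number of lines in the file up front. If that number is not known, one option is to count the lines in a first pass and then choose which rows to keep. The following is only a sketch of that idea: `sample_csv` and its parameters are hypothetical names, and it assumes every record sits on a single physical line (no quoted fields containing newlines).

```python
import random

import pandas as pd


def sample_csv(path, n_sample, seed=None):
    """Read the header plus n_sample randomly chosen data rows from a CSV file."""
    # First pass: count physical lines without loading the file into memory.
    with open(path, "rb") as f:
        n_rows = max(sum(1 for _ in f) - 1, 0)  # subtract the header line

    # Choose the line numbers to keep; data rows are numbered 1..n_rows.
    rng = random.Random(seed)
    keep = set(rng.sample(range(1, n_rows + 1), min(n_sample, n_rows)))

    # Second pass: a callable skiprows returns True for lines to skip.
    # Line 0 (the header) is always kept.
    return pd.read_csv(path, skiprows=lambda i: i > 0 and i not in keep)
```

With the sample loaded, something like `sample_csv('hugedatafile.csv', 10000).describe()` gives basic summary statistics. Keeping a set of about 10K line numbers uses far less memory than building an array of millions of line numbers to skip, at the cost of reading the file twice.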
    
