Read a small random sample from a big CSV file into a Python data frame

暖寄归人 2020-11-27 02:37

The CSV file that I want to read does not fit into main memory. How can I read a few (~10K) random lines of it and do some simple statistics on the selected data frame?

13 answers
  •  再見小時候
    2020-11-27 03:37

    @dlm's answer is great, but since pandas v0.20.0 skiprows also accepts a callable. The callable receives the row index as its argument and returns True for every row that should be skipped.
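
To make the callable's contract concrete, here is a minimal sketch on an in-memory CSV. The sample data and the skip helper are made up for illustration; row index 0 is the header line, and data rows start at index 1.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a real file.
csv_text = "x,y\n1,a\n2,b\n3,c\n4,d\n"


def skip(i):
    # i is the 0-based row index in the file; returning True skips that row.
    return i in (2, 4)


df = pd.read_csv(io.StringIO(csv_text), header=0, skiprows=skip)
print(df)  # only the rows at file indices 1 and 3 remain
```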

    If you can specify what fraction of the lines you want, rather than an exact number of lines, you don't need to know the file size and you only need to read through the file once. The number of rows returned is then itself random, about p times the number of lines. Assuming a header on the first row:

    import random

    import pandas as pd

    filename = "data.csv"  # placeholder: path to your CSV file
    p = 0.01  # keep about 1% of the lines
    # Row 0 is the header and is always kept; every other row is skipped
    # whenever a uniform draw from [0, 1) is greater than p.
    df = pd.read_csv(
        filename,
        header=0,
        skiprows=lambda i: i > 0 and random.random() > p,
    )
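
If an exact number of rows is needed (the question asks for about 10K), one possible variation is to count the lines in a first pass and then keep a fixed-size random set of row indices. This is only a sketch, not part of the answer above: sample_rows, its parameters, and the throwaway demo file are hypothetical names, and the line count assumes one record per physical line (no quoted fields containing newlines, no trailing blank lines).

```python
import os
import random
import tempfile

import pandas as pd


def sample_rows(path, k, seed=None):
    """Read k random data rows (plus the header) from a CSV file."""
    # First pass: count data rows without loading them (header excluded).
    with open(path) as f:
        n_rows = max(0, sum(1 for _ in f) - 1)
    k = min(k, n_rows)
    rng = random.Random(seed)
    # Data rows have file indices 1..n_rows; index 0 is the header.
    keep = set(rng.sample(range(1, n_rows + 1), k))
    keep.add(0)
    # Second pass: skip every row whose index was not chosen.
    return pd.read_csv(path, header=0, skiprows=lambda i: i not in keep)


# Demo on a throwaway file with 1,000 data rows.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("id,value\n")
    for i in range(1000):
        tmp.write(f"{i},{i * 2}\n")

df = sample_rows(tmp.name, k=10, seed=0)
os.remove(tmp.name)
print(len(df))  # 10
```

The trade-off is a second pass over the file, but only the line count and the set of chosen indices are held in memory, never the full data.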
    

    Or, if you want to take every nth line:

    n = 100  # every 100th line = 1% of the lines
    df = pd.read_csv(filename, header=0, skiprows=lambda i: i % n != 0)
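
A small check of which rows the every-nth rule keeps, using a made-up in-memory CSV where the data row at file index i holds the value i. The header survives because its index is 0 and 0 % n == 0. Unlike the random version, this selection is deterministic, so it can be biased if the file has a periodic pattern.

```python
import io

import pandas as pd

# Hypothetical data: header at index 0, then values 1..10 at indices 1..10.
csv_text = "v\n" + "".join(f"{i}\n" for i in range(1, 11))

n = 3  # keep every 3rd line
df = pd.read_csv(io.StringIO(csv_text), header=0, skiprows=lambda i: i % n != 0)
print(df["v"].tolist())  # [3, 6, 9]
```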
    
