Random Sample of a subset of a dataframe in Pandas

北荒 2020-12-10 23:58

Say I have a DataFrame with 100,000 entries and want to split it into 100 sections of 1,000 entries.

How do I take a random sample of, say, size 50 from just one of the

3 Answers
  • 2020-12-11 00:34

    This is a nice place for recursion.

    import numpy as np


    def main2():
        rows = 8  # say you have 8 rows; for real data use len(df)
        rands = []
        for i in range(rows):
            gen = fun(rands, rows)
            rands.append(gen)
        print(rands)  # a random ordering of the row positions 0..rows-1


    def fun(rands, rows):
        # draw a position in [0, rows); retry recursively if it was already drawn
        gen = np.random.randint(0, rows)
        if gen in rands:
            return fun(rands, rows)
        return gen


    if __name__ == "__main__":
        main2()
    

    Example output (the order varies from run to run): [6, 0, 7, 1, 3, 5, 4, 2]
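
    The recursive draw above needs more and more retries as `rands` fills up. As a hedged alternative (a swapped-in technique, not the approach this answer uses), NumPy's `Generator.permutation` gives the same kind of non-repeating random ordering in a single call. The helper name `shuffled_positions` and the small `df` are made up for illustration.

```python
import numpy as np
import pandas as pd


def shuffled_positions(n_rows, seed=None):
    """Return the positions 0..n_rows-1 in random order, without repeats."""
    rng = np.random.default_rng(seed)
    return rng.permutation(n_rows)


# Hypothetical 8-row frame standing in for the real data.
df = pd.DataFrame({"A": range(8)})

order = shuffled_positions(len(df), seed=0)
shuffled = df.iloc[order]  # rows of df in the shuffled order

print(order.tolist())
```

    Passing a `seed` makes the ordering reproducible; leave it as `None` for a fresh shuffle on each run.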

  • 2020-12-11 00:39

    One solution is to use the choice function from numpy.

    Say you want 50 entries out of 1000; you can use:

    import numpy as np
    chosen_idx = np.random.choice(1000, replace=False, size=50)
    df_trimmed = df.iloc[chosen_idx]
    

    This of course ignores your block structure: it only samples from the first 1000 rows. If you want a 50-item sample from block i, for example, you can do:

    import numpy as np
    block_start_idx = 1000 * i
    chosen_idx = np.random.choice(1000, replace=False, size=50)
    df_trimmed_from_block_i = df.iloc[block_start_idx + chosen_idx]
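
    The block-offset idea can be wrapped in a small helper. This is only a sketch: the name `sample_block`, its parameters, and the example `df` are assumptions for illustration, and it assumes the frame has at least `(block + 1) * block_size` rows (otherwise `iloc` raises an `IndexError`).

```python
import numpy as np
import pandas as pd


def sample_block(df, block, block_size=1000, n=50, seed=None):
    """Pick n distinct rows from the block-th chunk of block_size rows."""
    rng = np.random.default_rng(seed)
    start = block * block_size
    # Offsets within the block, drawn without replacement.
    offsets = rng.choice(block_size, size=n, replace=False)
    return df.iloc[start + offsets]


# Hypothetical frame with 100,000 rows, i.e. 100 blocks of 1000.
df = pd.DataFrame({"value": np.arange(100_000)})

picked = sample_block(df, block=3, seed=0)
print(picked.shape)
```

    Because the offsets are positional, this works regardless of what the DataFrame's index labels look like.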
    
  • 2020-12-11 00:39

    You can use the sample method*:

    In [11]: df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], columns=["A", "B"])
    
    In [12]: df.sample(2)
    Out[12]:
       A  B
    0  1  2
    2  5  6
    
    In [13]: df.sample(2)
    Out[13]:
       A  B
    3  7  8
    0  1  2
    

    *On one of the section DataFrames.
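
    To make the footnote concrete, here is a minimal sketch of cutting out one section with `iloc` and calling `sample` on it. The variable names, the 100,000-row `df`, and the choice of section 3 are illustrative assumptions.

```python
import pandas as pd

# Hypothetical frame with 100,000 rows, split conceptually into 100 sections.
df = pd.DataFrame({"value": range(100_000)})

section_size = 1000
i = 3  # which section to sample from (0-based)

# Positional slice covering rows 3000..3999.
section = df.iloc[i * section_size:(i + 1) * section_size]

# 50 distinct rows from that section; random_state makes it reproducible.
sample = section.sample(n=50, random_state=0)
print(sample.shape)
```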

    Note: If you ask for a sample larger than the DataFrame, this will raise an error unless you sample with replacement.

    In [14]: df.sample(5)
    ValueError: Cannot take a larger sample than population when 'replace=False'
    
    In [15]: df.sample(5, replace=True)
    Out[15]:
       A  B
    0  1  2
    1  3  4
    2  5  6
    3  7  8
    1  3  4
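
    If repeated rows are not acceptable, another way to avoid the error is to cap the requested size at the number of rows available. The helper `safe_sample` below is a hypothetical name for a sketch of that idea, not part of pandas.

```python
import pandas as pd


def safe_sample(frame, n, random_state=None):
    """Sample without replacement, taking at most len(frame) rows."""
    n = min(n, len(frame))
    return frame.sample(n=n, random_state=random_state)


df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], columns=["A", "B"])

# Asking for 5 rows from a 4-row frame returns all 4 rows, shuffled.
capped = safe_sample(df, 5, random_state=0)
print(len(capped))
```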
    