pandas randomly replace k percent

轮回少年 2021-01-05 08:21

I have a simple pandas DataFrame with two columns, e.g. id and value, where value is either 0 or 1. I would like to randomly replace k percent (e.g. 10%) of the rows where value is 1 with 0.

3 Answers
  • 2021-01-05 09:06

    You can probably use numpy.random.choice:

    >>> idx = df.index[df.value == 1]
    >>> df.loc[np.random.choice(idx, size=idx.size // 10, replace=False), 'value'] = 0
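
    A minimal, self-contained sketch of this idea, assuming a toy frame with the same id and value columns. The helper name replace_fraction_of_ones, the frac argument and the seed are hypothetical and only there for illustration. The sketch uses an integer count for size and passes the column label to .loc so that the write lands in df itself.

```python
import numpy as np
import pandas as pd


def replace_fraction_of_ones(df, frac=0.1, seed=None):
    """Set `value` to 0 for int(frac * n_ones) randomly chosen rows, in place."""
    rng = np.random.default_rng(seed)
    idx = df.index[df["value"] == 1]   # labels of the rows holding a 1
    n_replace = int(frac * len(idx))   # exact count, rounded down
    chosen = rng.choice(idx, size=n_replace, replace=False)
    df.loc[chosen, "value"] = 0        # label-based write into df
    return df


# Toy data (an assumption): 60 ones followed by 40 zeros
df = pd.DataFrame({"id": range(100), "value": [1] * 60 + [0] * 40})
replace_fraction_of_ones(df, frac=0.1, seed=0)
print((df["value"] == 1).sum())
```

    With 60 ones and frac=0.1, six rows are switched to 0, so the count is exact rather than approximate.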
    
  • 2021-01-05 09:10

    pandas answer

    • use query to get filtered df with only value == 1
    • use sample(frac=.1) to take 10% of those
    • use the index of the result to assign zero

    df.loc[
        df.query('value == 1').sample(frac=.1).index,
        'value'
    ] = 0
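
    The three steps above can be sketched for an arbitrary k as follows. The toy data, the variable k and the seed are assumptions made for the example; random_state is an existing parameter of DataFrame.sample and is used here only to make the run repeatable.

```python
import pandas as pd

# Toy data (an assumption): 100 ones and 100 zeros, alternating
df = pd.DataFrame({"id": range(200), "value": [1, 0] * 100})

k = 25  # percent of the 1s to replace; hypothetical value

# Filter to the 1s, sample k percent of them, keep only their index labels
picked = df.query("value == 1").sample(frac=k / 100, random_state=42).index

# Assign through .loc with the column label so df itself is modified
df.loc[picked, "value"] = 0

print(len(picked), int(df["value"].sum()))
```

    Because sample draws without replacement by default, the picked labels are unique and the number of replaced rows is fixed by frac.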
    

    alternative numpy answer

    • get boolean array of where df['value'] is 1
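    • assign a random array in which each entry is 0 with probability 0.1 and 1 with probability 0.9 (so roughly, not exactly, 10% of the ones are replaced)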

    v = df.value.values == 1
    df.loc[v, 'value'] = np.random.choice((0, 1), v.sum(), p=(.1, .9))
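
    A small sketch, under assumed toy data, of what this variant does: every 1 is redrawn independently, so the replaced share only hovers around 10%. The seed and the frame of 10,000 ones are assumptions for the illustration, and np.random.default_rng is swapped in for the legacy np.random.choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
df = pd.DataFrame({"id": range(10_000), "value": 1})  # all ones (an assumption)

mask = df["value"].to_numpy() == 1
# Each 1 is independently kept with probability 0.9 and zeroed with probability 0.1
df.loc[mask, "value"] = rng.choice((0, 1), size=mask.sum(), p=(0.1, 0.9))

replaced_share = 1 - df["value"].mean()
print(f"share replaced: {replaced_share:.3f}")  # near 0.1, usually not exactly 0.1
```

    If an exact count is required, sampling without replacement, as in the other answers, is the safer choice.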
    
  • 2021-01-05 09:10

    Here's a NumPy approach with np.random.choice -

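    a = df.value.values  # array of the value column; the write below only reaches df if this is a writable view (not guaranteed, e.g. with pandas Copy-on-Write)
    idx = np.flatnonzero(a)  # positions of the nonzero entries
    
    # Finally, pick a unique 10% of those positions and set them to 0
    a[np.random.choice(idx, size=int(0.1 * len(idx)), replace=False)] = 0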
    

    Sample run -

    In [237]: df = pd.DataFrame(np.random.randint(0,2,(100,2)),columns=['id','value'])
    
    In [238]: (df.value==1).sum() # Original Count of 1s in df.value column
    Out[238]: 53
    
    In [239]: a = df.value.values
    
    In [240]: idx = np.flatnonzero(a)
    
    In [241]: a[np.random.choice(idx,size=int(0.1*len(idx)),replace=0)] = 0
    
    In [242]: (df.value==1).sum() # New count of 1s in df.value column
    Out[242]: 48
    

    Alternatively, a more pandas-oriented approach -

    idx = np.flatnonzero(df['value'])  # integer positions, not index labels
    df.iloc[np.random.choice(idx, size=int(0.1 * len(idx)), replace=False), df.columns.get_loc('value')] = 0
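
    np.flatnonzero returns integer positions rather than index labels, which matters as soon as the frame does not use the default 0..n-1 index. A minimal sketch with a positional write through iloc, assuming a toy frame with a string index (the data and seed are hypothetical):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # seed chosen arbitrarily for repeatability

# Non-default string index, so positions and labels differ
df = pd.DataFrame(
    {"id": range(50), "value": [1] * 40 + [0] * 10},
    index=[f"row{i}" for i in range(50)],
)

pos = np.flatnonzero(df["value"].to_numpy())      # integer positions of the 1s
chosen = rng.choice(pos, size=int(0.1 * len(pos)), replace=False)
df.iloc[chosen, df.columns.get_loc("value")] = 0  # positional write into df

print(int(df["value"].sum()))
```

    Writing through iloc also avoids depending on whether .values hands back a view of the underlying data.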
    

    Runtime test

    All approaches posted thus far -

    def f1(df):  #@piRSquared's soln1
        df.loc[df.query('value == 1').sample(frac=.1).index,'value'] = 0
    
    def f2(df):  #@piRSquared's soln2
        v = df.value.values == 1
        df.loc[v, 'value'] = np.random.choice((0, 1), v.sum(), p=(.1, .9))
    
    def f3(df): #@Roman Pekar's soln (integer size, write via .loc with the column label)
        idx = df.index[df.value==1]
        df.loc[np.random.choice(idx, size=idx.size // 10, replace=False), 'value'] = 0
    
    def f4(df): #@Mine soln1
        a = df.value.values
        idx = np.flatnonzero(a)
        a[np.random.choice(idx,size=int(0.1*len(idx)),replace=0)] = 0
    
    def f5(df): #@Mine soln2 (.iloc in place of the removed .ix)
        idx = np.flatnonzero(df['value'])
        df.iloc[np.random.choice(idx,size=int(0.1*len(idx)),replace=False), df.columns.get_loc('value')] = 0
    

    Timings -

    In [2]: # Setup inputs
       ...: df = pd.DataFrame(np.random.randint(0,2,(10000,2)),columns=['id','value'])
       ...: df1 = df.copy()
       ...: df2 = df.copy()
       ...: df3 = df.copy()
       ...: df4 = df.copy()
       ...: df5 = df.copy()
       ...: 
    
    In [3]: # Timings
       ...: %timeit f1(df1)
       ...: %timeit f2(df2)
       ...: %timeit f3(df3)
       ...: %timeit f4(df4)
       ...: %timeit f5(df5)
       ...: 
    100 loops, best of 3: 3.96 ms per loop
    1000 loops, best of 3: 844 µs per loop
    1000 loops, best of 3: 1.62 ms per loop
    10000 loops, best of 3: 163 µs per loop
    1000 loops, best of 3: 663 µs per loop
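
    Note that %timeit calls each function many times on the same frame, so later iterations see fewer 1s than the first. A rough sketch of a harness that gives every call a fresh copy, using the standard timeit module; the function names, repeat counts and the choice of two representative approaches are assumptions, and the absolute numbers will differ from the ones above.

```python
import timeit

import numpy as np
import pandas as pd


def replace_with_sample(df):
    # pandas approach: filter, sample 10%, assign by label
    df.loc[df.query("value == 1").sample(frac=0.1).index, "value"] = 0


def replace_with_choice(df):
    # NumPy approach: pick 10% of the positions, assign by position
    pos = np.flatnonzero(df["value"].to_numpy())
    chosen = np.random.choice(pos, size=int(0.1 * len(pos)), replace=False)
    df.iloc[chosen, df.columns.get_loc("value")] = 0


base = pd.DataFrame(np.random.randint(0, 2, (10000, 2)), columns=["id", "value"])

results = {}
for func in (replace_with_sample, replace_with_choice):
    # Copy inside the timed call so every run starts from the same data;
    # the copy cost is included, but it is the same for each function.
    timer = timeit.Timer(lambda f=func: f(base.copy()))
    results[func.__name__] = min(timer.repeat(repeat=3, number=20)) / 20

for name, seconds in results.items():
    print(f"{name}: {seconds * 1e6:.0f} us per call")
```

    Taking the minimum over the repeats mirrors the "best of 3" figure that %timeit reports above.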
    