How to replace all non-NaN entries of a dataframe with 1 and all NaN with 0

轻奢々 2021-02-01 18:07

I have a dataframe with 71 columns and 30597 rows. I want to replace all non-NaN entries with 1 and the NaN values with 0.

Initially I tried for-loop on each value of th

9 Answers
  •  情深已故
    2021-02-01 18:43

    I do a lot of data analysis and am interested in finding new/faster methods of carrying out operations. I had never come across jezrael's method, so I was curious to compare it with my usual method (i.e. replace by indexing). NOTE: This is not an answer to the OP's question, rather it is an illustration of the efficiency of jezrael's method. Since this is NOT an answer I will remove this post if people do not find it useful (and after being downvoted into oblivion!). Just leave a comment if you think I should remove it.

    I created a moderately sized dataframe and did multiple replacements using both the df.notnull().astype(int) method and simple indexing (how I would normally do this). It turns out that the latter is roughly five times slower. Just an FYI for anyone doing larger-scale replacements.

    from __future__ import division, print_function
    
    import numpy as np
    import pandas as pd
    import datetime as dt
    
    
    # create a dataframe with randomly placed NaNs
    # (shape and sample size must be integers, so use 100 and // rather than 1e2 and /)
    data = np.ones((100, 100))
    data.ravel()[np.random.choice(data.size, data.size // 10, replace=False)] = np.nan
    
    df = pd.DataFrame(data=data)
    
    trials = np.arange(100)
    
    
    d1 = dt.datetime.now()
    
    for r in trials:
        new_df = df.notnull().astype(int)
    
    print( (dt.datetime.now()-d1).total_seconds()/trials.size )
    
    
    # create a dummy copy of df.  I use a dummy copy here to prevent biasing the 
    # time trial with dataframe copies/creations within the upcoming loop
    df_dummy = df.copy()
    
    d1 = dt.datetime.now()
    
    for r in trials:
        df_dummy[df.isnull()] = 0
        df_dummy[df.isnull()==False] = 1
    
    print( (dt.datetime.now()-d1).total_seconds()/trials.size )
    

    This yields average times of 0.142 s and 0.685 s per trial, respectively. It is clear who the winner is.
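
For anyone who wants to check what the mask-based replacement actually produces before timing it, here is a minimal sketch on a tiny frame. The frame and its column names are made up for illustration, and the second variant (np.where on the same mask) is just one alternative way to write it, not something benchmarked above.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x3 frame with a few missing values (one column holds strings)
df = pd.DataFrame({
    "a": [1.5, np.nan, 3.0],
    "b": [np.nan, "x", "y"],
    "c": [np.nan, np.nan, 7.0],
})

# Variant 1: boolean mask of non-missing cells, cast to int (True -> 1, False -> 0)
via_notnull = df.notnull().astype(int)

# Variant 2: pick 1 or 0 per cell with np.where on the same mask
via_where = pd.DataFrame(
    np.where(df.notnull(), 1, 0), index=df.index, columns=df.columns
)

print(via_notnull)
```

Both variants should give the same 0/1 frame. The original df is left untouched, so the result can be kept alongside the data it was derived from.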
