variable fillna() in each column

Submitted by 主宰稳场 on 2019-12-23 21:30:08

Question


For starters, here is some artificial data fitting my problem:

import numpy as np
import pandas as pd

vsize = 100  # assumed size; the original snippet does not define vsize

df = pd.DataFrame(np.random.randint(0, 100, size=(vsize, 10)), 
                  columns = ["col_{}".format(x) for x in range(10)], 
                  index = range(0, vsize * 3, 3))

df_2 = pd.DataFrame(np.random.randint(0, 100, size=(vsize, 10)), 
                    columns = ["col_{}".format(x) for x in range(10, 20, 1)], 
                    index = range(0, vsize * 2, 2))

df = df.merge(df_2, left_index = True, right_index = True, how = 'outer')

df_tar = pd.DataFrame({"tar_1": [np.random.randint(0, 2) for x in range(vsize * 3)], 
                       "tar_2": [np.random.randint(0, 4) for x in range(vsize * 3)], 
                       "tar_3": [np.random.randint(0, 8) for x in range(vsize * 3)], 
                       "tar_4": [np.random.randint(0, 16) for x in range(vsize * 3)]})

df = df.merge(df_tar, left_index = True, right_index = True, how = 'inner')
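The NaNs in the merged frame come from the outer merge: indices present in only one frame get NaN in the other frame's columns. A minimal illustration with made-up toy frames (not the data above):

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2, 3]}, index=[0, 3, 6])
b = pd.DataFrame({"y": [10, 20, 30]}, index=[0, 2, 4])

# outer merge keeps the union of both indices
merged = a.merge(b, left_index=True, right_index=True, how="outer")
# index 0 is shared; every other row gets NaN in the missing frame's column
print(merged)
```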

Now, I would like to fill the NaN values in each column with the MEDIAN of that column's non-NaN values, plus noise added to each filled NaN. The MEDIAN should first be calculated over the values in that column that belong to the same class, as marked in column tar_4. Then, if any NaNs persist in the column (because every value in some tar_4 class was NaN, so no MEDIAN could be calculated), the same operation is repeated on the updated column (with some NaNs already filled in by the tar_4 pass), but grouping by class in the tar_3 column. Then tar_2, and tar_1.

The way I imagine it would be as follows:

  • col_1 features e.g. 6 non-NaN & 4 NaN values: [1, 2, NaN, 4, NaN, 12, 5, NaN, 1, NaN]
  • only values [1, 2, NaN, 4, NaN] belong to the same class (e.g. class 1) in tar_4, so they are pushed through NaN filling:
    • NaN value at index [2] gets filled with MEDIAN (=2) + random(-3, 3) * std error of distribution in col_1, e.g. 2 + (1 * 1.24) = 3.24
    • NaN value at index [4] gets filled with MEDIAN (=2) + random(-3, 3) * std error of distribution in col_1, e.g. 2 + (-2 * 1.24) = -0.48
  • Now col_1 has the following 8 non-NaN and 2 NaN values: [1, 2, 3.24, 4, -0.48, 12, 5, NaN, 1, NaN]
  • Column col_1 still features some NaN values, so grouping based on a common class in the tar_3 column is applied:
    • out of [1, 2, 3.24, 4, -0.48, 12, 5, NaN, 1, NaN], the values [1, 2, 3.24, 4, -0.48, 12, 5, NaN] are now in the same class, so they get processed:
    • the NaN value at index [7] gets assigned the MEDIAN of the values at indices [0-6] (=3.24) + random(-3, 3) * std error, e.g. 3.24 + (2 * 3.86) = 10.96
  • now col_1 has 9 non-NaN values and 1 NaN value: [1, 2, 3.24, 4, -0.48, 12, 5, 10.96, 1, NaN]
    • all values in col_1 belong to the same class based on the tar_2 column, so the NaN value at index [9] gets processed with the same logic as described above, ending up as 3.24 + (-1 * 4.05) = -0.81
  • col_1 now features only non-NaN values: [1, 2, 3.24, 4, -0.48, 12, 5, 10.96, 1, -0.81], and does not need to be pushed through NaN filling based on the tar_1 column.

The same logic goes through the rest of columns.

So, the expected output is: a DataFrame with its NaN values filled, in each column, based on a decreasing level of class granularity, from column tar_4 down to tar_1.
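For what it's worth, the pass-by-pass behaviour described above can be sketched with groupby().transform(), which returns the per-class MEDIAN/std aligned to every row; the toy values, class labels, and seed below are assumptions for illustration, not the question's actual frames:

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumed seed, for reproducibility only

# toy frame mirroring the walkthrough: one value column with NaNs,
# plus two class columns of decreasing granularity
df = pd.DataFrame({
    "col_1": [1, 2, np.nan, 4, np.nan, 12, 5, np.nan, 1, np.nan],
    "tar_4": [1, 1, 1, 1, 1, 2, 2, 3, 2, 3],   # class 3 contains only NaNs
    "tar_3": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
})

for tar in ["tar_4", "tar_3"]:                  # decreasing granularity
    grp = df.groupby(tar)["col_1"]
    medians = grp.transform("median")           # per-class median, row-aligned
    stds = grp.transform("std")                 # per-class std, row-aligned
    mask = df["col_1"].isna()
    noise = np.random.randint(-3, 3, size=len(df))  # one draw per row
    df.loc[mask, "col_1"] = (medians + noise * stds)[mask]
```

The NaNs in the all-NaN tar_4 class survive the first pass (their per-class median is itself NaN) and only get filled by the coarser tar_3 pass, exactly as in the walkthrough.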

I already have code that more or less achieves that, thanks to @Quang Hoang:

def min_max_check(col):
    if ((df[col].dropna() >= 0) & (df[col].dropna() <= 1.0)).all():
        return medians[col]
    elif (df[col].dropna() >= 0).all():
        return medians[col] + round(np.random.randint(low = 0, high = 3) * stds[col], 2)
    else:
        return medians[col] + round(np.random.randint(low = -3, high = 3) * stds[col], 2)


tar_list = ['tar_4', 'tar_3', 'tar_2', 'tar_1']
cols = [col for col in df.columns if col not in tar_list]
# since your dataframe may not have continuous index
idx = df.index

for tar in tar_list:
    medians = df[cols].groupby(by = df[tar]).agg('median')
    stds = df[cols].groupby(by = df[tar]).agg('std')
    df.set_index(tar, inplace=True)
    for col in cols:
        df[col] = df[col].fillna(min_max_check(col))
    df.reset_index(inplace=True)

df.index = idx

However, this fills every NaN in a column with the same MEDIAN + noise value at each granularity level, because min_max_check draws the noise once per column rather than once per NaN. How can this code be enhanced to generate a different fill value for each NaN at the tar_4, tar_3, tar_2 and tar_1 levels?


Answer 1:


One quick solution is to modify your min_max_check into a gen_noise function that draws noise for each row:

def gen_noise(col):
    num_row = len(df)

    # generate noise of the same height as our dataset
    # notice the size argument in randint
    if ((df[col].dropna() >= 0) & (df[col].dropna() <= 1.0)).all():
        noise = 0
    elif (df[col].dropna() >= 0).all():
        noise =  np.random.randint(low = 0, 
                                   high = 3, 
                                   size=num_row)
    else:
        noise =  np.random.randint(low = -3, 
                                   high = 3,
                                   size=num_row)

    # multiplication with isna() forces those at non-null values in df[col] to be 0
    return noise * df[col].isna()
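The masking trick on the last line can be seen in isolation (toy series and noise values made up for illustration): multiplying by the boolean isna() mask zeroes the noise on rows that already hold a value, so only the NaN rows are perturbed later.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan])   # hypothetical column
noise = np.array([2, -1, 3, 1])             # one draw per row

masked = noise * s.isna()   # rows that are not NaN are zeroed out
print(masked.tolist())      # [0, -1, 0, 1]
```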

And then later:

df.set_index(tar, inplace=True)

for col in cols:
    noise = gen_noise(col)
    df[col] = (df[col].fillna(medians[col])
                      .add(noise.mul(stds[col]).values)
              )

df.reset_index(inplace=True)

Note: You can modify the code further so that you generate a noise_df with the same shape as medians and stds, something like this:

for tar in tar_list:
    medians = df[cols].groupby(df[tar]).agg('median')
    stds = df[cols].groupby(df[tar]).agg('std')

    # generate noise_df here
    medians = medians + round(noise_df * stds, 2)

    df.set_index(tar, inplace=True)

    for col in cols:
        df[col] = df[col].fillna(medians[col])    

    df.reset_index(inplace=True)

df.index = idx
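For completeness, here is one way the commented "generate noise_df here" step could look, with noise_df matching the class-by-column shape of medians; the concrete medians/stds values below are made up stand-ins for the groupby results inside the loop:

```python
import numpy as np
import pandas as pd

np.random.seed(1)  # assumed seed

# stand-ins for the groupby results: one row per class, one column per feature
medians = pd.DataFrame({"col_0": [2.0, 5.0], "col_1": [3.0, 7.0]}, index=[0, 1])
stds = pd.DataFrame({"col_0": [1.5, 4.0], "col_1": [0.5, 2.0]}, index=[0, 1])

# one integer noise draw per (class, column) cell, same shape as medians
noise_df = pd.DataFrame(
    np.random.randint(-3, 3, size=medians.shape),
    index=medians.index,
    columns=medians.columns,
)

medians = medians + round(noise_df * stds, 2)
```

Note that this variant still yields one noisy median per (class, column) cell, shared by every NaN in that cell; the per-row gen_noise approach above is what varies the fill value per NaN.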


Source: https://stackoverflow.com/questions/56178297/variable-fillna-in-each-column
