Pandas - Create a table with a “dummy variable”

坚强是说给别人听的谎言 submitted on 2019-12-11 12:19:07

Question


Let's say I have this dataframe:

id | car      | sex    | income
-------------------------------
1  | European | Male   | 45000
2  | Japanese | Female | 48000
3  | American | Male   | 53000

Is there an easy way to create this one (using pandas)?

  | id | car      | choice | sex    | income
1.| 1  | European | 1      | Male   | 45000
2.| 1  | American | 0      | Male   | 45000
3.| 1  | Japanese | 0      | Male   | 45000
  | ----------------------------------------
4.| 2  | European | 0      | Female | 48000
5.| 2  | American | 0      | Female | 48000
6.| 2  | Japanese | 1      | Female | 48000
  | ----------------------------------------
7.| 3  | European | 0      | Male   | 53000
8.| 3  | American | 1      | Male   | 53000
9.| 3  | Japanese | 0      | Male   | 53000

The idea is to list every modality of the "car" variable for each person in the base, and to indicate which one that person chose through the "choice" variable. For example, in the second table, the person with id 1 chose a European car, so choice equals 1 on the line where car is European, and 0 on the lines where car is American or Japanese.

I have already coded something that does this manually (using a dictionary), but I wanted to know whether a cleaner solution exists.

(The goal is to format the data for use with asclogit/nlogit in Stata.)


Answer 1:


I believe you need:

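import pandas as pd

# Flag each person's chosen car, then index by (id, car)
df = df.assign(choice=1).set_index(['id', 'car'])
# Add a row for every (id, car) combination; the new rows are filled with NaN
df = df.reindex(pd.MultiIndex.from_product(df.index.levels, names=df.index.names))

# New rows get choice 0; the remaining columns are filled within each id
df = (df.assign(choice=df['choice'].fillna(0).astype(int))
        .groupby(level=0, group_keys=False).apply(lambda x: x.ffill().bfill())
        .reset_index())
print(df)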
   id       car     sex   income  choice
0   1  American    Male  45000.0       0
1   1  European    Male  45000.0       1
2   1  Japanese    Male  45000.0       0
3   2  American  Female  48000.0       0
4   2  European  Female  48000.0       0
5   2  Japanese  Female  48000.0       1
6   3  American    Male  53000.0       1
7   3  European    Male  53000.0       0
8   3  Japanese    Male  53000.0       0

Explanation:

1. First, create a new column filled with 1 using assign.
2. Create a MultiIndex from id and car using set_index.
3. Generate all possible combinations with MultiIndex.from_product, and create the missing rows with reindex.
4. Then replace the NaNs in the choice column with 0.
5. Finally, replace the NaNs in all other columns by forward and backward filling within each id.
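The steps above can be sketched end to end on the sample data from the question. This is only a sketch under stated assumptions: the variable names (chosen, expanded, person_cols, result) are made up for illustration, and step 5 swaps in transform("first") for the forward/backward fill, which assumes that sex and income are constant within one id.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "car": ["European", "Japanese", "American"],
    "sex": ["Male", "Female", "Male"],
    "income": [45000, 48000, 53000],
})

# Steps 1-2: flag the chosen car and index by (id, car).
chosen = df.assign(choice=1).set_index(["id", "car"])

# Step 3: build every (id, car) pair and add the missing rows (filled with NaN).
full_index = pd.MultiIndex.from_product(chosen.index.levels, names=chosen.index.names)
expanded = chosen.reindex(full_index)

# Step 4: rows that were just added carry no choice, so they get 0.
expanded["choice"] = expanded["choice"].fillna(0).astype(int)

# Step 5 (swapped-in variant): broadcast each person's first non-null value.
# Assumption: sex and income do not vary within one id.
person_cols = ["sex", "income"]
expanded[person_cols] = expanded.groupby(level="id")[person_cols].transform("first")

result = expanded.reset_index()
print(result)
```

Because reindex introduces NaN, income ends up as a float column, as in the output shown above; cast it back with astype(int) if integers are needed.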




Answer 2:


I'm not sure whether you mean random binary integers rather than 'dummy variables'. Dummy variables are usually used to represent non-numerical (categorical) values as numbers. (Dummy Variables)
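For reference, dummy variables in that usual sense can be sketched with pandas.get_dummies. The small frame below reuses the id and car columns from the question; the names dummies, wide and long_form are made up for illustration. Melting the wide indicators back to one row per (id, car) pair gives a "choice" column like the one asked for, although this sketch does not carry sex and income along.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "car": ["European", "Japanese", "American"]})

# One 0/1 indicator column per distinct value of "car" (wide form).
dummies = pd.get_dummies(df["car"], dtype=int)
wide = pd.concat([df[["id"]], dummies], axis=1)
print(wide)

# Reshape to long form: one row per (id, car) pair with a 0/1 "choice" column.
long_form = (wide.melt(id_vars="id", var_name="car", value_name="choice")
                 .sort_values(["id", "car"])
                 .reset_index(drop=True))
print(long_form)
```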

If you are trying to assign random binary integers to a new, empty column, you can use numpy.random.randint.

As an example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'], 'C': [1, 2, 3]})
   A  B  C
0  a  b  1
1  b  a  2
2  a  c  3

df['randNumCol'] = np.random.randint(0, 2, size=len(df))

   A  B  C  randNumCol
0  a  b  1           0
1  b  a  2           1
2  a  c  3           0

Passing 0 and 2 restricts the random integers to 0 and 1, because the upper bound of randint is exclusive.
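As a quick check of how the bounds behave, here is a minimal sketch. In numpy.random.randint the high argument is exclusive, so a call with bounds 0 and 1 can only return 0. The seed value and the variable names are arbitrary choices for illustration.

```python
import numpy as np

# high is exclusive: with bounds (0, 1) the only possible value is 0.
always_zero = np.random.randint(0, 1, size=1000)

# Raising the upper bound to 2 yields a mix of 0s and 1s.
legacy = np.random.randint(0, 2, size=1000)

# The newer Generator API follows the same exclusive-high convention.
rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
modern = rng.integers(0, 2, size=1000)

print(always_zero.max(), legacy.min(), legacy.max(), modern.min(), modern.max())
```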



Source: https://stackoverflow.com/questions/50551728/pandas-create-a-table-with-a-dummy-variable
