Problem
Incoming data is a list of 0+ categories:
#input data frame
import pandas as pd
df = pd.DataFrame({'categories':(list('ABC'), list('BC'), list('A'))})
  categories
0  [A, B, C]
1     [B, C]
2        [A]
I would like to convert this to a DataFrame with one column per category and a 0/1 in each cell:
#desired output
   A  B  C
0  1  1  1
1  0  1  1
2  1  0  0
Attempt
OneHotEncoder and LabelEncoder get stuck because they don't handle lists in cells. The desired result is currently achieved with nested for loops:
#get unique categories ['A','B','C']
import numpy as np
categories = np.unique(np.concatenate(df['categories']))

#make empty data frame
binary_df = pd.DataFrame(columns=[c for c in categories],
                         index=df.index)
print(binary_df)
     A    B    C
0  NaN  NaN  NaN
1  NaN  NaN  NaN
2  NaN  NaN  NaN
#fill data frame
for i in binary_df.index:
    for c in categories:
        binary_df.loc[i, c] = 1 if c in df.loc[i, 'categories'] else 0
My concern is that these loops make this an extremely inefficient way to handle a large data set (tens of categories, tens of thousands of rows or more).
Is there a way to achieve the result with built-in Numpy/Scikit functions?
Answer 1:
Solution:
pd.get_dummies(pd.DataFrame(df['categories'].tolist()).stack()).sum(level=0)
Out[98]:
   A  B  C
0  1  1  1
1  0  1  1
2  1  0  0
How it works:
pd.DataFrame(df['categories'].tolist())
Out[100]:
   0     1     2
0  A     B     C
1  B     C  None
2  A  None  None
gets the series of lists turned into a dataframe.
pd.DataFrame(df['categories'].tolist()).stack()
Out[101]:
0  0    A
   1    B
   2    C
1  0    B
   1    C
2  0    A
dtype: object
prepares for get_dummies while preserving the indices for later.
pd.get_dummies(pd.DataFrame(df['categories'].tolist()).stack())
Out[102]:
     A  B  C
0 0  1  0  0
  1  0  1  0
  2  0  0  1
1 0  0  1  0
  1  0  0  1
2 0  1  0  0
is almost there, but the inner index level is just the position of each value inside the original list, which is not needed.
So the solution above sums over level 0 of the MultiIndex to collapse it away.
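Note: newer pandas versions no longer accept the level keyword in sum (it was removed in pandas 2.0). On such versions the same level-0 reduction can be written with groupby; a minimal equivalent sketch of the one-liner above:

# same reduction on newer pandas: group the stacked dummies by the
# outer index level and sum, instead of sum(level=0)
out = (pd.get_dummies(pd.DataFrame(df['categories'].tolist()).stack())
         .groupby(level=0)
         .sum())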
Edit:
%timeit results:
On the original dataframe:
df = pd.DataFrame({'categories':(list('ABC'), list('BC'), list('A'))})
Solution provided in question:
100 loops, best of 3: 3.24 ms per loop
This solution:
100 loops, best of 3: 2.29 ms per loop
300 rows
df = pd.concat(100*[df]).reset_index(drop=True)
Solution provided in question:
1 loop, best of 3: 252 ms per loop
This solution:
100 loops, best of 3: 2.45 ms per loop
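Since the question also asks about scikit functions: scikit-learn's MultiLabelBinarizer is designed for exactly this list-of-labels input and produces the same 0/1 matrix. A minimal sketch, assuming scikit-learn is installed (not part of the timings above):

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# fit_transform takes an iterable of label lists and returns a 0/1 array;
# mlb.classes_ holds the resulting column order (['A', 'B', 'C'] here)
binary_df = pd.DataFrame(mlb.fit_transform(df['categories']),
                         columns=mlb.classes_,
                         index=df.index)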
Answer 2:
You can try appending rows one at a time: each row starts as a dictionary that maps every category to 0, which is then updated to 1 for the categories present in that row of the input dataframe.
#input data frame
df = pd.DataFrame({'categories':(list('ABC'), list('BC'), list('A'))})
print(df)
Output:
  categories
0  [A, B, C]
1     [B, C]
2        [A]
For the output dataframe:
categories = np.unique(np.concatenate(df['categories']))

#make empty data frame
binary_df = pd.DataFrame(columns=[c for c in categories],
                         index=df.index).dropna()

for index, row in df.iterrows():
    row_elements = row['categories']
    default_row = {item: 0 for item in categories}
    # update corresponding row values by updating the dictionary
    for i in row_elements:
        default_row[i] = 1
    binary_df = binary_df.append(default_row, ignore_index=True)

print(binary_df)
Output:
     A    B    C
0  1.0  1.0  1.0
1  0.0  1.0  1.0
2  1.0  0.0  0.0
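Note that DataFrame.append was removed in pandas 2.0, and appending one row at a time is slow on large frames. A minimal sketch of the same idea that collects the per-row dictionaries first and builds the frame in one call (an adaptation, not the code from the answer above):

# build one {category: 0/1} dict per input row, then construct the frame once
rows = []
for row_elements in df['categories']:
    default_row = {item: 0 for item in categories}
    for i in row_elements:
        default_row[i] = 1
    rows.append(default_row)

binary_df = pd.DataFrame(rows, index=df.index)
print(binary_df)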
Source: https://stackoverflow.com/questions/44657603/pandas-column-of-lists-to-separate-columns