问题
I have a 2-column DataFrame, column-1 corresponds to customer, column-2 corresponds to the city this customer has visited. The DataFrame looks like the following:
print(df)
customer visited_city
0 John London
1 Mary Melbourne
2 Steve Paris
3 John New_York
4 Peter New_York
5 Mary London
6 John Melbourne
7 John New_York
I would like to convert the above DataFrame into a row-vector format, such that each row represents a unique user with the row vector indicating the cities visited.
print(wide_format_df)
London Melbourne New_York Paris
John 1.0 1.0 1.0 0.0
Mary 1.0 1.0 0.0 0.0
Steve 0.0 0.0 0.0 1.0
Peter 0.0 0.0 1.0 0.0
Below is the code I used to generate the wide format. It iterates through each user one by one. I was wondering is there any more efficient way to do so?
import pandas as pd
import numpy as np
UNIQUE_CITIESS = np.sort(df['visited_city'].unique())
p = len(UNIQUE_CITIESS)
unique_customers = df['customer'].unique().tolist()
X = []
for customer in unique_customers:
x = np.zeros(p)
city_visited = np.sort(df[df['customer'] == customer]['visited_city'].unique())
visited_idx = np.searchsorted(UNIQUE_CITIESS, city_visited)
x[visited_idx] = 1
X.append(x)
wide_format_df = pd.DataFrame(np.array(X), columns=UNIQUE_CITIESS, index=unique_customers)
wide_format_df
回答1:
Please note that your question has been edited such that the answers provided no longer answer your question. They must adjust to only return 1
for John
in New York
despite the fact he's been there twice.
Option 1 pir1
I like this answer because I think it's elegant.
pd.get_dummies(df.customer).T.dot(pd.get_dummies(df.visited_city)).clip(0, 1)
London Melbourne New_York Paris
John 1 1 1 0
Mary 1 1 0 0
Peter 0 0 1 0
Steve 0 0 0 1
Option 2 pir2
This answer should be fast.
i, r = pd.factorize(df.customer.values)
j, c = pd.factorize(df.visited_city.values)
n, m = r.size, c.size
b = np.zeros((n, m), dtype=int)
b[i, j] = 1
pd.DataFrame(b, r, c).sort_index().sort_index(1)
London Melbourne New_York Paris
John 1 1 1 0
Mary 1 1 0 0
Peter 0 0 1 0
Steve 0 0 0 1
Option 3 pir3
Practical and pretty quick
df.groupby(['customer', 'visited_city']).size().unstack(fill_value=0).clip(0, 1)
visited_city London Melbourne New_York Paris
customer
John 1 1 1 0
Mary 1 1 0 0
Peter 0 0 1 0
Steve 0 0 0 1
Timing
Code Below
# Multiples of Minimum time
#
pir1 pir2 pir3 wen vai
10 1.392237 1.0 1.521555 4.337469 5.569029
30 1.445762 1.0 1.821047 5.977978 7.204843
100 1.679956 1.0 1.901502 6.685429 7.296454
300 1.568407 1.0 1.825047 5.556880 7.210672
1000 1.622137 1.0 1.613983 5.815970 5.396008
3000 1.808637 1.0 1.852953 4.159305 4.224724
10000 1.654354 1.0 1.502092 3.145032 2.950560
30000 1.555574 1.0 1.413612 2.404061 2.299856
wen = lambda d: d.pivot_table(index='customer', columns='visited_city',aggfunc=len, fill_value=0)
vai = lambda d: pd.crosstab(d.customer, d.visited_city)
pir1 = lambda d: pd.get_dummies(d.customer).T.dot(pd.get_dummies(d.visited_city)).clip(0, 1)
pir3 = lambda d: d.groupby(['customer', 'visited_city']).size().unstack(fill_value=0).clip(0, 1)
def pir2(d):
i, r = pd.factorize(d.customer.values)
j, c = pd.factorize(d.visited_city.values)
n, m = r.size, c.size
b = np.zeros((n, m), dtype=int)
b[i, j] = 1
return pd.DataFrame(b, r, c).sort_index().sort_index(1)
results = pd.DataFrame(
index=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
columns='pir1 pir2 pir3 wen vai'.split(),
dtype=float
)
for i in results.index:
d = pd.concat([df] * i, ignore_index=True)
for j in results.columns:
stmt = '{}(d)'.format(j)
setp = 'from __main__ import d, {}'.format(j)
results.at[i, j] = timeit(stmt, setp, number=10)
print((lambda r: r.div(r.min(1), 0))(results))
results.plot(loglog=True)
回答2:
You can use crosstab
pd.crosstab(df.customer, df.visited_city)
You get
visited_city London Melbourne New_York Paris
customer
John 1 1 1 0
Mary 1 1 0 0
Peter 0 0 1 0
Steve 0 0 0 1
回答3:
Or you can using
df.pivot_table(index='customer', columns='visited_city',aggfunc=len, fill_value=0)
visited_city London Melbourne New_York Paris
customer
John 1 1 1 0
Mary 1 1 0 0
Peter 0 0 1 0
Steve 0 0 0 1
来源:https://stackoverflow.com/questions/45911087/python-pandas-how-to-convert-a-list-of-pair-mappings-to-a-row-vector-format