Python Pandas: how to convert a list of pair mappings to a row-vector format?

霸气de小男生 提交于 2020-01-04 10:02:06

问题


I have a 2-column DataFrame, column-1 corresponds to customer, column-2 corresponds to the city this customer has visited. The DataFrame looks like the following:

print(df)

    customer    visited_city
0   John        London
1   Mary        Melbourne
2   Steve       Paris
3   John        New_York
4   Peter       New_York
5   Mary        London
6   John        Melbourne
7   John        New_York

I would like to convert the above DataFrame into a row-vector format, such that each row represents a unique user with the row vector indicating the cities visited.

print(wide_format_df)

          London  Melbourne  New_York  Paris
John      1.0        1.0       1.0      0.0
Mary      1.0        1.0       0.0      0.0
Steve     0.0        0.0       0.0      1.0
Peter     0.0        0.0       1.0      0.0

Below is the code I used to generate the wide format. It iterates through each user one by one. I was wondering is there any more efficient way to do so?

import pandas as pd
import numpy as np

UNIQUE_CITIESS = np.sort(df['visited_city'].unique())
p = len(UNIQUE_CITIESS)
unique_customers = df['customer'].unique().tolist()

X = []
for customer in unique_customers:
    x = np.zeros(p)    
    city_visited = np.sort(df[df['customer'] == customer]['visited_city'].unique())
    visited_idx = np.searchsorted(UNIQUE_CITIESS, city_visited)
    x[visited_idx] = 1    
    X.append(x)
wide_format_df = pd.DataFrame(np.array(X), columns=UNIQUE_CITIESS, index=unique_customers)
wide_format_df

回答1:


Please note that your question has been edited such that the answers provided no longer answer your question. They must adjust to only return 1 for John in New York despite the fact he's been there twice.

Option 1 pir1
I like this answer because I think it's elegant.

pd.get_dummies(df.customer).T.dot(pd.get_dummies(df.visited_city)).clip(0, 1)

       London  Melbourne  New_York  Paris
John        1          1         1      0
Mary        1          1         0      0
Peter       0          0         1      0
Steve       0          0         0      1

Option 2 pir2
This answer should be fast.

i, r = pd.factorize(df.customer.values)
j, c = pd.factorize(df.visited_city.values)
n, m = r.size, c.size
b = np.zeros((n, m), dtype=int)
b[i, j] = 1

pd.DataFrame(b, r, c).sort_index().sort_index(1)

       London  Melbourne  New_York  Paris
John        1          1         1      0
Mary        1          1         0      0
Peter       0          0         1      0
Steve       0          0         0      1

Option 3 pir3
Practical and pretty quick

df.groupby(['customer', 'visited_city']).size().unstack(fill_value=0).clip(0, 1)

visited_city  London  Melbourne  New_York  Paris
customer                                        
John               1          1         1      0
Mary               1          1         0      0
Peter              0          0         1      0
Steve              0          0         0      1

Timing
Code Below

# Multiples of Minimum time
#
           pir1  pir2      pir3       wen       vai
10     1.392237   1.0  1.521555  4.337469  5.569029
30     1.445762   1.0  1.821047  5.977978  7.204843
100    1.679956   1.0  1.901502  6.685429  7.296454
300    1.568407   1.0  1.825047  5.556880  7.210672
1000   1.622137   1.0  1.613983  5.815970  5.396008
3000   1.808637   1.0  1.852953  4.159305  4.224724
10000  1.654354   1.0  1.502092  3.145032  2.950560
30000  1.555574   1.0  1.413612  2.404061  2.299856

wen = lambda d: d.pivot_table(index='customer', columns='visited_city',aggfunc=len, fill_value=0)
vai = lambda d: pd.crosstab(d.customer, d.visited_city)
pir1 = lambda d: pd.get_dummies(d.customer).T.dot(pd.get_dummies(d.visited_city)).clip(0, 1)
pir3 = lambda d: d.groupby(['customer', 'visited_city']).size().unstack(fill_value=0).clip(0, 1)

def pir2(d):
    i, r = pd.factorize(d.customer.values)
    j, c = pd.factorize(d.visited_city.values)
    n, m = r.size, c.size
    b = np.zeros((n, m), dtype=int)
    b[i, j] = 1

    return pd.DataFrame(b, r, c).sort_index().sort_index(1)

results = pd.DataFrame(
    index=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    columns='pir1 pir2 pir3 wen vai'.split(),
    dtype=float
)

for i in results.index:
    d = pd.concat([df] * i, ignore_index=True)
    for j in results.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        results.at[i, j] = timeit(stmt, setp, number=10)

print((lambda r: r.div(r.min(1), 0))(results))

results.plot(loglog=True)



回答2:


You can use crosstab

pd.crosstab(df.customer, df.visited_city)

You get

visited_city    London  Melbourne   New_York    Paris
customer                
John            1       1           1           0
Mary            1       1           0           0
Peter           0       0           1           0
Steve           0       0           0           1



回答3:


Or you can using

df.pivot_table(index='customer', columns='visited_city',aggfunc=len, fill_value=0)

visited_city  London  Melbourne  New_York  Paris
customer                                        
John               1          1         1      0
Mary               1          1         0      0
Peter              0          0         1      0
Steve              0          0         0      1


来源:https://stackoverflow.com/questions/45911087/python-pandas-how-to-convert-a-list-of-pair-mappings-to-a-row-vector-format

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!