Update Pandas Cells based on Column Values and Other Columns

Posted by 廉价感情 on 2019-12-10 18:50:58

Question


I am looking to update many columns based on the values in one column; this is easy with a loop but takes far too long for my application when there are many columns and many rows. What is the most elegant way to get the desired counts for each letter?

Desired Output:

   Things         count_A     count_B    count_C     count_D
['A','B','C']         1            1         1          0
['A','A','A']         3            0         0          0
['B','A']             1            1         0          0
['D','D']             0            0         0          2

Answer 1:


The most elegant is definitely the CountVectorizer from sklearn.

I'll show you how it works first, then I'll do everything in one line, so you can see how elegant it is.

First, we'll do it step by step:

Let's create some data:

raw = ['ABC', 'AAA', 'BA', 'DD']

things = [list(s) for s in raw]

Then import the packages and initialize the CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

cv = CountVectorizer(tokenizer=lambda doc: doc, lowercase=False)

Next we generate a matrix of counts

matrix = cv.fit_transform(things)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
names = ["count_"+n for n in cv.get_feature_names_out()]

And save as a data frame

df = pd.DataFrame(data=matrix.toarray(), columns=names, index=raw)

Generating a data frame like this:

     count_A  count_B  count_C  count_D
ABC        1        1        1        0
AAA        3        0        0        0
BA         1        1        0        0
DD         0        0        0        2

Elegant version:

Everything above in one line

df = pd.DataFrame(data=cv.fit_transform(things).toarray(), columns=["count_"+n for n in cv.get_feature_names_out()], index=raw)

Timing:

You mentioned that you're working with a rather large dataset, so I used the %%timeit magic in IPython/Jupyter to give a time estimate.

Previous response by @piRSquared (which otherwise looks very good!)

pd.concat([s, s.apply(lambda x: pd.Series(x).value_counts()).fillna(0)], axis=1)

100 loops, best of 3: 3.27 ms per loop

My answer:

pd.DataFrame(data=cv.fit_transform(things).toarray(), columns=["count_"+n for n in cv.get_feature_names_out()], index=raw)

1000 loops, best of 3: 1.08 ms per loop

According to my testing on this small example, CountVectorizer is about 3x faster (1.08 ms vs. 3.27 ms per loop).




Answer 2:


Option 1
apply + value_counts

s = pd.Series([list('ABC'), list('AAA'), list('BA'), list('DD')], name='Things')

pd.concat([s, s.apply(lambda x: pd.Series(x).value_counts()).fillna(0)], axis=1)
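As written, option 1 gives float counts (because of fillna) under bare letter labels rather than count_A, count_B, and so on. Building a pd.Series for every row also tends to be the slow part. One possible variation, which is not from the original answer, swaps the per-row value_counts for collections.Counter and then casts and renames the columns. A minimal sketch:

```python
from collections import Counter

import pandas as pd

s = pd.Series([list('ABC'), list('AAA'), list('BA'), list('DD')], name='Things')

counts = (
    # One Counter (a dict of letter -> count) per row; missing letters become NaN.
    pd.DataFrame([Counter(x) for x in s], index=s.index)
      .fillna(0)
      .astype(int)            # back to integer counts after fillna
      .sort_index(axis=1)     # deterministic column order: A, B, C, D
      .add_prefix('count_')   # match the column names in the question
)

result = pd.concat([s, counts], axis=1)
print(result)
```

The same .astype(int).add_prefix('count_') tail could be appended to the apply + value_counts expression if you prefer to keep that approach.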

Option 2
pd.DataFrame(s.tolist()) + stack / groupby / unstack

pd.concat([s,
           pd.DataFrame(s.tolist()).stack() \
             .groupby(level=0).value_counts() \
             .unstack(fill_value=0)],
          axis=1)
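On pandas 0.25 or newer, a similar reshaping can probably be written with Series.explode in place of pd.DataFrame(s.tolist()).stack(). This is a variation on option 2 rather than the original answer's code, and the count_ prefix is added only to match the question's column names. A sketch under those assumptions:

```python
import pandas as pd

s = pd.Series([list('ABC'), list('AAA'), list('BA'), list('DD')], name='Things')

counts = (
    s.explode()                  # one row per letter, original row label repeated
     .groupby(level=0)           # group letters by the row they came from
     .value_counts()             # count each letter within its row
     .unstack(fill_value=0)      # letters become columns, absent letters get 0
     .sort_index(axis=1)         # deterministic column order: A, B, C, D
     .add_prefix('count_')
)

result = pd.concat([s, counts], axis=1)
print(result)
```

Because fill_value=0 is supplied at the unstack step, the counts stay integers and no fillna is needed.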


Source: https://stackoverflow.com/questions/39987860/update-pandas-cells-based-on-column-values-and-other-columns
