问题
I am looking to update many columns based on the values in one column; this is easy with a loop but takes far too long for my application when there are many columns and many rows. What is the most elegant way to get the desired counts for each letter?
Desired Output:
Things count_A count_B count_C count_D
['A','B','C'] 1 1 1 0
['A','A','A'] 3 0 0 0
['B','A'] 1 1 0 0
['D','D'] 0 0 0 2
回答1:
The most elegant is definitely the CountVectorizer from sklearn.
I'll show you how it works first, then I'll do everything in one line, so you can see how elegant it is.
First, we'll do it step by step:
let's create some data
raw = ['ABC', 'AAA', 'BA', 'DD']
things = [list(s) for s in raw]
Then read in some packages and initialize count vectorizer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
cv = CountVectorizer(tokenizer=lambda doc: doc, lowercase=False)
Next we generate a matrix of counts
matrix = cv.fit_transform(things)
names = ["count_"+n for n in cv.get_feature_names()]
And save as a data frame
df = pd.DataFrame(data=matrix.toarray(), columns=names, index=raw)
Generating a data frame like this:
count_A count_B count_C count_D
ABC 1 1 1 0
AAA 3 0 0 0
BA 1 1 0 0
DD 0 0 0 2
Elegant version:
Everything above in one line
df = pd.DataFrame(data=cv.fit_transform(things).toarray(), columns=["count_"+n for n in cv.get_feature_names()], index=raw)
Timing:
You mentioned that you're working with a rather large dataset, so I used the %%timeit function to give a time estimate.
Previous response by @piRSquared (which otherwise looks very good!)
pd.concat([s, s.apply(lambda x: pd.Series(x).value_counts()).fillna(0)], axis=1)
100 loops, best of 3: 3.27 ms per loop
My answer:
pd.DataFrame(data=cv.fit_transform(things).toarray(), columns=["count_"+n for n in cv.get_feature_names()], index=raw)
1000 loops, best of 3: 1.08 ms per loop
According to my testing, CountVectorizer is about 3x faster.
回答2:
option 1apply + value_counts
s = pd.Series([list('ABC'), list('AAA'), list('BA'), list('DD')], name='Things')
pd.concat([s, s.apply(lambda x: pd.Series(x).value_counts()).fillna(0)], axis=1)
option 2
use pd.DataFrame(s.tolist()) + stack / groupby / unstack
pd.concat([s,
pd.DataFrame(s.tolist()).stack() \
.groupby(level=0).value_counts() \
.unstack(fill_value=0)],
axis=1)
来源:https://stackoverflow.com/questions/39987860/update-pandas-cells-based-on-column-values-and-other-columns