问题
What is the fastest (and most efficient) way to create a new column in a DataFrame
that is a function of other rows in pandas
?
Consider the following example:
import pandas as pd
d = {
'id': [1, 2, 3, 4, 5, 6],
'word': ['cat', 'hat', 'hag', 'hog', 'dog', 'elephant']
}
pandas_df = pd.DataFrame(d)
Which yields:
id word
0 1 cat
1 2 hat
2 3 hag
3 4 hog
4 5 dog
5 6 elephant
Suppose I want to create a new column bar
containing a value that is based on the output of using a function foo
to compare the word in the current row to the other rows in the dataframe
.
def foo(word1, word2):
# do some calculation
return foobar # in this example, the return type is numeric
threshold = some_threshold
for index, _id, word in pandas_df.itertuples():
value = sum(
pandas_df[pandas_df['word'] != word].apply(
lambda x: foo(x['word'], word),
axis=1
) < threshold
)
pandas_df.loc[index, 'bar'] = value
This does produce the correct output, but it uses itertuples()
and apply()
, which is not performant for large DataFrames
.
Is there a way to vectorize (is that the correct term?) this approach? Or is there another better (faster) way to do this?
Notes / Updates:
- In the original post, I used edit distance/levenshtein distance as the
foo
function. I have changed the question in an attempt to be more generic. The idea is that the function to be applied is to compare the current rows value against all other rows and return some aggregate value.
If foo
was nltk.metrics.distance.edit_distance
and the threshold
was set to 2
(as in the original post), this produces the output below:
id word bar
0 1 cat 1.0
1 2 hat 2.0
2 3 hag 2.0
3 4 hog 2.0
4 5 dog 1.0
5 6 elephant 0.0
I have the same question for
spark dataframes
as well. I thought it made sense to split these into two posts so they are not too broad. However, I have generally found that solutions to similarpandas
problems can sometimes be modified to work forspark
.Inspired by this answer to my
spark
version of this question, I tried to use a cartesian product inpandas
. My speed tests indicate that this is slightly faster (though I suspect that may vary with the size of the data). Unfortunately, I still can't get around callingapply()
.
Example code:
from nltk.metrics.distance import edit_distance as edit_dist
pandas_df2 = pd.DataFrame(d)
i, j = np.where(np.ones((len(pandas_df2), len(pandas_df2))))
cart = pandas_df2.iloc[i].reset_index(drop=True).join(
pandas_df2.iloc[j].reset_index(drop=True), rsuffix='_r'
)
cart['dist'] = cart.apply(lambda x: edit_dist(x['word'], x['word_r']), axis=1)
pandas_df2 = (
cart[cart['dist'] < 2].groupby(['id', 'word']).count()['dist'] - 1
).reset_index()
回答1:
Let's try to analyze the problem for a second:
If you have N
rows, then you have N*N
"pairs" to consider in your similarity function. In the general case, there is no escape from evaluating all of them (sounds very rational, but I can't prove it). Hence, you have at least O(n^2) time complexity.
What you can try, however, is to play with the constant factors of that time complexity. The possible options I found are:
1. Parallelization:
Since you have some large DataFrame
, parallelizing the processing is the best obvious choice. That will gain you (almost) linear improvement in time complexity, so if you have 16 workers you will gain (almost) 16x improvement.
For example, we can partition the rows of the df
into disjoint parts, and process each part individually, then combine the results.
A very basic parallel code might look like this:
from multiprocessing import cpu_count,Pool
def work(part):
"""
Args:
part (DataFrame) : a part (collection of rows) of the whole DataFrame.
Returns:
DataFrame: the same part, with the desired property calculated and added as a new column
"""
# Note that we are using the original df (pandas_df) as a global variable
# But changes made in this function will not be global (a side effect of using multiprocessing).
for index, _id, word in part.itertuples(): # iterate over the "part" tuples
value = sum(
pandas_df[pandas_df['word'] != word].apply( # Calculate the desired function using the whole original df
lambda x: foo(x['word'], word),
axis=1
) < threshold
)
part.loc[index, 'bar'] = value
return part
# New code starts here ...
cores = cpu_count() #Number of CPU cores on your system
data_split = np.array_split(data, cores) # Split the DataFrame into parts
pool = Pool(cores) # Create a new thread pool
new_parts = pool.map(work , data_split) # apply the function `work` to each part, this will give you a list of the new parts
pool.close() # close the pool
pool.join()
new_df = pd.concat(new_parts) # Concatenate the new parts
Note: I've tried to keep the code as close to OP's code as possible. This is just a basic demonstration code and a lot of better alternatives exist.
2. "Low level" optimizations:
Another solution is to try to optimize the similarity function computation and iterating/mapping. I don't think this will gain you much speedup compared to the previous option or the next one.
3. Function-dependent pruning:
The last thing you can try are similarity-function-dependent improvements. This doesn't work in the general case, but will work very well if you can analyze the similarity function. For example:
Assuming you are using Levenshtein distance (LD
), you can observe that the distance between any two strings is >= the difference between their lengths. i.e. LD(s1,s2) >= abs(len(s1)-len(s2))
.
You can use this observation to prune the possible similar pairs to consider for evaluation. So for each string with length l1
, compare it only with strings having length l2
having abs(l1-l2) <= limit
. (limit is the maximum accepted dis-similarity, 2 in your provided example).
Another observation is that LD(s1,s2) = LD(s2,s1)
. That cuts the number of pairs by a factor of 2.
This solution may actually get you down to O(n)
time complexity (depends highly on the data).
Why? you may ask.
That's because if we had 10^9
rows, but on average we have only 10^3
rows with "close" length to each row, then we need to evaluate the function for about 10^9 * 10^3 /2
pairs, instead of 10^9 * 10^9
pairs. But that's (again) depends on the data. This approach will be useless if (in this example) you have strings all which have length 3.
回答2:
Thoughts about preprocessing (groupby)
Because you are looking for edit distance less than 2, you can first group by the length of strings. If the difference of length between groups is greater or equal to 2, you do not need to compare them. (This part is quite similar to Qusai Alothman's answer in section 3. H)
Thus, first thing is to group by the length of the string.
df["length"] = df.word.str.len()
df.groupby("length")["id", "word"]
Then, you compute the edit distance between every two consecutive group if the difference in length is less than or equal to 2. This does not directly relate to your question but I hope it would be helpful.
Potential vectorization (after groupby)
After that, you may also try to vectorize the computation by splitting each string into characters. Note that if the cost of splitting is greater than the vectorized benefits it carries, you should not do this. Or when you are creating the data frame, just create one that with characters rather than words.
We will use the answer in Pandas split dataframe column for every character to split a string into a list of characters.
# assuming we had groupped the df.
df_len_3 = pd.DataFrame({"word": ['cat', 'hat', 'hag', 'hog', 'dog']})
# turn it into chars
splitted = df_len_3.word.apply(lambda x: pd.Series(list(x)))
0 1 2
0 c a t
1 h a t
2 h a g
3 h o g
4 d o g
splitted.loc[0] == splitted # compare one word to all words
0 1 2
0 True True True -> comparing to itself is always all true.
1 False True True
2 False True False
3 False False False
4 False False False
splitted.apply(lambda x: (x == splitted).sum(axis=1).ge(len(x)-1), axis=1).sum(axis=1) - 1
0 1
1 2
2 2
3 2
4 1
dtype: int64
Explanation of splitted.apply(lambda x: (x == splitted).sum(axis=1).ge(len(x)-1), axis=1).sum(axis=1) - 1
For each row, lambda x: (x == splitted)
compares each row to the whole df just like splitted.loc[0] == splitted
above. It will generate a true/false table.
Then, we sum up the table horizontally with a .sum(axis=1)
following (x == splitted)
.
Then, we want to find out which words are similar. Thus, we apply a ge
function that checks the number of true is over a threshold. Here, we only allow difference to be 1, so it is set to be len(x)-1
.
Finally, we will have to subtract the whole array by 1 because we compare each word with itself in operation. We will want to exclude self-comparison.
Note, this vectorization part only works for within-group similarity checking. You still need to check groups with different length with the edit distance approach, I suppose.
来源:https://stackoverflow.com/questions/48174398/new-dataframe-column-as-a-generic-function-of-other-rows-pandas