New Dataframe column as a generic function of other rows (pandas)

一世执手 提交于 2019-12-18 08:46:05

问题


What is the fastest (and most efficient) way to create a new column in a DataFrame that is a function of other rows in pandas ?

Consider the following example:

import pandas as pd

d = {
    'id': [1, 2, 3, 4, 5, 6],
    'word': ['cat', 'hat', 'hag', 'hog', 'dog', 'elephant']
}
pandas_df = pd.DataFrame(d)

Which yields:

   id word
0   1  cat
1   2  hat
2   3  hag
3   4  hog
4   5  dog
5   6  elephant

Suppose I want to create a new column bar containing a value that is based on the output of using a function foo to compare the word in the current row to the other rows in the dataframe.

def foo(word1, word2):
    # do some calculation
    return foobar  # in this example, the return type is numeric

threshold = some_threshold

for index, _id, word in pandas_df.itertuples():
    value = sum(
        pandas_df[pandas_df['word'] != word].apply(
            lambda x: foo(x['word'], word),
            axis=1
        ) < threshold
    )
    pandas_df.loc[index, 'bar'] = value

This does produce the correct output, but it uses itertuples() and apply(), which is not performant for large DataFrames.

Is there a way to vectorize (is that the correct term?) this approach? Or is there another better (faster) way to do this?

Notes / Updates:

  1. In the original post, I used edit distance/levenshtein distance as the foo function. I have changed the question in an attempt to be more generic. The idea is that the function to be applied is to compare the current rows value against all other rows and return some aggregate value.

If foo was nltk.metrics.distance.edit_distance and the threshold was set to 2 (as in the original post), this produces the output below:

   id word        bar
0   1  cat        1.0
1   2  hat        2.0
2   3  hag        2.0
3   4  hog        2.0
4   5  dog        1.0
5   6  elephant   0.0
  1. I have the same question for spark dataframes as well. I thought it made sense to split these into two posts so they are not too broad. However, I have generally found that solutions to similar pandas problems can sometimes be modified to work for spark.

  2. Inspired by this answer to my spark version of this question, I tried to use a cartesian product in pandas. My speed tests indicate that this is slightly faster (though I suspect that may vary with the size of the data). Unfortunately, I still can't get around calling apply().

Example code:

from nltk.metrics.distance import edit_distance as edit_dist

pandas_df2 = pd.DataFrame(d)

i, j = np.where(np.ones((len(pandas_df2), len(pandas_df2))))
cart = pandas_df2.iloc[i].reset_index(drop=True).join(
    pandas_df2.iloc[j].reset_index(drop=True), rsuffix='_r'
)

cart['dist'] = cart.apply(lambda x: edit_dist(x['word'], x['word_r']), axis=1)
pandas_df2 = (
    cart[cart['dist'] < 2].groupby(['id', 'word']).count()['dist'] - 1
).reset_index()

回答1:


Let's try to analyze the problem for a second:

If you have N rows, then you have N*N "pairs" to consider in your similarity function. In the general case, there is no escape from evaluating all of them (sounds very rational, but I can't prove it). Hence, you have at least O(n^2) time complexity.

What you can try, however, is to play with the constant factors of that time complexity. The possible options I found are:


1. Parallelization:

Since you have some large DataFrame, parallelizing the processing is the best obvious choice. That will gain you (almost) linear improvement in time complexity, so if you have 16 workers you will gain (almost) 16x improvement.

For example, we can partition the rows of the df into disjoint parts, and process each part individually, then combine the results. A very basic parallel code might look like this:

from multiprocessing import cpu_count,Pool

def work(part):
    """
    Args:
        part (DataFrame) : a part (collection of rows) of the whole DataFrame.

    Returns:
        DataFrame: the same part, with the desired property calculated and added as a new column
    """
     # Note that we are using the original df (pandas_df) as a global variable
     # But changes made in this function will not be global (a side effect of using multiprocessing).
    for index, _id, word in part.itertuples(): # iterate over the "part" tuples
        value = sum(
            pandas_df[pandas_df['word'] != word].apply( # Calculate the desired function using the whole original df
                lambda x: foo(x['word'], word),
                axis=1
            ) < threshold
        )
        part.loc[index, 'bar'] = value
    return part

# New code starts here ...

cores = cpu_count() #Number of CPU cores on your system

data_split = np.array_split(data, cores) # Split the DataFrame into parts
pool = Pool(cores) # Create a new thread pool
new_parts = pool.map(work , data_split) # apply the function `work` to each part, this will give you a list of the new parts
pool.close() # close the pool
pool.join()
new_df = pd.concat(new_parts) # Concatenate the new parts

Note: I've tried to keep the code as close to OP's code as possible. This is just a basic demonstration code and a lot of better alternatives exist.


2. "Low level" optimizations:

Another solution is to try to optimize the similarity function computation and iterating/mapping. I don't think this will gain you much speedup compared to the previous option or the next one.


3. Function-dependent pruning:

The last thing you can try are similarity-function-dependent improvements. This doesn't work in the general case, but will work very well if you can analyze the similarity function. For example:

Assuming you are using Levenshtein distance (LD), you can observe that the distance between any two strings is >= the difference between their lengths. i.e. LD(s1,s2) >= abs(len(s1)-len(s2)) .

You can use this observation to prune the possible similar pairs to consider for evaluation. So for each string with length l1, compare it only with strings having length l2 having abs(l1-l2) <= limit. (limit is the maximum accepted dis-similarity, 2 in your provided example).

Another observation is that LD(s1,s2) = LD(s2,s1). That cuts the number of pairs by a factor of 2.

This solution may actually get you down to O(n) time complexity (depends highly on the data).
Why? you may ask.
That's because if we had 10^9 rows, but on average we have only 10^3 rows with "close" length to each row, then we need to evaluate the function for about 10^9 * 10^3 /2 pairs, instead of 10^9 * 10^9 pairs. But that's (again) depends on the data. This approach will be useless if (in this example) you have strings all which have length 3.




回答2:


Thoughts about preprocessing (groupby)

Because you are looking for edit distance less than 2, you can first group by the length of strings. If the difference of length between groups is greater or equal to 2, you do not need to compare them. (This part is quite similar to Qusai Alothman's answer in section 3. H)

Thus, first thing is to group by the length of the string.

df["length"] = df.word.str.len() 
df.groupby("length")["id", "word"]

Then, you compute the edit distance between every two consecutive group if the difference in length is less than or equal to 2. This does not directly relate to your question but I hope it would be helpful.

Potential vectorization (after groupby)

After that, you may also try to vectorize the computation by splitting each string into characters. Note that if the cost of splitting is greater than the vectorized benefits it carries, you should not do this. Or when you are creating the data frame, just create one that with characters rather than words.

We will use the answer in Pandas split dataframe column for every character to split a string into a list of characters.

# assuming we had groupped the df.
df_len_3 = pd.DataFrame({"word": ['cat', 'hat', 'hag', 'hog', 'dog']})
# turn it into chars
splitted = df_len_3.word.apply(lambda x: pd.Series(list(x)))

    0   1   2
0   c   a   t
1   h   a   t
2   h   a   g
3   h   o   g
4   d   o   g

splitted.loc[0] == splitted # compare one word to all words

    0       1       2
0   True    True    True  -> comparing to itself is always all true.
1   False   True    True
2   False   True    False
3   False   False   False
4   False   False   False


splitted.apply(lambda x: (x == splitted).sum(axis=1).ge(len(x)-1), axis=1).sum(axis=1) - 1

0    1
1    2
2    2
3    2
4    1
dtype: int64

Explanation of splitted.apply(lambda x: (x == splitted).sum(axis=1).ge(len(x)-1), axis=1).sum(axis=1) - 1

For each row, lambda x: (x == splitted) compares each row to the whole df just like splitted.loc[0] == splitted above. It will generate a true/false table.

Then, we sum up the table horizontally with a .sum(axis=1) following (x == splitted).

Then, we want to find out which words are similar. Thus, we apply a ge function that checks the number of true is over a threshold. Here, we only allow difference to be 1, so it is set to be len(x)-1.

Finally, we will have to subtract the whole array by 1 because we compare each word with itself in operation. We will want to exclude self-comparison.

Note, this vectorization part only works for within-group similarity checking. You still need to check groups with different length with the edit distance approach, I suppose.



来源:https://stackoverflow.com/questions/48174398/new-dataframe-column-as-a-generic-function-of-other-rows-pandas

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!