Compare each pair of dates in two columns in Python efficiently

Question

I have a data frame with a column of start dates and a column of end dates. I want to check the integrity of the dates by ensuring that the start date is before the end date (i.e. start_date < end_date). I have over 14,000 observations to run through.

I have data in the form of:

    Start       End
0   2008-10-01  2008-10-31  
1   2006-07-01  2006-12-31  
2   2000-05-01  2002-12-31  
3   1971-08-01  1973-12-31  
4   1969-01-01  1969-12-31

I have added a column to write the result to, even though I just want to highlight whether there are incorrect ones so I can delete them:

dates['Correct'] = " "

I have begun checking each date pair using the following, where my dataframe is called dates:

for index, row in dates.iterrows():
    if dates.Start[index] < dates.End[index]:
        dates.Correct[index] = "correct"
    elif dates.Start[index] == dates.End[index]:
        dates.Correct[index] = "same"
    elif dates.Start[index] > dates.End[index]:
        dates.Correct[index] = "incorrect"

This works, but it takes a really long time (over 15 minutes). I need more efficient code. Is there something I am doing wrong, or something I could improve?

Answer 1:

Why not just do it in a vectorized way:

is_correct = dates['Start'] < dates['End']
is_incorrect = dates['Start'] > dates['End']
is_same = ~is_correct & ~is_incorrect
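To fill the Correct column from the question with these masks, one option is numpy.select. The following is a minimal sketch, assuming the Start and End columns already hold datetime values; the sample frame is made up for illustration, with two invented rows so that all three labels appear.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: rows 0 and 1 follow the question,
# rows 2 and 3 are invented to exercise "same" and "incorrect".
dates = pd.DataFrame({
    "Start": pd.to_datetime(["2008-10-01", "2006-07-01", "2002-12-31", "1973-12-31"]),
    "End": pd.to_datetime(["2008-10-31", "2006-12-31", "2002-12-31", "1971-08-01"]),
})

# Whole-column comparisons: each produces a boolean Series in one pass.
is_correct = dates["Start"] < dates["End"]
is_incorrect = dates["Start"] > dates["End"]

# Pick a label per row; rows matching neither condition get the default.
dates["Correct"] = np.select(
    [is_correct, is_incorrect],
    ["correct", "incorrect"],
    default="same",
)

# The masks can also be used directly to inspect or drop the bad rows.
bad_rows = dates[is_incorrect]
cleaned = dates[~is_incorrect]
```

Note that a missing date (NaT) compares as False in both directions, so such a row would end up labelled "same" here; if the data can contain missing dates, it may be worth checking for them separately.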

Answer 2:

Since the list doesn't need to be compared sequentially, you can gain performance by splitting your dataset and then using multiple processes to perform the comparison simultaneously. Take a look at the multiprocessing module for help.

Answer 3:

Something like the following may be quicker:

import pandas as pd
import datetime

df = pd.DataFrame({
    'start': ["2008-10-01", "2006-07-01", "2000-05-01"],
    'end': ["2008-10-31", "2006-12-31", "2002-12-31"],
})


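def comparison_check(row):
    # With axis=1, apply passes one row at a time, so `row` is a Series.
    # Parse the date strings so the comparison is on dates, not on text.
    start = datetime.datetime.strptime(row['start'], "%Y-%m-%d").date()
    end = datetime.datetime.strptime(row['end'], "%Y-%m-%d").date()
    if start < end:
        return "correct"
    elif start == end:
        return "same"
    return "incorrect"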

In [23]: df.apply(comparison_check, axis=1)
Out[23]: 
0    correct
1    correct
2    correct
dtype: object

Timings

In [26]: %timeit df.apply(comparison_check, axis=1)
1000 loops, best of 3: 447 µs per loop

The sample frame has 3 rows, so that is roughly 447/3 = 149 µs per row. By my calculations, 14,000 rows should then take about (149 µs)*14,000 = 2.086 s, so much shorter than 15 minutes :)
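If the columns hold strings as in this sample, another option is to convert each column once with pd.to_datetime and then compare whole columns, which avoids calling a Python function per row. This is only a sketch on the same three-row frame; the names start, end and result are chosen here for illustration, and it has not been timed.

```python
import pandas as pd

df = pd.DataFrame({
    'start': ["2008-10-01", "2006-07-01", "2000-05-01"],
    'end': ["2008-10-31", "2006-12-31", "2002-12-31"],
})

# Parse each column once instead of parsing inside a per-row function.
start = pd.to_datetime(df['start'], format="%Y-%m-%d")
end = pd.to_datetime(df['end'], format="%Y-%m-%d")

# Start with every row labelled "incorrect", then overwrite the rows
# where a whole-column comparison says otherwise.
result = pd.Series("incorrect", index=df.index)
result.loc[start < end] = "correct"
result.loc[start == end] = "same"
```

On this sample every row comes out as "correct", matching the apply output above.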

Source: https://stackoverflow.com/questions/37498071/compare-each-pair-of-dates-in-two-columns-in-python-efficiently

Tags

python

date

for-loop

data-cleaning