Question
I have a pandas dataframe df with dates as strings:
Date1       Date2
2017-08-31  1970-01-01 17:35:00
2017-10-31  1970-01-01 15:00:00
2017-11-30  1970-01-01 16:30:00
2017-10-31  1970-01-01 16:00:00
2017-10-31  1970-01-01 16:12:00
What I want to do is replace each date part in the Date2 column with the corresponding date in Date1 but leave the time untouched, so the output is:
Date1       Date2
2017-08-31  2017-08-31 17:35:00
2017-10-31  2017-10-31 15:00:00
2017-11-30  2017-11-30 16:30:00
2017-10-31  2017-10-31 16:00:00
2017-10-31  2017-10-31 16:12:00
I have achieved this using pandas replace and a regex, like so:
import re
date_reg = re.compile(r"([0-9]{4}\-[0-9]{2}\-[0-9]{2})")
df['Date2'].replace(to_replace=date_reg, value=df['Date1'], inplace=True)
But this method is very slow (>10 minutes) for a dataframe with only 150k rows.
The solution from this post uses NumPy's np.where, which is much faster. How can I use np.where in this example, or is there another, more efficient way to perform this operation?
Answer 1:
One idea is:
df['Date3'] = ['{} {}'.format(a, b.split()[1]) for a, b in zip(df['Date1'], df['Date2'])]
Or:
df['Date3'] = df['Date1'] + ' ' + df['Date2'].str.split().str[1]
print (df)
        Date1                Date2                Date3
0  2017-08-31  1970-01-01 17:35:00  2017-08-31 17:35:00
1  2017-10-31  1970-01-01 15:00:00  2017-10-31 15:00:00
2  2017-11-30  1970-01-01 16:30:00  2017-11-30 16:30:00
3  2017-10-31  1970-01-01 16:00:00  2017-10-31 16:00:00
4  2017-10-31  1970-01-01 16:12:00  2017-10-31 16:12:00
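For anyone who wants to try the two string-based one-liners above end to end, here is a minimal sketch on a three-row frame. The sample frame and the names via_loop and via_str are illustrative assumptions, and it assumes every Date2 value has the form 'YYYY-MM-DD HH:MM:SS' with a single space between date and time.

```python
import pandas as pd

# Small sample frame (assumed data, mirroring the question's layout)
df = pd.DataFrame({
    "Date1": ["2017-08-31", "2017-10-31", "2017-11-30"],
    "Date2": ["1970-01-01 17:35:00", "1970-01-01 15:00:00", "1970-01-01 16:30:00"],
})

# Option 1: list comprehension, keeping only the time token of Date2
via_loop = ["{} {}".format(d, t.split()[1]) for d, t in zip(df["Date1"], df["Date2"])]

# Option 2: vectorised string methods doing the same thing
via_str = df["Date1"] + " " + df["Date2"].str.split().str[1]

df["Date3"] = via_loop
print(df)
```

Both options produce the same strings; the result stays a column of text, not datetimes.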
Or:
df['Date3'] = pd.to_datetime(df['Date1']) + pd.to_timedelta(df['Date2'].str.split().str[1])
print (df)
        Date1                Date2                Date3
0  2017-08-31  1970-01-01 17:35:00  2017-08-31 17:35:00
1  2017-10-31  1970-01-01 15:00:00  2017-10-31 15:00:00
2  2017-11-30  1970-01-01 16:30:00  2017-11-30 16:30:00
3  2017-10-31  1970-01-01 16:00:00  2017-10-31 16:00:00
4  2017-10-31  1970-01-01 16:12:00  2017-10-31 16:12:00
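Note that the to_datetime/to_timedelta variant yields a datetime column, not strings, even though the printed frame looks the same. A minimal sketch of that difference follows; the sample frame and the names time_part and as_text are illustrative assumptions, and the strftime step is only needed if the result must be text again.

```python
import pandas as pd

# Small sample frame (assumed data, mirroring the question's layout)
df = pd.DataFrame({
    "Date1": ["2017-08-31", "2017-10-31", "2017-11-30"],
    "Date2": ["1970-01-01 17:35:00", "1970-01-01 15:00:00", "1970-01-01 16:30:00"],
})

# Time-of-day token from Date2, e.g. '17:35:00'
time_part = df["Date2"].str.split().str[1]

# Date at midnight plus the time of day as a timedelta
df["Date3"] = pd.to_datetime(df["Date1"]) + pd.to_timedelta(time_part)
print(df["Date3"].dtype)  # a datetime64 dtype, not object/str

# Optional: convert back to text in the original layout
as_text = df["Date3"].dt.strftime("%Y-%m-%d %H:%M:%S")
```

Keeping the datetime dtype is usually more convenient if the column will be sorted, filtered by range, or resampled later.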
Timings:
In [302]: %timeit df['Date3'] = ['{} {}'.format(a, b.split()[1]) for a, b in zip(df['Date1'], df['Date2'])]
30.2 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [303]: %timeit df['Date3'] = df['Date1'] + ' ' + df['Date2'].str.split().str[1]
66.4 ms ± 3.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Answer 2:
Another way is to concatenate the underlying NumPy arrays:
df.Date2 = df.Date1.str[:].values + df.Date2.str[10:].values
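df.Date1.str[:].values returns the Date1 column as a NumPy array (the full slice str[:] leaves each value unchanged), and df.Date2.str[10:].values does the same for the sliced Date2 column.
str[10:] extracts the time part of Date2, including its leading space, which is then appended to the date from Date1.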
Timings:
%timeit df.d2 = df.d1.str[:].values + df.d2.str[10:].values
2.26 ms ± 82.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
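The slicing approach relies on every Date2 value starting with a 10-character date, so that index 10 is the separating space. A minimal sketch under that assumption is shown below; the sample frame is illustrative, and to_numpy(dtype=object) is swapped in for .values here so the addition is plain element-wise string concatenation regardless of the column's string dtype.

```python
import pandas as pd

# Small sample frame (assumed data, mirroring the question's layout)
df = pd.DataFrame({
    "Date1": ["2017-08-31", "2017-10-31", "2017-11-30"],
    "Date2": ["1970-01-01 17:35:00", "1970-01-01 15:00:00", "1970-01-01 16:30:00"],
})

# Characters from index 10 onward are ' HH:MM:SS' (leading space included),
# assuming Date2 is always 'YYYY-MM-DD HH:MM:SS'
time_with_space = df["Date2"].str[10:].to_numpy(dtype=object)

# Element-wise concatenation of two object arrays of Python strings
df["Date2"] = df["Date1"].to_numpy(dtype=object) + time_with_space
print(df)
```

If the date part of Date2 can vary in width, splitting on whitespace as in Answer 1 is the safer choice.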
Source: https://stackoverflow.com/questions/50583265/efficiently-replace-part-of-value-from-one-column-with-value-from-another-column