Date Difference based on matching values in two columns - Pandas

Question

I have a dataframe and am struggling to create a column derived from its other columns. Here is the problem on some sample data.

          Date  Target1      Close
0   2018-05-25  198.0090    188.580002
1   2018-05-25  197.6835    188.580002
2   2018-05-25  198.0090    188.580002
3   2018-05-29  196.6230    187.899994
4   2018-05-29  196.9800    187.899994
5   2018-05-30  197.1375    187.500000
6   2018-05-30  196.6965    187.500000
7   2018-05-30  196.8750    187.500000
8   2018-05-31  196.2135    186.869995
9   2018-05-31  196.2135    186.869995
10  2018-05-31  196.5600    186.869995
11  2018-05-31  196.7700    186.869995
12  2018-05-31  196.9275    186.869995
13  2018-05-31  196.2135    186.869995
14  2018-05-31  196.2135    186.869995
15  2018-06-01  197.2845    190.240005
16  2018-06-01  197.2845    190.240005
17  2018-06-04  201.2325    191.830002
18  2018-06-04  201.4740    191.830002

I want to create another column (called days_to_hit_target, for example) that holds, for each observation, the number of days until Close hits or crosses that observation's target.

For example, the close price on 2018-05-25 is 188.58 and the first observation's target is 198.0090. I want the date on which Close reaches that target, which it does somewhere later, on 2018-06-04. The number of days between the two dates is then stored in days_to_hit_target for the first observation.

Answer 1:

Use a combination of loc and at to find the date at which the target is hit, then subtract the dates.

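import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
df['TargetDate'] = pd.NaT  # stays NaT when the target is never hit
for i, row in df.iterrows():
    t = row['Target1']
    d = row['Date']
    # rows on a later date whose Close has reached this row's target
    targdf = df.loc[(df['Close'] >= t) & (df['Date'] > d)]
    if len(targdf) > 0:
        # earliest date on which the target is hit
        df.at[i, 'TargetDate'] = targdf['Date'].min()

# number of days from each row's date to its target date (NaN if never hit)
df['Diff'] = (df['TargetDate'] - df['Date']).dt.days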
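
The row-by-row search above can also be written without a Python loop. The following is only a sketch of one possible alternative using NumPy broadcasting, not part of the original answer. The function name days_to_hit_target is hypothetical, and the code assumes that Close is constant within a day (as in the sample) and that "hit" means a strictly later date.

```python
import numpy as np
import pandas as pd

def days_to_hit_target(df):
    """Days from each row's Date to the first later date whose Close >= Target1.

    Returns a float Series aligned with df; NaN where the target is never hit.
    """
    dates = pd.to_datetime(df["Date"])
    # One Close per trading day, in date order (assumes Close is constant within a day).
    daily = (
        pd.DataFrame({"Date": dates, "Close": df["Close"]})
        .drop_duplicates(subset="Date")
        .sort_values("Date")
    )
    day = daily["Date"].to_numpy()
    close = daily["Close"].to_numpy()

    row_date = dates.to_numpy()[:, None]        # shape (n_rows, 1)
    target = df["Target1"].to_numpy()[:, None]  # shape (n_rows, 1)

    # hit[i, j] is True when day j is after row i's date
    # and that day's Close reaches row i's target.
    hit = (day[None, :] > row_date) & (close[None, :] >= target)

    first = hit.argmax(axis=1)  # position of the first True per row (0 if there is none)
    found = hit.any(axis=1)     # distinguishes a real hit from the "no True" case
    gap = (day[first] - row_date[:, 0]) / np.timedelta64(1, "D")
    return pd.Series(np.where(found, gap, np.nan),
                     index=df.index, name="days_to_hit_target")
```

It could be used as df['days_to_hit_target'] = days_to_hit_target(df). The boolean matrix has one cell per (row, trading day) pair, so this trades memory for speed and is probably only sensible for frames of moderate size.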

Answer 2:

import pandas as pd

csv = pd.read_csv(
    'sample.csv',
    parse_dates=['Date']
)

csv.sort_values('Date', inplace=True)

def find_closest(row):

    target = row['Target1']
    date = row['Date']

    matches = csv[
        (csv['Close'] >= target) &
        (csv['Date'] > date)
    ]

    closest_date = matches['Date'].iloc[0] if not matches.empty else None

    row['days to hit target'] = (closest_date - date).days if closest_date else None

    return row


final = csv.apply(find_closest, axis=1)

It's a bit hard to test because no Close value in the sample reaches any of the targets. But the idea is simple: subset your original frame to the rows whose date is after the current row's date and whose Close is greater than or equal to the current row's Target1, then take the first entry (this is after you've sorted the frame by date using sort_values).

If the subset is empty, use None. Otherwise, use that entry's Date; days to hit target is then just the number of days between it and the current row's date.
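
Because the sample data never reaches a target, the subset logic described above is easier to check on a tiny made-up frame. The sketch below uses hypothetical values (they are not from the question), builds the frame in memory instead of reading sample.csv, and returns the day count from a helper named days_to_hit, which is also an assumption of this example.

```python
import pandas as pd

# Hypothetical data, chosen so that some targets are reached and some are not.
frame = pd.DataFrame({
    "Date": pd.to_datetime(["2018-05-25", "2018-05-29", "2018-05-30", "2018-06-04"]),
    "Target1": [12.0, 15.0, 100.0, 9.0],
    "Close": [10.0, 12.0, 15.0, 11.0],
}).sort_values("Date")

def days_to_hit(row):
    # Rows strictly after this row's date whose Close has reached this row's target.
    matches = frame[(frame["Close"] >= row["Target1"]) & (frame["Date"] > row["Date"])]
    if matches.empty:
        return None  # target never hit after this date
    # The frame is sorted by date, so the first match is the earliest hit.
    return (matches["Date"].iloc[0] - row["Date"]).days

frame["days to hit target"] = frame.apply(days_to_hit, axis=1)
```

Here the first row (target 12) is hit by the close of 12 on 2018-05-29, 4 days later, and the second row (target 15) is hit 1 day later on 2018-05-30. The third target (100) is never reached and the last row has no later dates, so both come out missing.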

Source: https://stackoverflow.com/questions/55966950/date-difference-based-on-matching-values-in-two-columns-pandas

Tags

python, python-3.x, pandas, numpy, data-science