Efficiently check if value is present in any of given ranges

猫巷女王i 2021-01-06 10:27

I have two pandas DataFrame objects:

  • A contains 'start' and 'finish' columns

  • B has c

2 Answers
  • 2021-01-06 11:10

    IIUC, you want the output to be True if the date falls inside at least one interval?

    Is an apply(lambda) efficient enough for you? (It may be slow for a big DataFrame, since it iterates over the rows of B.) If it is, you can try this:

    def in_range(date,start,finish):
        return (True in ((start < date) & (date < finish)).unique())
    
    B.date.apply(lambda x: in_range(x,A.start,A.finish))
    

    Output:

    0     True
    1    False
    2     True
    3    False
    4    False
    

    EDIT: MaxU's answer is in fact faster. Here are the timings for DataFrames of 10,000 rows each (A2 and B2):

    %timeit B2.date.apply(lambda x: in_range(x,A2.start,A2.finish))
    1 loop, best of 3: 9.82 s per loop
    
    %timeit B2.date.apply(lambda x: ((x >= A2.start) & (x <= A2.finish)).any())
    1 loop, best of 3: 7.31 s per loop
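
    Note that the two expressions timed above are not quite the same test: the first uses strict comparisons, the second inclusive ones, so they disagree for a date that sits exactly on an interval boundary. A minimal plain-Python sketch of the difference (the sample values and both helper names are hypothetical, chosen only for illustration):

```python
# Hypothetical sample intervals: [1, 3] and [30, 42].
starts = [1, 30]
finishes = [3, 42]

def in_range_strict(x):
    # Mirrors (start < date) & (date < finish): endpoints are excluded.
    return any(s < x < f for s, f in zip(starts, finishes))

def in_range_inclusive(x):
    # Mirrors (x >= start) & (x <= finish): endpoints are included.
    return any(s <= x <= f for s, f in zip(starts, finishes))

values = [2, 3, 30, 50]
strict = [in_range_strict(x) for x in values]
inclusive = [in_range_inclusive(x) for x in values]
print(strict)     # [True, False, False, False]
print(inclusive)  # [True, True, True, False]
```

    Which variant is right depends on whether the interval endpoints should count as "inside".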
    
  • 2021-01-06 11:26

    You can do it in O(n) complexity, apart from the sorting involved. The idea is to change the representation. In A, you store one row per interval. I would suggest a DataFrame that stores one row per transition (i.e. entering an interval or leaving an interval).

    import pandas as pd
    
    A = pd.DataFrame(
        data={
            'start': [1, 50, 30],
            'finish': [3, 83, 42]    
        }
    )
    
    starts = pd.DataFrame(data={'start': 1}, index=A.start.tolist())
    finishs = pd.DataFrame(data={'finish': -1}, index=A.finish.tolist())
    transitions = pd.merge(starts, finishs, how='outer', left_index=True, right_index=True).fillna(0)
    transitions
    
        start  finish
    1       1       0
    3       0      -1
    30      1       0
    42      0      -1
    50      1       0
    83      0      -1
    

    This DataFrame stores the type of transition at each date. Next, we need to know, at each date, whether we are inside an interval. This is like counting opening and closing parentheses: add 1 when an interval opens, subtract 1 when it closes, and keep a running total. You can do:

    transitions['transition'] = (transitions.pop('finish') + transitions.pop('start')).cumsum()
    transitions
    
        transition
    1            1
    3            0
    30           1
    42           0
    50           1
    83           0
    

    Here it says:

    • At 1, I'm in an interval
    • At 3, I'm not
    • In general, if the value is strictly greater than 0, it's in an interval.
    • Note that this handles overlapping intervals
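
    To see why the running total copes with overlaps, here is a small sketch in plain Python rather than pandas. The two intervals are hypothetical (one nested inside the other), and the names deltas, points and depth are illustrative only:

```python
from itertools import accumulate

# Hypothetical overlapping intervals: (5, 7) lies inside (1, 10).
intervals = [(1, 10), (5, 7)]

# +1 when an interval opens, -1 when it closes; shared points add up.
deltas = {}
for start, finish in intervals:
    deltas[start] = deltas.get(start, 0) + 1
    deltas[finish] = deltas.get(finish, 0) - 1

# Running total over the sorted transition points.
points = sorted(deltas)
depth = list(accumulate(deltas[p] for p in points))

for p, d in zip(points, depth):
    print(p, d)
# 1 1
# 5 2
# 7 1
# 10 0
```

    The total only drops back to 0 at 10, once every interval that was opened has been closed, so the value at 7 is still counted as inside.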

    Now merge with your B DataFrame and forward-fill, so that each value in B picks up the state of the most recent transition at or before it:

    B = pd.DataFrame(
        index=[31, 20, 2.5, 84, 1000]
    )
    
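    merged = pd.merge(transitions, B, how='outer', left_index=True, right_index=True)
    # Forward-fill the running count; a value before the first transition has
    # nothing to inherit, so treat it as 0 (NaN would otherwise cast to True).
    merged.ffill().fillna(0).loc[B.index].astype(bool)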
    
           transition
    31.0         True
    20.0        False
    2.5          True
    84.0        False
    1000.0      False
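
    The forward-fill lookup amounts to finding, for each value, the last transition at or before it. A minimal sketch of that step in plain Python, assuming the transition table shown earlier is copied into two lists (points, depth and in_any_interval are hypothetical names, and 0.5 is an extra test value not in the original B):

```python
from bisect import bisect_right

# Transition table from the answer: sorted points and the running count
# of open intervals from each point onward.
points = [1, 3, 30, 42, 50, 83]
depth = [1, 0, 1, 0, 1, 0]

def in_any_interval(value):
    # Index of the last transition at or before `value` (-1 if none).
    i = bisect_right(points, value) - 1
    # Before the first transition no interval can be open.
    return i >= 0 and depth[i] > 0

values = [31, 20, 2.5, 84, 1000, 0.5]
results = [in_any_interval(v) for v in values]
print(results)  # [True, False, True, False, False, False]
```

    With this representation the intervals behave as half-open, [start, finish): a value equal to a start counts as inside, while a value equal to a finish does not.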
    