Delimiting contiguous regions with values above a certain threshold in Pandas DataFrame

迷失自我 2021-01-04 23:29

I have a Pandas DataFrame of indices and values between 0 and 1, something like this:

 6  0.047033
 7  0.047650
 8  0.054067
 9  0.064767
10  0.073183
11  0.         


        
2 Answers
  •  忘掉有多难
    2021-01-05 00:03

    I think this prints what you want. It is based heavily on Joe Kington's answer (https://stackoverflow.com/a/4495197/3751373), so I guess it is appropriate to up-vote that.

    import numpy as np
    
    # from Joe Kington's answer here https://stackoverflow.com/a/4495197/3751373
    # with minor edits
    def contiguous_regions(condition):
        """Finds contiguous True regions of the boolean array "condition". Returns
        a 2D array where the first column is the start index of the region and the
        second column is the end index (exclusive: one past the last True)."""
    
        # Find the indices of changes in "condition"
        d = np.diff(condition,n=1, axis=0)
        idx, _ = d.nonzero() 
    
        # We need to start things after the change in "condition". Therefore,
        # we'll shift the index by 1 to the right. -JK
        # LB: an in-place "idx += 1" raised
        # "ValueError: output array is read-only" for me, so build a new array.
        idx = idx + 1
    
        if condition[0]:
            # If the start of condition is True prepend a 0
            idx = np.r_[0, idx]
    
        if condition[-1]:
            # If the end of condition is True, append the length of the array
            idx = np.r_[idx, condition.size] # Edit
    
        # Reshape the result into two columns
        idx.shape = (-1,2)
        return idx
    
    def main():
        import pandas as pd
        RUN_LENGTH_THRESHOLD = 5
        VALUE_THRESHOLD = 0.5
    
        np.random.seed(seed=901212)
        data = np.random.rand(500)*.5 + .35
    
        df = pd.DataFrame(data=data,columns=['values'])
    
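        # df.values is a 2D array of shape (n, 1); contiguous_regions relies on
        # that when it unpacks two index arrays from d.nonzero().
        match_bools = df.values > VALUE_THRESHOLD

        print('with boolean array')
        for start, stop in contiguous_regions(match_bools):
            if stop - start > RUN_LENGTH_THRESHOLD:
                print(start, stop)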
    
    
    
    if __name__ == '__main__':
        main()
    

    I would be surprised if there were not more elegant ways.
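
One possibly tidier route is to stay inside pandas: label each run of equal mask values with a cumulative sum of the change points, then group by that label. The sketch below is only an illustration of that idea, not part of Joe Kington's answer. The function name `runs_above` and its parameter names are made up for this example. It assumes a single column of values and keeps the same conventions as the code above: `stop` is exclusive, and a run is reported only if `stop - start` is strictly greater than the run-length threshold.

```python
import numpy as np
import pandas as pd


def runs_above(values, value_threshold, run_length_threshold):
    """Return a DataFrame of (start, stop, length) for runs above a threshold.

    Hypothetical helper for illustration. Positions are 0-based row numbers,
    "stop" is exclusive, and only runs longer than run_length_threshold
    are returned.
    """
    # Boolean mask on a clean 0..n-1 index so positions equal row numbers.
    mask = pd.Series(values).reset_index(drop=True) > value_threshold

    # A new run begins wherever the mask differs from the previous row;
    # the cumulative sum gives every run its own integer label.
    run_id = (mask != mask.shift()).cumsum()

    # Keep only the rows inside True runs and summarize each run.
    positions = pd.Series(np.arange(len(mask)))
    runs = positions[mask].groupby(run_id[mask]).agg(start="min", stop="max")

    runs["stop"] += 1  # convert the inclusive last row to an exclusive end
    runs["length"] = runs["stop"] - runs["start"]
    return runs[runs["length"] > run_length_threshold].reset_index(drop=True)


if __name__ == "__main__":
    # Small made-up sample, not the data from the question.
    sample = [0.1, 0.9, 0.8, 0.2, 0.7, 0.7, 0.7, 0.1]
    print(runs_above(sample, value_threshold=0.5, run_length_threshold=1))
```

With the DataFrame built in `main()` above, this would presumably be called as `runs_above(df['values'], VALUE_THRESHOLD, RUN_LENGTH_THRESHOLD)`. I have not compared its speed against the NumPy version on large inputs.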
