How to vectorize code with nested if and loops in Python?

问题

I have a dataframe like given below

df = pd.DataFrame({
    'subject_id' :[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2],
    'day':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
    'PEEP' :[7,5,10,10,11,11,14,14,17,17,21,21,23,23,25,25,22,20,26,26,5,7,8,8,9,9,13,13,15,15,12,12,15,15,19,19,19,22,22,15]
})
df['fake_flag'] = ''

In this operation, I am performing an operation as shown below in code. This code works fine and produces expected output but I can't use this approach for a real dataset as it has more than million records.

t1 = df['PEEP']
for i in t1.index:
   if i >=2:
      print("current value is  ", t1[i])
      print("preceding 1st (n-1) ", t1[i-1])
      print("preceding 2nd (n-2) ", t1[i-2])
         if (t1[i-1] == t1[i-2] or t1[i-2] >= t1[i-1]):
            r1_output = t1[i-2] # we get the max of these two values (t1[i-2]), it doesn't matter when it's constant(t1[i-2] or t1[i-1]) will have the same value anyway
            print("rule 1 output is ", r1_output)
            if t1[i] >= r1_output + 3:
                print("found a value for rule 2", t1[i])
                print("check for next value is same as current value", t1[i+1])
                if (t1[i]==t1[i+1]):
                    print("fake flag is being set")
                    df['fake_flag'][i] = 'fake_vac'

However, I can't apply this to real data as it has more than million records. I am learning Python and can you help me understand how to vectorize my code in Python?

You can refer this post related post to understand the logic. As I have got the logic right, I have created this post mainly to seek help in vectorizing and fastening my code

I expect my output to be like as shown below

subject_id = 1

subject_id = 2

Is there any efficient and elegant way to fasten my code operation for a million records dataset

回答1:

Not sure what's the story behind this, but you can certainly vectorize three if independently and combine them together,

con1 = t1.shift(2).ge(t1.shift(1))
con2 = t1.ge(t1.shift(2).add(3))
con3 = t1.eq(t1.shift(-1))

df['fake_flag']=np.where(con1 & con2 & con3,'fake VAC','')

Edit (Groupby SubjectID)

con = lambda x: (x.shift(2).ge(x.shift(1))) & (x.ge(x.shift(2).add(3))) & (x.eq(x.shift(-1)))

df['fake_flag'] = df.groupby('subject_id')['PEEP'].transform(con).map({True:'fake VAC',False:''})

回答2:

Does this work?

df.groupby('subject_id')\
  .rolling(3)['PEEP'].apply(lambda x: (x[-1] - x[:2].max()) >= 3, raw=True).fillna(0).astype(bool)

Output:

subject_id    
1           0     False
            1     False
            2      True
            3     False
            4     False
            5     False
            6      True
            7     False
            8      True
            9     False
            10     True
            11    False
            12    False
            13    False
            14    False
            15    False
            16    False
            17    False
            18     True
            19    False
2           20    False
            21    False
            22    False
            23    False
            24    False
            25    False
            26     True
            27    False
            28    False
            29    False
            30    False
            31    False
            32     True
            33    False
            34     True
            35    False
            36    False
            37     True
            38    False
            39    False
Name: PEEP, dtype: bool

Details:

Use groupby to break the data up using 'subject_id'
Apply rolling with a n=3 or a window size three.
Look at that last value in that window using -1 indexing and subtact the maximum of the first two values in that window using index slicing.

来源：https://stackoverflow.com/questions/57676327/how-to-vectorize-code-with-nested-if-and-loops-in-python

标签

python

python-3.x

pandas

vectorization

pandas-groupby