Question
I have a dataframe like the one below:
import pandas as pd

df = pd.DataFrame({
    'subject_id': [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2],
    'day': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
    'PEEP': [7,5,10,10,11,11,14,14,17,17,21,21,23,23,25,25,22,20,26,26,5,7,8,8,9,9,13,13,15,15,12,12,15,15,19,19,19,22,22,15]
})
df['fake_flag'] = ''
I am applying the operation shown in the code below. It works and produces the expected output, but I can't use this approach on the real dataset because it has more than a million records.
t1 = df['PEEP']
for i in t1.index:
    if i >= 2:
        print("current value is ", t1[i])
        print("preceding 1st (n-1) ", t1[i-1])
        print("preceding 2nd (n-2) ", t1[i-2])
        if (t1[i-1] == t1[i-2] or t1[i-2] >= t1[i-1]):
            # Rule 1: take the larger of the two preceding values, which is t1[i-2];
            # when they are equal, either one gives the same result.
            r1_output = t1[i-2]
            print("rule 1 output is ", r1_output)
            if t1[i] >= r1_output + 3:
                print("found a value for rule 2", t1[i])
                print("check for next value is same as current value", t1[i+1])
                if (t1[i] == t1[i+1]):
                    print("fake flag is being set")
                    # .loc avoids chained assignment, which may not write back to df
                    df.loc[i, 'fake_flag'] = 'fake_vac'
I am still learning Python. Can you help me understand how to vectorize this code?
You can refer to this related post to understand the logic. Since I already have the logic right, I created this post mainly to get help vectorizing and speeding up my code.
My expected output was shown separately for subject_id = 1 and subject_id = 2 (output not reproduced here).
Is there an efficient and elegant way to speed up this operation for a dataset with a million records?
Answer 1:
I'm not sure what the story behind this is, but you can certainly vectorize the three if conditions independently and then combine them:
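import numpy as np
con1 = t1.shift(2).ge(t1.shift(1))   # rule 1: t1[i-2] >= t1[i-1]
con2 = t1.ge(t1.shift(2).add(3))     # rule 2: t1[i] >= t1[i-2] + 3
con3 = t1.eq(t1.shift(-1))           # rule 3: t1[i] == t1[i+1]
df['fake_flag'] = np.where(con1 & con2 & con3, 'fake VAC', '')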
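To see why shift can stand in for the t1[i-1], t1[i-2], and t1[i+1] lookups in the loop, here is a small sketch that lines the shifted copies up next to the original values. The five-element series is just the first five PEEP values, and the column names (current, prev1, prev2, next) are made up for illustration.

```python
import pandas as pd

# First five PEEP values from the question, used only as a small illustration.
s = pd.Series([7, 5, 10, 10, 11])

# Each shifted copy moves the values down (positive) or up (negative),
# so row i holds the neighbours of s[i] side by side.
aligned = pd.DataFrame({
    'current': s,          # t1[i]
    'prev1': s.shift(1),   # t1[i-1]
    'prev2': s.shift(2),   # t1[i-2]
    'next': s.shift(-1),   # t1[i+1]
})
print(aligned)

# The three rules become element-wise comparisons between columns.
# Rows with a missing neighbour (NaN) compare as False.
flag = (
    aligned['prev2'].ge(aligned['prev1'])
    & aligned['current'].ge(aligned['prev2'] + 3)
    & aligned['current'].eq(aligned['next'])
)
print(flag)
```

Rows near the edges have NaN neighbours, so they can never be flagged, which matches the loop skipping i < 2.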
Edit (Groupby SubjectID)
con = lambda x: (x.shift(2).ge(x.shift(1))) & (x.ge(x.shift(2).add(3))) & (x.eq(x.shift(-1)))
df['fake_flag'] = df.groupby('subject_id')['PEEP'].transform(con).map({True:'fake VAC',False:''})
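As a sanity check, the sketch below runs a print-free version of the question's loop on the sample data and compares it with both vectorized variants. The variable names (loop_flags, vec_flags, grouped_flags) are my own, and the loop stops one row early so that t1[i+1] always exists; this is an assumption about the intended behaviour at the last row, not something the question states.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'subject_id': [1] * 20 + [2] * 20,
    'day': list(range(1, 21)) * 2,
    'PEEP': [7, 5, 10, 10, 11, 11, 14, 14, 17, 17, 21, 21, 23, 23, 25, 25, 22, 20, 26, 26,
             5, 7, 8, 8, 9, 9, 13, 13, 15, 15, 12, 12, 15, 15, 19, 19, 19, 22, 22, 15],
})
t1 = df['PEEP']

# Loop from the question with the prints removed; flags are kept as booleans.
loop_flags = np.zeros(len(t1), dtype=bool)
for i in range(2, len(t1) - 1):  # assumption: skip the last row, which has no next value
    if t1[i - 2] >= t1[i - 1] and t1[i] >= t1[i - 2] + 3 and t1[i] == t1[i + 1]:
        loop_flags[i] = True

# Ungrouped vectorized version (first snippet of this answer).
vec_flags = (
    t1.shift(2).ge(t1.shift(1))
    & t1.ge(t1.shift(2).add(3))
    & t1.eq(t1.shift(-1))
).to_numpy()

# Grouped version: shifts never cross a subject_id boundary.
grouped_flags = (
    df.groupby('subject_id')['PEEP']
      .transform(lambda x: x.shift(2).ge(x.shift(1))
                           & x.ge(x.shift(2).add(3))
                           & x.eq(x.shift(-1)))
      .astype(bool)
      .to_numpy()
)

print(np.flatnonzero(loop_flags))
print((loop_flags == vec_flags).all(), (loop_flags == grouped_flags).all())
```

On this sample the three versions agree. On real data the grouped and ungrouped versions can differ at the first two rows and the last row of each subject, because only the grouped version keeps one subject's values from being compared with another's.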
Answer 2:
Does this work?
df.groupby('subject_id')\
  .rolling(3)['PEEP'].apply(lambda x: (x[-1] - x[:2].max()) >= 3, raw=True).fillna(0).astype(bool)
Output:
subject_id
1           0     False
            1     False
            2      True
            3     False
            4     False
            5     False
            6      True
            7     False
            8      True
            9     False
            10     True
            11    False
            12    False
            13    False
            14    False
            15    False
            16    False
            17    False
            18     True
            19    False
2           20    False
            21    False
            22    False
            23    False
            24    False
            25    False
            26     True
            27    False
            28    False
            29    False
            30    False
            31    False
            32     True
            33    False
            34     True
            35    False
            36    False
            37     True
            38    False
            39    False
Name: PEEP, dtype: bool
Details:
- Use groupby to break the data up by 'subject_id'.
- Apply rolling with n=3, i.e. a window size of three.
- Look at the last value in that window using -1 indexing and subtract the maximum of the first two values in that window using index slicing.
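Because rolling(...).apply calls a Python function once per window, it may still be slow on a million rows. The sketch below is an alternative I am suggesting, not part of the answer above: it computes the same "current value minus the maximum of the previous two" test with grouped shifts and np.maximum, and checks that it matches the rolling result on the sample data. The names rolled, prev_max, and jump are hypothetical, and the lambda is wrapped in float() only to make the return type explicit.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'subject_id': [1] * 20 + [2] * 20,
    'day': list(range(1, 21)) * 2,
    'PEEP': [7, 5, 10, 10, 11, 11, 14, 14, 17, 17, 21, 21, 23, 23, 25, 25, 22, 20, 26, 26,
             5, 7, 8, 8, 9, 9, 13, 13, 15, 15, 12, 12, 15, 15, 19, 19, 19, 22, 22, 15],
})

# Rolling approach from the answer: one Python call per 3-row window.
rolled = (
    df.groupby('subject_id')['PEEP']
      .rolling(3)
      .apply(lambda x: float(x[-1] - x[:2].max() >= 3), raw=True)
      .fillna(0)
      .astype(bool)
)

# Shift-based alternative: the previous two values within each subject,
# combined element-wise. np.maximum propagates NaN, so the first two rows
# of each subject end up False, just like the incomplete rolling windows.
g = df.groupby('subject_id')['PEEP']
prev_max = np.maximum(g.shift(1), g.shift(2))
jump = (df['PEEP'] - prev_max).ge(3)

print(np.flatnonzero(jump.to_numpy()))
print((rolled.to_numpy() == jump.to_numpy()).all())
```

Both expressions only test for a jump of at least 3 over the previous two values. Neither one checks that the next value equals the current one, so that condition would need to be added separately (for example with a grouped shift(-1)) to reproduce the flag from the question.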
Source: https://stackoverflow.com/questions/57676327/how-to-vectorize-code-with-nested-if-and-loops-in-python