calculation of 90 percentile and replacement of it by median by groups in R

扶醉桌前 提交于 2019-12-04 04:54:34

问题


Here part of data.

mydat=structure(list(code = c(123L, 123L, 123L, 123L, 123L, 123L, 123L, 
123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 
123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 
123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 
123L, 123L, 123L, 123L, 123L, 123L, 123L, 222L, 222L, 222L, 222L, 
222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 
222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 
222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 
222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L), 
    item = c(234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L, 
    234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L, 
    234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L, 
    234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L, 
    234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L, 333L, 
    333L, 333L, 333L, 333L, 333L, 333L, 333L, 333L, 333L, 333L, 
    333L, 333L, 333L, 333L, 333L, 333L, 333L, 333L, 333L, 333L, 
    333L, 333L, 333L, 333L, 333L, 333L, 333L, 333L, 333L, 333L, 
    333L, 333L, 333L, 333L, 333L, 333L, 333L, 333L, 333L, 333L, 
    333L, 333L, 333L, 333L, 333L, 333L), return = c(25L, 25L, 
    21L, 37L, 23L, 27L, 19L, 7L, 16L, 12L, 33L, 24L, 6L, 14L, 
    4L, 25L, 90L, 27L, 3L, 16L, 7L, 1L, 13L, 11L, 36L, 5L, 6L, 
    14L, 11L, 41L, 11L, 6L, 4L, 11L, 3L, 6L, 21L, 41L, 28L, 30L, 
    92L, 4L, 1L, 83L, 3L, 16L, 4L, 25L, 25L, 21L, 37L, 23L, 27L, 
    19L, 7L, 16L, 12L, 33L, 24L, 6L, 14L, 4L, 25L, 90L, 27L, 
    3L, 16L, 7L, 1L, 13L, 11L, 36L, 5L, 6L, 14L, 11L, 41L, 11L, 
    6L, 4L, 11L, 3L, 6L, 21L, 41L, 28L, 30L, 92L, 4L, 1L, 83L, 
    3L, 16L, 4L), action = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("code", 
"item", "return", "action"), class = "data.frame", row.names = c(NA, 
-94L))

I have 2 group vars code+item. Here two groups:

123 234
222 333

Also i have action column. It can have only two values(category) zero(0) or one(1).

i need calculate 90 percentile by zero category of action of return column, which go before one category of action. Then i need calculate the median by zero category of action of return column, which go before one category of action. (After one we don't touch zeros)

These statistics must be calculate by 14 zeros preceding the one category of action

Then i have to find values that more 90 percentile which was calculated above, then such values must be replce by the median which was calculated.

After one category of action go zero category again for return column. For it i too have to find value that more 90 percentile which was calculeted above, then such value must be replaced by the median which was calculeted above(when calculate for 14 zeros).

Note the calculation is done by 14 zeros preceding the one category of action but replacing by median is done for all zero category of action and performing for each groups code+item

The result can be in output column.

to be more clear here desired output.

for 123+234 group the 90 perc=39,5 median=12

for 222+333 the 90 perc=39,5 median=12

 code item return action output
1   123  234     25      0     25
2   123  234     25      0     25
3   123  234     21      0     21
4   123  234     37      0     16
5   123  234     23      0     23
6   123  234     27      0     27
7   123  234     19      0     19
8   123  234      7      0      7
9   123  234     16      0     16
10  123  234     12      0     12
11  123  234     33      0     33
12  123  234     24      0     24
13  123  234      6      0      6
14  123  234     14      0     14
15  123  234      4      0      4
16  123  234     25      0     25
17  123  234     90      0     **12**
18  123  234     27      0     27
19  123  234      3      0      3
20  123  234     16      0     16
21  123  234      7      0      7
22  123  234      1      0      1
23  123  234     13      0     13
24  123  234     11      0     11
25  123  234     36      0     36
26  123  234      5      0      5
27  123  234      6      0      6
28  123  234     14      0     14
29  123  234     11      0     11
30  123  234     41      0     16
31  123  234     11      1     Na
32  123  234      6      1     Na
33  123  234      4      1     Na
34  123  234     11      1     Na
35  123  234      3      0      3
36  123  234      6      0      6
37  123  234     21      0     21
38  123  234     41      0     **12**
39  123  234     28      0     28
40  123  234     30      0     30
41  123  234     92      0     **12**
42  123  234      4      0      4
43  123  234      1      0      1
44  123  234     83      0     **12**
45  123  234      3      0      3
46  123  234     16      0     16
47  123  234      4      0      4
48  222  333     25      0     25
49  222  333     25      0     25
50  222  333     21      0     21
51  222  333     37      0     16
52  222  333     23      0     23
53  222  333     27      0     27
54  222  333     19      0     19
55  222  333      7      0      7
56  222  333     16      0     16
57  222  333     12      0     12
58  222  333     33      0     33
59  222  333     24      0     24
60  222  333      6      0      6
61  222  333     14      0     14
62  222  333      4      0      4
63  222  333     25      0     25
64  222  333     90      0     **12**
65  222  333     27      0     27
66  222  333      3      0      3
67  222  333     16      0     16
68  222  333      7      0      7
69  222  333      1      0      1
70  222  333     13      0     13
71  222  333     11      0     11
72  222  333     36      0     36
73  222  333      5      0      5
74  222  333      6      0      6
75  222  333     14      0     14
76  222  333     11      0     11
77  222  333     41      0     16
78  222  333     11      1     Na
79  222  333      6      1     Na
80  222  333      4      1     Na
81  222  333     11      1     Na
82  222  333      3      0      3
83  222  333      6      0      6
84  222  333     21      0     21
85  222  333     41      0     **12**
86  222  333     28      0     28
87  222  333     30      0     30
88  222  333     92      0     **12**
89  222  333      4      0      4
90  222  333      1      0      1
91  222  333     83      0     **12**
92  222  333      3      0      3
93  222  333     16      0     16
94  222  333      4      0      4

** i marked rows where value was replced by median.


回答1:


If I understand correctly, the OP wants to

  1. compute the 90% quantile and the median for the last 14 "zero action rows" (i.e., rows with action == 0) before the first row with action == 1, separately for each code, item group. (This implies that there is only one streak of action == 1 values per group.
  2. Copy the return values to the output for all rows with action == 0.
  3. Replace the output value if it is larger than the 90% quantile by the median in all zero action rows before the first action == 1 row in each group.

This can be solved by updating in a non-equi join with some preparations

library(data.table)
# mark the zero acton rows before the the action period
setDT(mydat)[, zero_before := cummax(action), by = .(code, item)]
# compute median and 90% quantile for that last 14 rows before each action period 
agg <- mydat[zero_before == 0, 
             quantile(tail(return, 14L), c(0.5, 0.9)) %>% 
               as.list()  %>% 
               set_names(c("med", "q90")) %>% 
               c(.(zero_before = 0)), by = .(code, item)]
agg

   code item med  q90 zero_before
1:  123  234  12 39.5           0
2:  222  333  12 39.5           0

# append output column
mydat[action == 0, output := as.double(return)][
  # replace output values greater q90 in an update non-equi join
  agg, on = .(code, item, action, return > q90), output := as.double(med)][
    # remove helper column
    , zero_before := NULL]

As mydat has 94 rows, we will show only the updated rows:

mydat[return != output]
    code item return action output
 1:  123  234     90      0     12
 2:  123  234     41      0     12
 3:  123  234     41      0     12
 4:  123  234     92      0     12
 5:  123  234     83      0     12
 6:  222  333     90      0     12
 7:  222  333     41      0     12
 8:  222  333     41      0     12
 9:  222  333     92      0     12
10:  222  333     83      0     12



回答2:


With tidyverse:

 mydat%>%
  group_by(code,item)%>%
  mutate(output=ifelse(return>quantile(return,.9) & action==0,median(return),return))


来源:https://stackoverflow.com/questions/51929941/calculation-of-90-percentile-and-replacement-of-it-by-median-by-groups-in-r

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!