Error when calculating values greater than 95% quantile using plyr

Submitted by 两盒软妹~ on 2019-12-20 06:39:16

Question


My data is structured as follows:

Individ <- data.frame(Participant = c("Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", 
                                      "Harry", "Harry", "Harry", "Harry","Harry", "Harry", "Harry", "Harry", "Paul", "Paul", "Paul", "Paul"),
                      Time = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
                      Condition = c("Placebo", "Placebo", "Placebo", "Placebo", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", 
                                    "Placebo", "Placebo", "Placebo", "Placebo", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr"),
                      Power = c(400, 250, 180, 500, 300, 450, 600, 512, 300, 500, 450, 200, 402, 210, 130, 520, 310, 451, 608, 582, 390, 570, NA, NA))

Using dplyr together with zoo's rollapply(), I compute rolling averages (window widths of 2 to 4 seconds) via the following code:

library(dplyr)
library(zoo)

for (summaryFunction in c("mean")) {
  for ( i in seq(2, 4, by = 1)) {
    tempColumn <- Individ %>%
      group_by(Participant) %>%
      transmute(rollapply(Power,
                          width = i, 
                          FUN = summaryFunction, 
                          align = "right", 
                          fill = NA, 
                          na.rm = T))
    colnames(tempColumn)[2] <- paste("Rolling", summaryFunction, as.character(i), sep = ".")
    Individ <- bind_cols(Individ, tempColumn[2])
  }
}

I now wish to calculate the top 5% of Power for each Participant across each of the rolling averages. To compute this, I use:

Output = ddply(Individ, .(Participant, Condition), summarise,
           TwoSec = Rolling.mean.2 > quantile(Rolling.mean.2, 0.95, na.rm = TRUE))

However, I end up with a column that states TRUE or FALSE. Instead, I am after the actual values that are in the top 5%. How do I do this? Is there also an easier way to iterate over each rolling mean column, by participant and condition, to find the top 5% in each?

Thank you!


Answer 1:


It's good that you already built the rolling-average table; that makes calculating the quantiles a lot easier.

Step 1: Group by Participant, Condition, Location

Individ <- data.frame(Participant = c("Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", 
                                      "Harry", "Harry", "Harry", "Harry","Harry", "Harry", "Harry", "Harry", "Paul", "Paul", "Paul", "Paul"),
                      Time = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
                      Condition = c("Placebo", "Placebo", "Placebo", "Placebo", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", 
                                    "Placebo", "Placebo", "Placebo", "Placebo", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr"),
                      Location = c("Home", "Home", "Home", "Home", "Away", "Away", "Away", "Away", "Home", "Home", "Home", "Home", 
                                   "Home", "Home", "Home", "Home", "Away", "Away", "Away", "Away", "Home", "Home", "Home", "Home"),
                      Power = c(400, 250, 180, 500, 300, 450, 600, 512, 300, 500, 450, 200, 402, 210, 130, 520, 310, 451, 608, 582, 390, 570, NA, NA))


library(dplyr)
library(zoo)
for (summaryFunction in c("mean")) {
  for ( i in seq(2, 4, by = 1)) {
    tempColumn <- Individ %>%
      group_by(Participant) %>%
      transmute(rollapply(Power,
                          width = i, 
                          FUN = summaryFunction, 
                          align = "right", 
                          fill = NA, 
                          na.rm = T))
    colnames(tempColumn)[2] <- paste("Rolling", summaryFunction, as.character(i), sep = ".")
    Individ <- bind_cols(Individ, tempColumn[2])
  }
}
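As an aside, the colnames()/bind_cols() bookkeeping after each transmute() can be avoided by naming the new column inline with tidy evaluation (a sketch of the same loop body; requires dplyr 0.7 or later, and assumes summaryFunction and i come from the surrounding loop — the helper newName is introduced here for illustration):

```r
library(dplyr)
library(zoo)

# Build the column name first, then splice it in with `:=`,
# so the rolling column arrives already named.
newName <- paste("Rolling", summaryFunction, i, sep = ".")
tempColumn <- Individ %>%
  group_by(Participant) %>%
  transmute(!!newName := rollapply(Power, width = i, FUN = summaryFunction,
                                   align = "right", fill = NA, na.rm = TRUE))
```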


Individ


     Participant  Time Condition Location Power Rolling.mean.2 Rolling.mean.3 Rolling.mean.4
        (fctr) (dbl)    (fctr)   (fctr) (dbl)          (dbl)          (dbl)          (dbl)
1         Bill     1   Placebo     Home   400             NA             NA             NA
2         Bill     2   Placebo     Home   250            325             NA             NA
3         Bill     3   Placebo     Home   180            215       276.6667             NA
4         Bill     4   Placebo     Home   500            340       310.0000          332.5
5         Bill     1      Expr     Away   300            400       326.6667          307.5
6         Bill     2      Expr     Away   450            375       416.6667          357.5
7         Bill     3      Expr     Away   600            525       450.0000          462.5
8         Bill     4      Expr     Away   512            556       520.6667          465.5
9         Bill     1      Expr     Home   300            406       470.6667          465.5
10        Bill     2      Expr     Home   500            400       437.3333          478.0

After adding all the rolling-average columns to the new Individ dataset (this version also includes a Location column, so it covers your other question as well), here's what I did to solve your problem. I'm sure there is a cleaner and more efficient way to do this, but the logic is easy to follow and it outputs correctly.

Step 2: Get Quantiles for groups

library(plyr)  # in a fresh session, load plyr before dplyr to avoid masking issues

# Replace the remaining NAs with 0 so quantile() doesn't fail.
# (Alternatively, keep the NAs and use quantile(..., na.rm = TRUE);
# note that zeros pull the group quantiles down.)
Individ[is.na(Individ)] <- 0
Top_percentiles <- ddply(Individ, 
                         c("Participant", "Condition", "Location"), 
                         summarise, 
                         Power2 = quantile(Rolling.mean.2, .95),
                         Power3 = quantile(Rolling.mean.3, .95),
                         Power4 = quantile(Rolling.mean.4, .95)
                         )

Top_percentiles

  Participant Condition Location  Power2   Power3  Power4
1        Bill      Expr     Away 551.350 510.0667 465.050
2        Bill      Expr     Home 464.650 465.6667 476.125
3        Bill   Placebo     Home 337.750 305.0000 282.625
4       Harry      Expr     Away 585.175 533.4000 485.425
5       Harry   Placebo     Home 322.150 280.7667 268.175
6        Paul      Expr     Home 556.500 556.5000 408.000

These values are the top-5% thresholds for each group under each of the rolling averages.
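To see where a number like 551.35 comes from: with its default type = 7, quantile() interpolates between sorted observations. For Bill's Expr/Away group, the Rolling.mean.2 values are 400, 375, 525, 556:

```r
x <- c(400, 375, 525, 556)  # Bill / Expr / Away, Rolling.mean.2
# With n = 4 and p = 0.95, the target position is (n - 1) * p + 1 = 3.85,
# i.e. 85% of the way from the 3rd to the 4th sorted value:
# 525 + 0.85 * (556 - 525) = 551.35
quantile(x, 0.95)
```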

Now the only thing left to do is calculate the observations in your dataset that are above each threshold.

Step 3: Match the rolling average columns with the original dataset

My idea here is to attach each group's thresholds to every matching row of Individ. Instead of matching index by index, I join the threshold table back on the three grouping keys; dplyr's left_join() keeps Individ's row order:

Individ <- left_join(Individ, Top_percentiles,
                     by = c("Participant", "Condition", "Location"))


Individ


     Participant  Time Condition Location Power Rolling.mean.2 Rolling.mean.3 Rolling.mean.4  Power2   Power3
         (fctr) (dbl)    (fctr)   (fctr) (dbl)          (dbl)          (dbl)          (dbl)   (dbl)    (dbl)
1         Bill     1   Placebo     Home   400              0         0.0000            0.0 337.750 305.0000
2         Bill     2   Placebo     Home   250            325         0.0000            0.0 337.750 305.0000
3         Bill     3   Placebo     Home   180            215       276.6667            0.0 337.750 305.0000
4         Bill     4   Placebo     Home   500            340       310.0000          332.5 337.750 305.0000
5         Bill     1      Expr     Away   300            400       326.6667          307.5 551.350 510.0667
6         Bill     2      Expr     Away   450            375       416.6667          357.5 551.350 510.0667
7         Bill     3      Expr     Away   600            525       450.0000          462.5 551.350 510.0667
8         Bill     4      Expr     Away   512            556       520.6667          465.5 551.350 510.0667
9         Bill     1      Expr     Home   300            406       470.6667          465.5 464.650 465.6667
10        Bill     2      Expr     Home   500            400       437.3333          478.0 464.650 465.6667

Every row now carries the Power2, Power3, and Power4 thresholds of its own Participant/Condition/Location group.

Step 4: Filter dataset

This should get you what you want.

Option 1: Three separate datasets

top_percentile_2sec <- Individ %>% filter(Rolling.mean.2 >= Power2)
top_percentile_3sec <- Individ %>% filter(Rolling.mean.3 >= Power3)
top_percentile_4sec <- Individ %>% filter(Rolling.mean.4 >= Power4)

Option 2: One big merged dataset

top_percentile_all_times <- Individ %>% filter(Rolling.mean.2 >= Power2 | Rolling.mean.3 >= Power3 | Rolling.mean.4 >= Power4)


top_percentile_all_times

 Participant  Time Condition Location Power Rolling.mean.2 Rolling.mean.3 Rolling.mean.4  Power2   Power3
       (fctr) (dbl)    (fctr)   (fctr) (dbl)          (dbl)          (dbl)          (dbl)   (dbl)    (dbl)
1        Bill     4   Placebo     Home   500          340.0       310.0000         332.50 337.750 305.0000
2        Bill     4      Expr     Away   512          556.0       520.6667         465.50 551.350 510.0667
3        Bill     1      Expr     Home   300          406.0       470.6667         465.50 464.650 465.6667
4        Bill     2      Expr     Home   500          400.0       437.3333         478.00 464.650 465.6667
5        Bill     3      Expr     Home   450          475.0       416.6667         440.50 464.650 465.6667
6       Harry     4   Placebo     Home   520          325.0       286.6667         315.50 322.150 280.7667
7       Harry     4      Expr     Away   582          595.0       547.0000         487.75 585.175 533.4000
8        Paul     3      Expr     Home     0          570.0       480.0000           0.00 556.500 556.5000
9        Paul     4      Expr     Home     0            0.0       570.0000         480.00 556.500 556.5000
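On the "easier way" part of your question: the separate threshold table and the join can be skipped entirely by filtering within groups in a single dplyr pipeline (a sketch, assuming the NAs have already been zeroed out as above):

```r
library(dplyr)

# Each quantile() is evaluated per group inside filter(), so every row
# is compared against its own group's 95th-percentile threshold.
top_percentile_all_times <- Individ %>%
  group_by(Participant, Condition, Location) %>%
  filter(Rolling.mean.2 >= quantile(Rolling.mean.2, 0.95) |
         Rolling.mean.3 >= quantile(Rolling.mean.3, 0.95) |
         Rolling.mean.4 >= quantile(Rolling.mean.4, 0.95)) %>%
  ungroup()
```

To iterate over many rolling-mean columns without writing each comparison out, the same idea extends with filter(if_any(starts_with("Rolling"), ...)) in recent dplyr versions.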

Below is a link that greatly helped me out.

how to calculate 95th percentile of values with grouping variable in R or Excel

Did this solve your problem from the other post as well?



Source: https://stackoverflow.com/questions/35692701/error-when-calculating-values-greater-than-95-quantile-using-plyr
