R sum consecutive duplicate rows and remove all but first

Question

I am stuck on a probably simple question: how do I sum consecutive duplicate rows and remove all but the first row? And, if there is an NA between two duplicates (such as 2, NA, 2), I also want to sum them and remove all but the first entry. Here is my sample data:

ia <- c(1, 1, 2, NA, 2, 1, 1, 1, 1, 2, 1, 2)
time <- c(4.5, 2.4, 3.6, 1.5, 1.2, 4.9, 6.4, 4.4, 4.7, 7.3, 2.3, 4.3)
a <- data.frame(ia, time)

sample output

Now I want to 1.) sum the "time" column over consecutive equal values of ia, i.e., sum the times if the number 1 occurs twice or more in a row. In my case, the first and second rows of the time column are summed to 4.5 + 2.4 = 6.9.

2.) if there is an NA between two equal numbers in the ia column (i.e., ia = 2, NA, 2), then also sum all of those times, including the time of the NA row.

3.) keep only the first occurrence of each ia run, and delete the rest.

In the end, I would want to have something like this:

 a
       ia time
    1   1  6.9
    3   2  6.3
    6   1  20.4
    10  2  7.3
    11  1  2.3
    12  2  4.3

I found this for summing, but it sums over all rows with the same ia and does not take into account whether they are consecutive:

aggregate(time~ia,data=a,FUN=sum)

and I found this for deleting

a[cumsum(rle(as.numeric(a[,1]))$lengths),]

However, the rle approach keeps the last entry of each run, and I want to keep the first. I also have no idea how to handle the NAs.
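As a side note on the rle attempt, a minimal sketch of how the first row of each run could be indexed instead of the last. The names runs, last_idx and first_idx are illustrative only. It does not solve the NA problem: rle treats each NA as its own run, so the 2, NA, 2 stretch is still split into three runs.

```r
ia   <- c(1, 1, 2, NA, 2, 1, 1, 1, 1, 2, 1, 2)
time <- c(4.5, 2.4, 3.6, 1.5, 1.2, 4.9, 6.4, 4.4, 4.7, 7.3, 2.3, 4.3)
a <- data.frame(ia, time)

runs <- rle(a$ia)

# cumsum of the run lengths points at the LAST row of each run ...
last_idx <- cumsum(runs$lengths)

# ... so stepping back by (length - 1) gives the FIRST row of each run
first_idx <- last_idx - runs$lengths + 1

a[first_idx, ]   # rows 1, 3, 4, 5, 6, 10, 11, 12 (the NA row 4 is its own run)
```

The times are not summed here; this only shows which rows an rle-based selection would keep.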

If I have a pattern of 1-NA-2, then the NA should NOT be counted with either of them; in this case the NA row should be removed.

Answer 1:

You first need to replace sequences of NAs with the values surrounding them (if those values are the same). zoo's na.locf function fills in NAs with the last observation, or with the next one when called with fromLast = TRUE. By testing whether the result is the same when you carry values forwards and backwards, you can filter out the NAs you don't want, then do the carrying forward:

library(dplyr)
library(zoo)

a %>%
  filter(na.locf(ia) == na.locf(ia, fromLast = TRUE)) %>%
  mutate(ia = na.locf(ia))
#>    ia time
#> 1   1  4.5
#> 2   1  2.4
#> 3   2  3.6
#> 4   2  1.5
#> 5   2  1.2
#> 6   1  4.9
#> 7   1  6.4
#> 8   1  4.4
#> 9   1  4.7
#> 10  2  7.3
#> 11  1  2.3
#> 12  2  4.3
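To see why the comparison drops exactly the unwanted NAs, here is a small sketch on a toy vector that contains both patterns from the question (1-NA-2 and 2-NA-2). The vector and the names fwd, bwd and keep are made up for illustration. Passing na.rm = FALSE is an addition not in the answer above: it keeps the filled vectors the same length as the input even if the data starts or ends with NA.

```r
library(zoo)

ia <- c(1, NA, 2, NA, 2)

fwd <- na.locf(ia, na.rm = FALSE)                   # 1 1 2 2 2
bwd <- na.locf(ia, na.rm = FALSE, fromLast = TRUE)  # 1 2 2 2 2

# An NA between different values fills differently in each direction
keep <- fwd == bwd   # TRUE FALSE TRUE TRUE TRUE

fwd[keep]            # 1 2 2 2: the 1-NA-2 row is gone, the 2-NA-2 row became 2
```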

Now that you've fixed those NAs, you can group consecutive sets of values using cumsum. The full solution is:

result <- a %>%
  filter(na.locf(ia) == na.locf(ia, fromLast = TRUE)) %>%
  mutate(ia = na.locf(ia)) %>%
  mutate(change = ia != lag(ia, default = FALSE)) %>%
  group_by(group = cumsum(change), ia) %>%
  summarise(time = sum(time))
result
#> Source: local data frame [6 x 3]
#> Groups: group [?]
#> 
#>   group    ia  time
#>   (int) (dbl) (dbl)
#> 1     1     1   6.9
#> 2     2     2   6.3
#> 3     3     1  20.4
#> 4     4     2   7.3
#> 5     5     1   2.3
#> 6     6     2   4.3

If you want to get rid of the group column, use the additional lines:

result %>%
  ungroup() %>%
  select(-group)

Answer 2:

With data.table (as RHertel suggested for na.locf):

library(data.table)
library(zoo)

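# i: keep rows where forward and backward fill agree
# j: sum the times
# by: run id that increases whenever the filled ia changes
setDT(a)[na.locf(ia, fromLast = TRUE) == na.locf(ia),
         sum(time),
         by = .(id = cumsum(c(TRUE, diff(na.locf(ia)) != 0)))]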
#   id   V1
#1:  1  6.9
#2:  2  6.3
#3:  3 20.4
#4:  4  7.3
#5:  5  2.3
#6:  6  4.3

Answer 3:

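df <- a  # assumption: df is a copy of the question's data frame a
nas <- which(is.na(df$ia))
# For each NA, find the nearest non-NA rows before and after it
add.index <- sapply(nas, function(x) {
  logi <- which(!is.na(df$ia))
  aft <- logi[logi > x][1]
  fore <- tail(logi[logi < x], 1)
  # Use the following value only if it equals the preceding one
  if (df$ia[aft] == df$ia[fore]) aft else NA
})
df$ia[nas] <- df$ia[add.index]
df <- df[complete.cases(df), ]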

First we determine if the NA values of the column are surrounded by the same value. If yes, the surrounding value replaces the NA. There is no problem if the data has consecutive NA values.

Next we do a standard sum by group operation. cumsum allows us to create a unique group based on changes in the numbers.

df$grps <- cumsum(c(FALSE, df$ia[-length(df$ia)] != df$ia[-1])) + 1
aggregate(time ~ grps, df, sum)
#   grps time
# 1    1  6.9
# 2    2  6.3
# 3    3 20.4
# 4    4  7.3
# 5    5  2.3
# 6    6  4.3
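The aggregate call returns only the group number and the summed time. If the original row index and the ia value of the first row in each run should be kept, as in the question's desired output, something along the following lines might work. This is a sketch, not part of the answer above: fill_forward and fill_backward are hypothetical helpers written here to stay in base R, and the variable names are illustrative.

```r
a <- data.frame(
  ia   = c(1, 1, 2, NA, 2, 1, 1, 1, 1, 2, 1, 2),
  time = c(4.5, 2.4, 3.6, 1.5, 1.2, 4.9, 6.4, 4.4, 4.7, 7.3, 2.3, 4.3)
)

# Hypothetical helper: carry the last non-NA value forward
fill_forward <- function(x) {
  idx <- cumsum(!is.na(x))
  idx[idx == 0] <- NA            # leading NAs have nothing to carry, stay NA
  x[!is.na(x)][idx]
}
# Hypothetical helper: carry the next non-NA value backward
fill_backward <- function(x) rev(fill_forward(rev(x)))

fwd <- fill_forward(a$ia)
bwd <- fill_backward(a$ia)

# Keep rows whose forward and backward fills agree (drops e.g. the NA in 1-NA-2)
keep <- !is.na(fwd) & !is.na(bwd) & fwd == bwd
b <- a[keep, ]
b$ia <- fwd[keep]

# Run id: increases by one whenever ia differs from the previous row
run <- cumsum(c(TRUE, b$ia[-1] != b$ia[-nrow(b)]))

# First row of each run, with time replaced by the run total
out <- b[!duplicated(run), ]
out$time <- as.vector(tapply(b$time, run, sum))
out
#    ia time
# 1   1  6.9
# 3   2  6.3
# 6   1 20.4
# 10  2  7.3
# 11  1  2.3
# 12  2  4.3
```

Subsetting with `!duplicated(run)` preserves the original row names, which is why the index column reads 1, 3, 6, 10, 11, 12.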

This is a base R approach. With packages like dplyr, zoo, or data.table, other options are available, since they provide specialized functions for what we did here.

Source: https://stackoverflow.com/questions/32588433/r-sum-consecutive-duplicate-rows-and-remove-all-but-first

Tags

dataframe

duplicates