If there's not a quick 1-3 liner for this in R, I'll definitely just use Linux sort and a short Python program using groupby, so don't bend over
In dplyr, I would do this by creating another grouping variable for the consecutive rows. This is what cumsum(c(1, diff(weight) != 0)) is doing in the code chunk below: diff(weight) != 0 flags every row whose weight differs from the previous row, and cumsum turns those flags into a running group number. An example of this is also here.
The group creation can be done within group_by, and then you can proceed accordingly with making any summaries by group.
library(dplyr)
df_in %>%
    group_by(ID, group_weight = cumsum(c(1, diff(weight) != 0)), weight) %>%
    summarise(start_day = min(start_day), end_day = max(end_day))
Source: local data frame [5 x 5]
Groups: ID, group_weight [?]
ID group_weight weight start_day end_day
(dbl) (dbl) (dbl) (dbl) (dbl)
1 1 1 150 1 7
2 1 2 151 7 10
3 1 3 150 10 30
4 2 4 170 5 20
5 2 5 171 20 30
This approach does leave you with the extra grouping variable in the dataset, which can be removed, if needed, with select(-group_weight) after ungrouping.
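To see what the cumsum(c(1, diff(weight) != 0)) expression does on its own, here is a minimal sketch on a bare vector. The weight values are copied from the example data; the names changed and run_id are only for illustration.

```r
# Weights in row order, copied from the example data
weight <- c(150, 150, 151, 150, 150, 170, 170, 170, 171, 171)

# TRUE wherever a value differs from the one before it
# (one element shorter than weight)
changed <- diff(weight) != 0

# Prepend 1 for the first row, then take a running total:
# the counter goes up by one each time a new run starts
run_id <- cumsum(c(1, changed))

print(run_id)
```

Rows 4 and 5 get a different run number than rows 1 and 2 even though all four have weight 150, which is the point of grouping consecutive rows rather than grouping on weight alone.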
First we combine ID and weight. The quick-and-dirty way is using paste:
df_in$id_weight <- paste(df_in$ID, df_in$weight, sep='_')
df_in
ID weight start_day end_day id_weight
1 1 150 1 4 1_150
2 1 150 4 7 1_150
3 1 151 7 10 1_151
4 1 150 10 11 1_150
5 1 150 11 30 1_150
6 2 170 5 10 2_170
7 2 170 10 15 2_170
8 2 170 15 20 2_170
9 2 171 20 25 2_171
10 2 171 25 30 2_171
A safer way is to use interaction or group_indices; see "Combine values in 4 columns to a single unique value".
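As a rough sketch of the interaction route mentioned above (not necessarily how the linked answer does it), the key can be built as a factor and then converted to integer codes. The ID and weight vectors are copied from the example data; key and key_int are illustrative names. The conversion matters because rle() expects a plain vector rather than a factor.

```r
# ID and weight columns copied from the example data
ID     <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
weight <- c(150, 150, 151, 150, 150, 170, 170, 170, 171, 171)

# One factor level per observed (ID, weight) combination;
# drop = TRUE discards combinations that never occur
key <- interaction(ID, weight, drop = TRUE)

# Integer codes are a plain vector, so they can be passed to rle()
key_int <- as.integer(key)

print(nlevels(key))
print(rle(key_int)$lengths)
```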
We can group consecutive rows using rle.
rlel <- rle(df_in$id_weight)$lengths
df_in$group <- rep(seq_along(rlel), times = rlel)
df_in
ID weight start_day end_day id_weight group
1 1 150 1 4 1_150 1
2 1 150 4 7 1_150 1
3 1 151 7 10 1_151 2
4 1 150 10 11 1_150 3
5 1 150 11 30 1_150 3
6 2 170 5 10 2_170 4
7 2 170 10 15 2_170 4
8 2 170 15 20 2_170 4
9 2 171 20 25 2_171 5
10 2 171 25 30 2_171 5
Now with the convenient group number we can summarize by group.
df_in %>%
    group_by(group) %>%
    summarize(id_weight = id_weight[1],
              start_day = min(start_day),
              end_day = max(end_day))
# A tibble: 5 x 4
group id_weight start_day end_day
<int> <chr> <dbl> <dbl>
1 1 1_150 1 7
2 2 1_151 7 10
3 3 1_150 10 30
4 4 2_170 5 20
5 5 2_171 20 30
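For anyone who wants to run the rle steps end to end without dplyr, here is a base R sketch. It assumes df_in looks like the table printed above (the data.frame call below just retypes those values), and key, runs, first_row and result are illustrative names.

```r
# Rebuild the example data as printed above
df_in <- data.frame(
  ID        = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  weight    = c(150, 150, 151, 150, 150, 170, 170, 170, 171, 171),
  start_day = c(1, 4, 7, 10, 11, 5, 10, 15, 20, 25),
  end_day   = c(4, 7, 10, 11, 30, 10, 15, 20, 25, 30)
)

# Combined key, then one group number per run of identical keys
key  <- paste(df_in$ID, df_in$weight, sep = "_")
runs <- rle(key)
df_in$group <- rep(seq_along(runs$lengths), times = runs$lengths)

# First row of each run supplies ID and weight
first_row <- df_in[!duplicated(df_in$group), c("group", "ID", "weight")]

# Earliest start and latest end within each run
result <- data.frame(
  first_row,
  start_day = as.vector(tapply(df_in$start_day, df_in$group, min)),
  end_day   = as.vector(tapply(df_in$end_day, df_in$group, max)),
  row.names = NULL
)

print(result)
```

This keeps ID and weight as separate columns instead of the pasted id_weight string, which may be more convenient downstream.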
with(df_in, {
    aggregate(day, list('ID'=ID, 'weight'=weight),
              function(x) c('min_day' = min(x), 'max_day' = max(x)))
})
Produces:
ID weight x.min_day x.max_day
1 1 150 1 5
2 1 151 3 3
3 2 170 1 3
4 2 171 4 5