Use dplyr to take first and last row in a sequence by group [duplicate]

Question

I'm trying to use dplyr to take the first and last rows of repeated values by group. I'm doing this for efficiency reasons, particularly so that graphing is faster.

This is not a duplicate of "Select first and last row from grouped data" because I'm not asking for the strict first and last row in a group; I'm asking for the first and last row of each run of a level (in my case 1's and 0's) within a group, where a level may appear in several separate runs.

Here's an example. Say I want to remove all the redundant 1's and 0's from column C while keeping A and B intact.

df = data.frame(
    A = rep(c("a", "b"), each = 10),
    B = rep(c(1:10), 2),
    C = c(1,0,0,0,0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,1))

A  B C
a  1 1
a  2 0
a  3 0
a  4 0
a  5 0
a  6 0
a  7 1
a  8 1
a  9 1
a 10 1
b  1 0
b  2 0
b  3 0
b  4 1
b  5 0
b  6 0
b  7 0
b  8 0
b  9 0
b 10 1

The end result should look like this:

A  B C
a  1 1
a  2 0
a  6 0
a  7 1
a 10 1
b  1 0
b  3 0
b  4 1
b  5 0
b  9 0
b 10 1

Using unique will either not remove anything or just take one of the 1's or 0's without retaining the start-and-end quality that I'm trying to achieve. Is there a way to do this without a loop, perhaps using dplyr or forcats?
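To make the problem with unique concrete, here is a minimal base R sketch on the data above. It only illustrates the two failure modes described; nothing in it is specific to dplyr.

```r
# Same data as in the question
df <- data.frame(
  A = rep(c("a", "b"), each = 10),
  B = rep(1:10, 2),
  C = c(1,0,0,0,0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,1)
)

# Every (A, B, C) combination is already distinct, so nothing is removed.
n_all <- nrow(unique(df))

# Ignoring B collapses each A/C pair to one row and loses the run boundaries.
pairs <- unique(df[c("A", "C")])

print(n_all)
print(pairs)
```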

Answer 1:

I think that slice should get you close:

library(dplyr)

df %>%
  group_by(A, C) %>%
  slice(c(1, n()))  # first and last row of each A/C group

gives

      A     B     C
  <chr> <int> <dbl>
1     a     2     0
2     a     6     0
3     a     1     1
4     a    10     1
5     b     1     0
6     b     9     0
7     b     4     1
8     b    10     1

though this doesn't quite match your expected outcome. n() returns the number of rows in the current group, so using it as an index in slice selects the group's last row.
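As a small illustration of how n() behaves inside a grouped pipeline, here is a sketch on a made-up data frame (toy, g and v are hypothetical names, not part of the question's data):

```r
library(dplyr)

# Hypothetical toy data: group "x" has three rows, group "y" has one
toy <- data.frame(g = c("x", "x", "x", "y"), v = c(10, 20, 30, 40))

# n() is the size of the current group
sizes <- toy %>%
  group_by(g) %>%
  summarise(size = n())

# Used as a row index, n() therefore points at the last row of each group
last_rows <- toy %>%
  group_by(g) %>%
  slice(n()) %>%
  ungroup()

print(sizes)
print(last_rows)
```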

After your edit it is clear that you are not looking for the first and last rows within an existing grouping (which is what my previous version did); you want to group by the runs of 1's or 0's themselves. To do that, create one column that flags whether C changed from the previous row, and a second that takes the cumulative sum of those flags to label each run. Then slice works as described before. However, because some of your runs are only one row long, n() should be included only when the run has more than one row (otherwise that row shows up twice).

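df %>%
  mutate(
    groupChanged = C != lag(C, default = C[1]),  # TRUE where a new run starts
    toCutBy = cumsum(groupChanged)               # running count of changes labels each run
  ) %>%
  group_by(toCutBy) %>%
  slice(c(1, ifelse(n() == 1, NA, n())))         # NA index keeps single-row runs from repeating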

Gives

       A     B     C groupChanged toCutBy
   <chr> <int> <dbl>        <lgl>   <int>
1      a     1     1        FALSE       0
2      a     2     0         TRUE       1
3      a     6     0        FALSE       1
4      a     7     1         TRUE       2
5      a    10     1        FALSE       2
6      b     1     0         TRUE       3
7      b     3     0        FALSE       3
8      b     4     1         TRUE       4
9      b     5     0         TRUE       5
10     b     9     0        FALSE       5
11     b    10     1         TRUE       6
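The run-labelling step can be looked at in isolation. The following is a base R sketch on the C values of group a only; changed, run_id and edges are names picked for illustration.

```r
# Column C for group "a" from the question
C <- c(1, 0, 0, 0, 0, 0, 1, 1, 1, 1)

# TRUE wherever a value differs from the one before it
# (the first element has no predecessor, so it is never a change)
changed <- c(FALSE, C[-1] != C[-length(C)])

# The cumulative count of changes gives every run its own label
run_id <- cumsum(changed)

# First and last position of each run, keyed by run label
edges <- tapply(seq_along(C), run_id, range)

print(run_id)
print(edges)
```

Rows sharing a run_id belong to the same run, which is why grouping by that label and taking the first and last row gives the run boundaries.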

If runs of 1's or 0's must not span levels of column A, you also need to add a check for a change in column A to the call. In this example it has no effect (it returns exactly the same rows), because C happens to change at the boundary between a and b, but it may be desirable in other instances.

df %>%
  mutate(
    groupChanged = C != lag(C, default = C[1]) |
                   A != lag(A, default = A[1]),  # a change in A also starts a new run
    toCutBy = cumsum(groupChanged)
  ) %>%
  group_by(toCutBy) %>%
  slice(c(1, ifelse(n() == 1, NA, n())))
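If dplyr 1.1.0 or later is available, consecutive_id() may be able to replace the two helper columns. This is a sketch rather than part of the original answer: it assumes that dplyr version, run is a column name picked for illustration, and unique() is swapped in for the ifelse() guard against single-row runs.

```r
library(dplyr)  # assumes dplyr >= 1.1.0 for consecutive_id()

# Same data as in the question
df <- data.frame(
  A = rep(c("a", "b"), each = 10),
  B = rep(1:10, 2),
  C = c(1,0,0,0,0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,1)
)

result <- df %>%
  # the id increases whenever A or C differs from the previous row
  group_by(run = consecutive_id(A, C)) %>%
  # unique() collapses c(1, 1) to 1 for single-row runs
  slice(unique(c(1, n()))) %>%
  ungroup() %>%
  select(-run)

print(result)
```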

Answer 2:

One solution:

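library(dplyr)

# TRUE for every element that starts or ends a run of identical values
C_filter <- function(x) {
    # x[i - 1] is empty for i = 1, so the first element never matches a predecessor
    differs_from_prev <- !vapply(seq_along(x), function(i) {
        identical(x[i], x[i - 1])
    }, logical(1))
    # x[i + 1] is NA past the end, so a non-missing last element never matches a successor
    differs_from_next <- !vapply(seq_along(x), function(i) {
        identical(x[i], x[i + 1])
    }, logical(1))
    differs_from_prev | differs_from_next
}
df %>% group_by(A) %>% filter(C_filter(C))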

   A  B C
1  a  1 1
2  a  2 0
3  a  6 0
4  a  7 1
5  a 10 1
6  b  1 0
7  b  3 0
8  b  4 1
9  b  5 0
10 b  9 0
11 b 10 1
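The same neighbour comparison can probably be written without the element-by-element loop, using dplyr's lag() and lead(). This is a sketch under the assumption that C contains no missing values; it is an alternative formulation, not the code from the answer above.

```r
library(dplyr)

# Same data as in the question
df <- data.frame(
  A = rep(c("a", "b"), each = 10),
  B = rep(1:10, 2),
  C = c(1,0,0,0,0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,1)
)

result <- df %>%
  group_by(A) %>%
  filter(
    # a row starts a run if it is first in its group or differs from the previous row
    row_number() == 1 | C != lag(C) |
    # a row ends a run if it is last in its group or differs from the next row
    row_number() == n() | C != lead(C)
  ) %>%
  ungroup()

print(result)
```

The row_number() checks cover the group edges, where lag() and lead() return NA and the comparison alone would not be TRUE.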

Source: https://stackoverflow.com/questions/43126982/use-dplyr-to-take-first-and-last-row-in-a-sequence-by-group

Tags

dplyr