Use dplyr to take first and last row in a sequence by group [duplicate]

好久不见. 提交于 2021-02-10 14:51:54

问题


I'm trying to use dplyr to take the first and last rows of repeated values by group. I'm doing this for efficiency reasons, particularly so that graphing is faster.

This is not a duplicate of Select first and last row from grouped data because I'm not asking for the strict first and last row in a group; I'm asking for the first and last row in a group by level (in my case 1's and 0's) that may appear in multiple chunks.

Here's an example. Say I want to remove all the redundant 1's and 0's from column C while keeping A and B intact.

df = data.frame(
    A = rep(c("a", "b"), each = 10),
    B = rep(c(1:10), 2),
    C = c(1,0,0,0,0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,1))

A  B C
a  1 1
a  2 0
a  3 0
a  4 0
a  5 0
a  6 0
a  7 1
a  8 1
a  9 1
a 10 1
b  1 0
b  2 0
b  3 0
b  4 1
b  5 0
b  6 0
b  7 0
b  8 0
b  9 0
b 10 1

The end result should look like this:

A  B C
a  1 1
a  2 0
a  6 0
a  7 1
a 10 1
b  1 0
b  3 0
b  4 1
b  5 0
b  9 0
b 10 1

Using unique will either not remove anything or just take one of the 1's or 0's without retaining the start-and-end quality that I'm trying to achieve. Is there a way to do this without a loop, perhaps using dplyr or forcats?


回答1:


I think that slice should get you close:

df %>%
  group_by(A,C) %>%
  slice(c(1, n()))

gives

      A     B     C
  <chr> <int> <dbl>
1     a     2     0
2     a     6     0
3     a     1     1
4     a    10     1
5     b     1     0
6     b     9     0
7     b     4     1
8     b    10     1

though this doesn't quite match your expected outcome. n() gives the last row in the group.

After your edit it is clear that you are not looking for the values within any group that is established (which is what my previous version did). You want to group by those runs of 1's or 0's. For that, you will need to create a column that checks whether or not the run of 1's/0's has changed and then one to identify the groups. Then, slice will work as described before. However, because some of your runs are only 1 row long, we need to only include n() if it is more than 1 (otherwise the 1 row shows up twice).

df %>%
  mutate(groupChanged = (C != lag(C, default = C[1]))
         , toCutBy = cumsum(groupChanged)
         ) %>%
  group_by(toCutBy) %>%
  slice(c(1, ifelse(n() == 1, NA, n())))

Gives

       A     B     C groupChanged toCutBy
   <chr> <int> <dbl>        <lgl>   <int>
1      a     1     1        FALSE       0
2      a     2     0         TRUE       1
3      a     6     0        FALSE       1
4      a     7     1         TRUE       2
5      a    10     1        FALSE       2
6      b     1     0         TRUE       3
7      b     3     0        FALSE       3
8      b     4     1         TRUE       4
9      b     5     0         TRUE       5
10     b     9     0        FALSE       5
11     b    10     1         TRUE       6

If the runs of 1 or 0 must stay within the level in column A, you also need to add a check for a change in column A to the call. In this example, it does not have an effect (so returns exactly the same values), but it may be desirable in other instances.

df %>%
  mutate(groupChanged = (C != lag(C, default = C[1]) |
                           A != lag(A, default = A[1]))
         , toCutBy = cumsum(groupChanged)
  ) %>%
  group_by(toCutBy) %>%
  slice(c(1, ifelse(n() == 1, NA, n())))



回答2:


One solution:

C_filter <- function(x) {
    !sapply(1:length(x), function(i) {
        identical(x[i], x[i-1])
    }) | !sapply(1:length(x), function(i) {
        identical(x[i], x[i+1])
    }) 
}
df %>% group_by(A) %>% filter(C_filter(C))

   A  B C
1  a  1 1
2  a  2 0
3  a  6 0
4  a  7 1
5  a 10 1
6  b  1 0
7  b  3 0
8  b  4 1
9  b  5 0
10 b  9 0
11 b 10 1


来源:https://stackoverflow.com/questions/43126982/use-dplyr-to-take-first-and-last-row-in-a-sequence-by-group

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!