Consecutive occurrence in a data frame

冷暖自知 submitted on 2021-02-04 19:43:05

Question


I have a data frame (shown below) containing different measurements. I would like to identify runs of consecutive w measurements, of length 6 or more, across the time points t. For example, for id 1 there are 6 consecutive w measurements from t3 to t8.

I would like to save the results into 2 data frames:

df1: At least 6 consecutive measurements of w (per id) before the first occurrence of w;
df2: From the time of the last occurrence of w (per id), there are fewer than 6 consecutive measurements of w;
    

The format of my dataset with and without consecutive w occurrences:

 id t1 t2 t3 t4 t5 t6 t7 t8 t9 t10
  1  s  s  w  w  w  w  w  w  w  w #7 occ. of w after t3
  2  s  w  w  w  e  w  w  w  w  w  #no 6 consecutive w occurrence
  3  w  w  w  w  w  w  s  s  s  r #6 occ. of w before t6
  4  e  w  w  w  w  w  w  w  w  w #9 occ. of w after t1
  5  w  w  w  w  w  w  r  w  w  w #6 occ. of w before t7
  6  w  s  w  r  w  r  w  w  s  w #no 6 consecutive w occurrence

Output:

Before w:

id t1 t2 t3 t4 t5 t6 t7 t8 t9 t10
3                  w  s  s  s  r
5                  w  r  
   
After w:

id t1 t2 t3 t4 t5 t6 t7 t8 t9 t10
1   s  s  w
4   e  w

Sample data:

df<-structure(list(id=c(1,2,3,4,5,6), t1=c("s","s","w","e","w","w"), t2=c("s","w","w","w","w","s"),t3 = c("w","w","w","w","w","w"),
                        t4 = c("w","w","w","w","w","r"), t5 = c("w","e","w","w","w","w"), t6 = c("w","w","w","w","w","r"),
                       t7= c("w","w","s","w","r","w"), t8 = c("w","w","s","w","w","w"), t9=c("e","w","s","w","w","s"), t10=c("w","w","r","w","w","w")), row.names = c(NA, 6L), class = "data.frame")
    

Code:

Before (Not working for at least 6 consecutive time steps):

df1 <- df
df1[-1] <- t(apply(df[-1], 1, function(x) replace(x, seq_along(x) > match('w', x), '')))
df1<-df1[rowSums(df1 == 'w')!=0,  ,drop = FALSE]

After (Not working for at least 6 consecutive time steps):

df2 <- df
df2[-1] <- t(apply(df[-1], 1, function(x) replace(x, seq_along(x) <= match('w', x), '')))

df2 <- df2[c(TRUE, colSums(df2[-2] != '') > 0)]
df2<-df2[rowSums(df2 == 'w')!=0,  ,drop = FALSE]

Answer 1:


Not very smart and more experimental, but you could try:

library(tidyverse)

df <- pivot_longer(df, -id) %>%
  group_by(id, idx = rep(seq_along(rle(value)$lengths), times = rle(value)$lengths)) %>%
  filter(any(cumsum(value == 'w') == 6 & value == 'w') | value != 'w') %>%
  group_by(id) %>% select(-idx) %>%
  filter(any(value == 'w')) %>%
  mutate(w_consec = cumsum(value == 'w'),
         group = case_when(
           any(value != 'w' & w_consec == 0) ~ 'After',
           any(value != 'w' & w_consec == 6) ~ 'Before')) %>%
  filter(
    if (any(group == 'After')) (value == 'w' & w_consec == 1) | (value != 'w' & w_consec == 0)
    else w_consec == 6
    ) %>%
  pivot_wider(id_cols = c('id', 'group'), names_from = name, values_from = value)

Grouping by the idx variable in the second step ensures that we keep only those occurrences of w that belong to a run of 6 consecutive repeats. Otherwise, with an example sequence such as wwwwwwebww, we would lose the eb information: every w would be carried into the next steps, and we would end up with a single w. The rle function is used here to assign the same value to all consecutive occurrences of any character (used this way, it behaves like data.table::rleid; see the help page of the latter for more context).
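To make the run-id idea concrete, here is a minimal base R sketch on the example sequence wwwwwwebww. The names value, runs, idx, run_len and in_long_run are illustrative only and do not appear in the answer's pipeline.

```r
# Example sequence from the paragraph above, split into single characters.
value <- strsplit("wwwwwwebww", "")[[1]]

runs <- rle(value)

# One integer id per run, repeated for every element of that run
# (the same idea as data.table::rleid).
idx <- rep(seq_along(runs$lengths), times = runs$lengths)
idx
# 1 1 1 1 1 1 2 3 4 4

# Length of the run that each element belongs to.
run_len <- rep(runs$lengths, times = runs$lengths)

# TRUE only for w elements that sit in a run of at least 6.
in_long_run <- value == "w" & run_len >= 6
in_long_run
```

Only the first six elements are flagged; the trailing ww belongs to a run of length 2 and is left out, so the e and b in between are not absorbed into a longer count.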

After that, you can use split:

split(df, df$group)

Output:

$After
# A tibble: 2 x 10
# Groups:   id [2]
     id group t1    t2    t3    t6    t7    t8    t9    t10  
  <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1     1 After s     s     w     NA    NA    NA    NA    NA   
2     4 After e     w     NA    NA    NA    NA    NA    NA   

$Before
# A tibble: 2 x 10
# Groups:   id [2]
     id group  t1    t2    t3    t6    t7    t8    t9    t10  
  <dbl> <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1     3 Before NA    NA    NA    w     s     s     s     r    
2     5 Before NA    NA    NA    w     r     NA    NA    NA   

If you want to include it within your environment as separate data frames:

list2env(
  split(df, df$group), .GlobalEnv
)



Answer 2:


This problem is probably easier to solve with regex. Specifically, (^|[^w]+)w(?=w{6}) and (?<=([^w]|^)(w{5}))w([^w]+|$).

Combine all columns into a single string.

library("tidyverse")

df_original <- read_table("
 id t1 t2 t3 t4 t5 t6 t7 t8 t9 t10
  1  s  s  w  w  w  w  w  w  w  w
  2  s  w  w  w  e  w  w  w  w  w
  3  w  w  w  w  w  w  s  s  s  r
  4  e  w  w  w  w  w  w  w  w  w
  5  w  w  w  w  w  w  r  w  w  w
  6  w  s  w  r  w  r  w  w  s  w
")

df <- df_original %>% unite(col = "combined", -id, sep = "") 
df
#> # A tibble: 6 x 2
#>      id combined  
#>   <dbl> <chr>     
#> 1     1 sswwwwwwww
#> 2     2 swwwewwwww
#> 3     3 wwwwwwsssr
#> 4     4 ewwwwwwwww
#> 5     5 wwwwwwrwww
#> 6     6 wswrwrwwsw
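If you prefer not to depend on tidyr for this step, the same row-wise concatenation can be sketched in base R. The small data frame df_small below is a made-up three-column example, not the question's data.

```r
# Hypothetical miniature of the wide data: one row per id, one column per time.
df_small <- data.frame(
  id = c(1, 3),
  t1 = c("s", "w"),
  t2 = c("s", "w"),
  t3 = c("w", "s"),
  stringsAsFactors = FALSE
)

# paste0 is vectorised, so passing the columns as separate arguments
# concatenates them row by row. unname() avoids any clash between column
# names and paste0's own argument names.
combined <- do.call(paste0, unname(as.list(df_small[-1])))
combined
# "ssw" "wws"
```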

str_locate can be used to find the start and end points of interest using regex.

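  • (^|[^w]+)w(?=w{6}) means: match the start of the string or a run of non-w characters, then a w, provided that w is followed by 6 more ws (a lookahead, so those ws are not part of the match).
  • (?<=([^w]|^)(w{5}))w([^w]+|$) means: match a w that is preceded by exactly 5 ws, which are themselves preceded by a non-w or the start of the string (a lookbehind), followed by a run of non-w characters or the end of the string.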

See ?stringi::about_search_regex for syntax details.
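The count inside the lookahead decides how long a run must be, which is easy to check in base R with regexpr(..., perl = TRUE). This is only a sketch: the second pattern, with w{5}, is a hypothetical variant that is not part of the answer, and the test strings are chosen for illustration.

```r
x <- c("sswwwwwwww",  # 8 w after "ss"
       "sswwwwwwew",  # 6 w after "ss", then e
       "wwwwwwsssr")  # 6 w at the very start

# Pattern from the answer: a w followed by six more w, i.e. a run of 7 or more.
m7 <- regexpr("(^|[^w]+)w(?=w{6})", x, perl = TRUE)

# Hypothetical variant: a w followed by five more w, i.e. a run of 6 or more.
m6 <- regexpr("(^|[^w]+)w(?=w{5})", x, perl = TRUE)

as.integer(m7)  # 1 -1 -1  (-1 means no match)
as.integer(m6)  # 1  1  1
attr(m6, "match.length")  # 3 3 1
```

Note that the variant also matches a run that begins at the start of the string (third element, match length 1), so whether it suits your data depends on which rows you want in the result.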

df1 <- 
  df %>%
  mutate(end_points = str_locate(combined, "(^|[^w]+)w(?=w{6})"))
df1
#> # A tibble: 6 x 3
#>      id combined   end_points[,"start"] [,"end"]
#>   <dbl> <chr>                     <int>    <int>
#> 1     1 sswwwwwwww                    1        3
#> 2     2 swwwewwwww                   NA       NA
#> 3     3 wwwwwwsssr                   NA       NA
#> 4     4 ewwwwwwwww                    1        2
#> 5     5 wwwwwwrwww                   NA       NA
#> 6     6 wswrwrwwsw                   NA       NA

df2 <-
  df %>%
  mutate(end_points = str_locate(combined, "(?<=([^w]|^)(w{5}))w([^w]+|$)"))
df2
#> # A tibble: 6 x 3
#>      id combined   end_points[,"start"] [,"end"]
#>   <dbl> <chr>                     <int>    <int>
#> 1     1 sswwwwwwww                   NA       NA
#> 2     2 swwwewwwww                   NA       NA
#> 3     3 wwwwwwsssr                    6       10
#> 4     4 ewwwwwwwww                   NA       NA
#> 5     5 wwwwwwrwww                    6        7
#> 6     6 wswrwrwwsw                   NA       NA

To turn the end points into a masked string, we can define a small helper function, mask_string.

mask_string <- function(string, start, end) {
  result <- str_pad("", nchar(string))
  str_sub(result, start, end) <- str_sub(string, start, end)
  result
}
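For readers without stringr, a base R counterpart can be sketched with strrep and substr. The name mask_string_base is hypothetical, and the sketch only covers non-missing start and end positions.

```r
# Hypothetical base R counterpart of mask_string (non-NA positions only).
mask_string_base <- function(string, start, end) {
  # A blank string of the same width as the input.
  result <- strrep(" ", nchar(string))
  # Copy the characters between start and end into the blank string.
  substr(result, start, end) <- substr(string, start, end)
  result
}

masked_demo <- mask_string_base(
  c("wwwwwwsssr", "sswwwwwwww"),
  start = c(6, 1),
  end   = c(10, 3)
)
masked_demo
# "     wsssr" "ssw       "
```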

df1 <-
  df1 %>%
  mutate(masked = mask_string(combined, end_points[, "start"], end_points[, "end"]))
df1
#> # A tibble: 6 x 4
#>      id combined   end_points[,"start"] [,"end"] masked      
#>   <dbl> <chr>                     <int>    <int> <chr>       
#> 1     1 sswwwwwwww                    1        3 "ssw       "
#> 2     2 swwwewwwww                   NA       NA  NA         
#> 3     3 wwwwwwsssr                   NA       NA  NA         
#> 4     4 ewwwwwwwww                    1        2 "ew        "
#> 5     5 wwwwwwrwww                   NA       NA  NA         
#> 6     6 wswrwrwwsw                   NA       NA  NA      

df2 <-
  df2 %>%
  mutate(masked = mask_string(combined, end_points[, "start"], end_points[, "end"]))
df2
#> # A tibble: 6 x 4
#>      id combined   end_points[,"start"] [,"end"] masked      
#>   <dbl> <chr>                     <int>    <int> <chr>       
#> 1     1 sswwwwwwww                   NA       NA  NA         
#> 2     2 swwwewwwww                   NA       NA  NA         
#> 3     3 wwwwwwsssr                    6       10 "     wsssr"
#> 4     4 ewwwwwwwww                   NA       NA  NA         
#> 5     5 wwwwwwrwww                    6        7 "     wr   "
#> 6     6 wswrwrwwsw                   NA       NA  NA         

Then, this can be mapped back to the columns t1, t2, etc. like this.

df1 %>% 
  filter(!is.na(masked)) %>%
  separate(masked, c("blank", names(df_original)[-1]), "") %>%
  select(id, starts_with("t"))
#> # A tibble: 2 x 11
#>      id t1    t2    t3    t4    t5    t6    t7    t8    t9    t10  
#>   <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1     1 s     s     "w"   " "   " "   " "   " "   " "   " "   " "  
#> 2     4 e     w     " "   " "   " "   " "   " "   " "   " "   " "  


df2 %>% 
  filter(!is.na(masked)) %>%
  separate(masked, c("blank", names(df_original)[-1]), "") %>%
  select(id, starts_with("t"))
#> # A tibble: 2 x 11
#>      id t1    t2    t3    t4    t5    t6    t7    t8    t9    t10  
#>   <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1     3 " "   " "   " "   " "   " "   w     s     "s"   "s"   "r"  
#> 2     5 " "   " "   " "   " "   " "   w     r     " "   " "   " " 
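The same mapping can be sketched in base R by splitting each masked string into single characters. The vectors masked and ids below are typed in by hand from the df1 output above, and turning the blanks into NA is an optional extra step that the answer does not perform.

```r
masked <- c("ssw       ", "ew        ")
ids <- c(1, 4)
cols <- paste0("t", 1:10)

# strsplit with "" yields one element per character; rbind stacks the rows.
chars <- do.call(rbind, strsplit(masked, ""))
colnames(chars) <- cols

out <- data.frame(id = ids, chars, stringsAsFactors = FALSE)

# Optional: replace the blank placeholders with NA.
out[out == " "] <- NA
out
```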


Source: https://stackoverflow.com/questions/64186225/consecutive-occurrence-in-a-data-frame
