Merge duplicate characters in R while preserving data frame structure

Posted by 时光怂恿深爱的人放手 on 2019-12-10 20:19:48

Question


I have a toy edgelist for Neural Networking that looks like this:

df<-c("Group1", "Group1", "Group2", "Group1, Group3", "Group1, Group3", 
"Group3", "Group3, Group4", "Group3, Group4")

    V1
1   Group1
2   Group1
3   Group2
4   Group1, Group3
5   Group1, Group3
6   Group3
7   Group3, Group4
8   Group3, Group4

I need to preserve the 8-row structure of the data (with the individual duplicate elements like Group1 in rows 1 & 2), but I want to:

1) Identify instances of duplicate entries that are delimited by a comma (i.e. "Group1, Group3" and "Group3, Group4")

2) For these instances, find a way to merge the values so one unique value is left in the first duplicate row, and the second unique value is left in the second duplicate row, as so:

    V1
1   Group1
2   Group1
3   Group2
4   Group1 <- Group3 is dropped
5   Group3 <- Group1 is dropped
6   Group3
7   Group3 <- Group4 is dropped
8   Group4 <- Group3 is dropped

All of the duplicates occur in multiples of two, so there aren't any issues with an odd number of repetitions with only two values, etc.

EDIT:

For future reference, what could I do if the edgelist had non-sequential duplicates like so:

df<-c("Group1", "Group1, Group3", "Group2", "Group1, Group3", "Group3", 
      "Group3, Group4", "Group3", "Group3, Group4")
    V1
1   Group1
2   Group1, Group3
3   Group2
4   Group1, Group3
5   Group3
6   Group3, Group4
7   Group3
8   Group3, Group4

The solutions offered wouldn't work in this situation. Also, since the position of the rows is crucial for networking, the data can't be sorted. Any suggestions?


Answer 1:


Remove the second copy of each comma-delimited duplicate, then split the remaining entries at the comma.

unlist(strsplit(df[!(ave(seq_along(df), df, FUN = seq_along) == 2 & grepl(",", df))], ", "))
#[1] "Group1" "Group1" "Group2" "Group1" "Group3" "Group3" "Group3" "Group4"

df may need to be sorted first if there is a chance duplicates won't be together.
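The one-liner above packs several steps into one expression. As a rough sketch, it can be unpacked as follows; the intermediate names `occurrence`, `has_comma`, `kept` and `result` are introduced here for illustration only and do not appear in the original answer.

```r
df <- c("Group1", "Group1", "Group2", "Group1, Group3", "Group1, Group3",
        "Group3", "Group3, Group4", "Group3, Group4")

# Running count within each distinct value: 1 for its first appearance, 2 for its second, ...
occurrence <- ave(seq_along(df), df, FUN = seq_along)

# TRUE for entries that hold a comma-delimited pair
has_comma <- grepl(",", df, fixed = TRUE)

# Drop the second copy of each pair, then split what is left at ", "
kept <- df[!(occurrence == 2 & has_comma)]
result <- unlist(strsplit(kept, ", ", fixed = TRUE))
result
```

Because the surviving pair is expanded in place, both of its values end up next to each other. That matches the desired output only when the two copies of a pair sit in adjacent rows.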

Here's another approach using mapply that should work regardless of the order of df

df<-c("Group1", "Group1, Group3", "Group2", "Group1, Group3", "Group3", 
      "Group3, Group4", "Group3", "Group3, Group4")
d = lapply(unique(df), function(x) strsplit(x, ", ?"))
ind = match(df, unique(df))
grp = ifelse(grepl(",", df), ave(seq_along(df), df, FUN = seq_along), 1)
df2 = mapply(function(i, g) d[[i]][[1]][g], ind, grp)
data.frame(df, df2)
#>               df    df2
#> 1         Group1 Group1
#> 2 Group1, Group3 Group1
#> 3         Group2 Group2
#> 4 Group1, Group3 Group3
#> 5         Group3 Group3
#> 6 Group3, Group4 Group3
#> 7         Group3 Group3
#> 8 Group3, Group4 Group4
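The same idea can be written without the lookup through `unique(df)`. The following is a minimal sketch, assuming a hypothetical helper named `pick_by_occurrence` (not part of the original answer): each entry is split on its own, and the k-th occurrence of a comma-delimited entry keeps its k-th piece.

```r
# Hypothetical helper, for illustration only.
# Entries without a comma are returned unchanged; the k-th occurrence of a
# comma-delimited entry keeps its k-th piece (NA if there are more
# occurrences than pieces).
pick_by_occurrence <- function(x, sep = ", ?") {
  pieces <- strsplit(x, sep)
  occurrence <- ave(seq_along(x), x, FUN = seq_along)
  mapply(function(p, k) if (length(p) > 1) p[k] else p,
         pieces, occurrence, USE.NAMES = FALSE)
}

sequential <- c("Group1", "Group1", "Group2", "Group1, Group3", "Group1, Group3",
                "Group3", "Group3, Group4", "Group3, Group4")
non_sequential <- c("Group1", "Group1, Group3", "Group2", "Group1, Group3", "Group3",
                    "Group3, Group4", "Group3", "Group3, Group4")

out_seq <- pick_by_occurrence(sequential)
out_nonseq <- pick_by_occurrence(non_sequential)
```

Row positions are preserved in both cases because nothing is removed or reordered; each row is simply replaced by one of its own pieces.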



Answer 2:


Using tidyverse functions.

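library(dplyr)
library(stringr)
library(tidyr)

df_t <- data.frame(V1 = df)

df_t %>%
    dplyr::group_by(V1) %>%
    # drop the second occurrence of each comma-delimited entry
    dplyr::filter(!(row_number() == 2 & str_detect(V1, ","))) %>%
    dplyr::ungroup() %>%
    tidyr::separate_rows(V1)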
# A tibble: 8 x 1
  V1    
* <chr> 
1 Group1
2 Group1
3 Group2
4 Group1
5 Group3
6 Group3
7 Group3
8 Group4



Answer 3:


Another option with rowid

library(data.table)
library(stringr)
data.table(V1 = df)[!(rowid(V1) == 2 & str_detect(V1, ",")),
          .(V1 = unlist(strsplit(V1, ", ")))]
#   V1
#1: Group1
#2: Group1
#3: Group2
#4: Group1
#5: Group3
#6: Group3
#7: Group3
#8: Group4

Or using tidyverse

library(dplyr)
library(tidyr)
library(stringr)  # provides str_detect()
tibble(V1 = df) %>%
   filter(!duplicated(case_when(str_detect(V1, ',') ~ V1,
       TRUE ~ make.unique(V1)))) %>%
   separate_rows(V1)
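
The `duplicated()` / `make.unique()` trick in the last snippet can be tried without any packages. This is a base R sketch of the same logic, assuming the sequential `df` from the question; the names `key`, `kept_rows` and `result_key` are illustrative.

```r
df <- c("Group1", "Group1", "Group2", "Group1, Group3", "Group1, Group3",
        "Group3", "Group3, Group4", "Group3, Group4")

# Comma-delimited entries keep their value as the key, so repeats count as duplicates.
# All other entries get a unique key, so none of them is ever dropped.
key <- ifelse(grepl(",", df), df, make.unique(df))

# Keep the first row for each key, then split the surviving pairs at ", "
kept_rows <- df[!duplicated(key)]
result_key <- unlist(strsplit(kept_rows, ", "))
result_key
```

Note that `duplicated()` removes every repeat after the first, so this relies on the question's statement that each comma-delimited entry appears exactly twice.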


Source: https://stackoverflow.com/questions/58329390/merge-duplicate-characters-in-r-while-preserving-data-frame-structure
