Question
I have a toy edgelist for Neural Networking that looks like this:
df<-c("Group1", "Group1", "Group2", "Group1, Group3", "Group1, Group3",
"Group3", "Group3, Group4", "Group3, Group4")
V1
1 Group1
2 Group1
3 Group2
4 Group1, Group3
5 Group1, Group3
6 Group3
7 Group3, Group4
8 Group3, Group4
I need to preserve the 8-row structure of the data (with the individual duplicate elements like Group1 in rows 1 & 2), but I want to:
1) Identify instances of duplicate entries that are delimited by a comma (i.e. "Group1, Group3" and "Group3, Group4").
2) For these instances, merge the values so that one unique value is left in the first duplicate row and the other unique value is left in the second duplicate row, like so:
V1
1 Group1
2 Group1
3 Group2
4 Group1 <- Group3 is dropped
5 Group3 <- Group1 is dropped
6 Group3
7 Group3 <- Group4 is dropped
8 Group4 <- Group3 is dropped
All of the duplicates occur in multiples of two, so there aren't any issues with an odd number of repetitions with only two values, etc.
EDIT:
For future reference, what could I do if the edgelist had non-sequential duplicates like so:
df<-c("Group1", "Group1, Group3", "Group2", "Group1, Group3", "Group3",
"Group3, Group4", "Group3", "Group3, Group4")
V1
1 Group1
2 Group1, Group3
3 Group2
4 Group1, Group3
5 Group3
6 Group3, Group4
7 Group3
8 Group3, Group4
The solutions offered so far would not work in this situation. Also, since the position of the rows is crucial for networking, the data can't be sorted. Any suggestions?
Answer 1:
Remove the second copy of each comma-delimited duplicate, then split the remaining entries at the comma.
unlist(strsplit(df[!(ave(seq_along(df), df, FUN = seq_along) == 2 & grepl(",", df))], ", "))
#[1] "Group1" "Group1" "Group2" "Group1" "Group3" "Group3" "Group3" "Group4"
df may need to be sorted first if there is a chance that the duplicates won't be adjacent.
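To make the one-liner above easier to follow, it can be unpacked into named steps. This is only a sketch: the helper name drop_and_split and the variable names adjacent and scattered are hypothetical and not part of the original answer. Running it on both inputs from the question also shows why adjacency matters for this approach.

```r
# Hypothetical helper that unpacks the one-liner into named steps.
drop_and_split <- function(x) {
  # Running count per distinct value: 1 for its first occurrence, 2 for its second, ...
  occurrence <- ave(seq_along(x), x, FUN = seq_along)
  # TRUE for entries that hold a comma-delimited pair.
  has_comma <- grepl(",", x)
  # Drop the second copy of each pair, then split the surviving pairs in two.
  kept <- x[!(occurrence == 2 & has_comma)]
  unlist(strsplit(kept, ", "))
}

# Input with adjacent duplicates (first example in the question).
adjacent <- c("Group1", "Group1", "Group2", "Group1, Group3", "Group1, Group3",
              "Group3", "Group3, Group4", "Group3, Group4")

# Input with non-adjacent duplicates (the EDIT in the question).
scattered <- c("Group1", "Group1, Group3", "Group2", "Group1, Group3", "Group3",
               "Group3, Group4", "Group3", "Group3, Group4")

res_adjacent  <- drop_and_split(adjacent)
res_scattered <- drop_and_split(scattered)
print(res_adjacent)
print(res_scattered)
```

Both results have 8 elements, but for the scattered input the two halves of each pair end up next to the position of the first copy, so the row positions no longer line up with the original data.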
Here's another approach using mapply that should work regardless of the order of df:
df<-c("Group1", "Group1, Group3", "Group2", "Group1, Group3", "Group3",
"Group3, Group4", "Group3", "Group3, Group4")
d = lapply(unique(df), function(x) strsplit(x, ", ?"))
ind = match(df, unique(df))
grp = ifelse(grepl(",", df), ave(seq_along(df), df, FUN = seq_along), 1)
df2 = mapply(function(i, g) d[[i]][[1]][g], ind, grp)
data.frame(df, df2)
#>               df    df2
#> 1         Group1 Group1
#> 2 Group1, Group3 Group1
#> 3         Group2 Group2
#> 4 Group1, Group3 Group3
#> 5         Group3 Group3
#> 6 Group3, Group4 Group3
#> 7         Group3 Group3
#> 8 Group3, Group4 Group4
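The same idea can be wrapped in a small function, which may make the roles of ind and grp clearer. This is a minimal sketch, not the answerer's code: the name split_duplicates is hypothetical, and vapply is swapped in for mapply so that the return type is checked.

```r
# Hypothetical wrapper around the approach above; the name is illustrative only.
split_duplicates <- function(x) {
  keys  <- unique(x)
  # One character vector of pieces per distinct entry.
  parts <- strsplit(keys, ", ?")
  # Which distinct entry each row refers to.
  ind   <- match(x, keys)
  # For comma-delimited rows, the running occurrence count picks the piece;
  # plain rows always take their first (and only) piece.
  grp   <- ifelse(grepl(",", x), ave(seq_along(x), x, FUN = seq_along), 1)
  vapply(seq_along(x), function(k) parts[[ind[k]]][grp[k]], character(1))
}

df <- c("Group1", "Group1, Group3", "Group2", "Group1, Group3", "Group3",
        "Group3, Group4", "Group3", "Group3, Group4")
out <- split_duplicates(df)
print(data.frame(df, out))
```

Because every row is mapped in place, the output keeps the original row positions. One caveat under these assumptions: if a comma-delimited pair appeared more than twice, the third and later occurrences would index past the two pieces and come back as NA.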
Answer 2:
Using tidyverse functions.
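library(dplyr)    # %>%, group_by, filter, row_number
library(stringr)  # str_detect
library(tidyr)    # separate_rows
df_t <- data.frame(V1 = df)
df_t %>%
  dplyr::group_by(V1) %>%
  # drop the second copy of each comma-delimited entry
  dplyr::filter(!(row_number() == 2 & str_detect(V1, ","))) %>%
  dplyr::ungroup() %>%
  # split the remaining comma-delimited entries into one row per value
  tidyr::separate_rows(V1)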
# A tibble: 8 x 1
V1
* <chr>
1 Group1
2 Group1
3 Group2
4 Group1
5 Group3
6 Group3
7 Group3
8 Group4
Answer 3:
Another option, using rowid from data.table:
library(data.table)
library(stringr)
data.table(V1 = df)[!(rowid(V1) == 2 & str_detect(V1, ",")),
.(V1 = unlist(strsplit(V1, ", ")))]
# V1
#1: Group1
#2: Group1
#3: Group2
#4: Group1
#5: Group3
#6: Group3
#7: Group3
#8: Group4
Or using tidyverse
library(dplyr)
library(tidyr)
tibble(V1 = df) %>%
filter(!duplicated(case_when(str_detect(V1, ',') ~ V1,
TRUE ~ make.unique(V1)))) %>%
separate_rows(V1)
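The filter above relies on a deduplication key: comma-delimited entries keep their own text, so their repeats are flagged by duplicated, while plain entries are passed through make.unique so that their repeats are not. A base R sketch of the same key may help show this; the variable names key, kept and result are hypothetical, and ifelse stands in for case_when.

```r
df <- c("Group1", "Group1", "Group2", "Group1, Group3", "Group1, Group3",
        "Group3", "Group3, Group4", "Group3, Group4")

# Comma-delimited entries keep their text; plain entries get a unique suffix
# on repeats (e.g. "Group1", "Group1.1"), so duplicated() never flags them.
key <- ifelse(grepl(",", df), df, make.unique(df))

# Keep every row whose key has not been seen before, then split the pairs.
kept   <- df[!duplicated(key)]
result <- unlist(strsplit(kept, ", "))
print(key)
print(result)
```

As with the other filter-then-split variants, this assumes the two copies of each pair are adjacent; otherwise both values land at the position of the first copy.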
Source: https://stackoverflow.com/questions/58329390/merge-duplicate-characters-in-r-while-preserving-data-frame-structure