Question
I have the following problem.
I have a data frame, data1, with a column in which each value holds several comma-separated entries:
data1 <- data.frame(v1 = c("test, test, bird", "bird, bird", "car"))
Now I want to remove duplicated entries in each row. The result should look like this:
data1.final <- data.frame(v1 = c("test, bird", "bird", "car"))
I tried this:
data1$ID <- 1:nrow(data1)
data1$v1 <- as.character(data1$v1)
data1 <- split(data1, data1$ID)
reduce.words <- function(x) {
  d <- unlist(strsplit(x$v1, split=" "))
  d <- paste(d[-which(duplicated(d))], collapse = ' ')
  x$v1 <- d
  return(x)
}
data1 <- lapply(data1, reduce.words)
data1 <- as.data.frame(do.call(rbind, data1))
However, this yields empty rows for every row except the first. Does anyone have an idea how to solve this?
Answer 1:
You seem to have a rather complicated workflow. What about just creating a simple function that works on the rows
reduce_row = function(i) {
  split = strsplit(i, split=", ")[[1]]
  paste(unique(split), collapse = ", ")
}
and then using apply
data1$v2 = apply(data1, 1, reduce_row)
to get
R> data1
                v1         v2
1 test, test, bird test, bird
2       bird, bird       bird
3              car        car
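As for why the original attempt returned empty rows: the likely cause is the combination of splitting on " " (which leaves the commas attached, so "bird," and "bird" are not equal) and indexing with -which(duplicated(d)). When nothing is duplicated, which() returns an empty vector, and a negated empty index selects nothing. A minimal sketch of that behaviour, using only the second row's string:

```r
# Pieces produced by splitting "bird, bird" on a single space
d <- strsplit("bird, bird", split = " ")[[1]]    # "bird," "bird"

# Nothing is flagged, because "bird," and "bird" are different strings
dup <- which(duplicated(d))                      # integer(0)

# A negated empty index is still empty, so every element is dropped
broken <- paste(d[-dup], collapse = " ")

# Splitting on ", " and using a logical index avoids both problems
d2 <- strsplit("bird, bird", split = ", ")[[1]]
fixed <- paste(d2[!duplicated(d2)], collapse = ", ")
```

Here broken is an empty string and fixed is "bird". Logical indexing with !duplicated(d) is the safer habit in general, since it behaves the same whether or not any duplicates exist.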
Answer 2:
Another option is to use cSplit from splitstackshape:
library(splitstackshape)
cSplit(cbind(data1, indx=1:nrow(data1)), 'v1', ', ', 'long')[,
    toString(v1[!duplicated(v1)]),
    by=indx][,indx:=NULL][]
#           V1
#1: test, bird
#2:       bird
#3:        car
Or, as @Ananda Mahto mentioned in the comments:
unique(cSplit(as.data.table(data1, keep.rownames = TRUE),
              "v1", ",", "long"))[, toString(v1), by = rn]
#   rn         V1
#1:  1 test, bird
#2:  2       bird
#3:  3        car
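If installing splitstackshape is not an option, the same per-row deduplication can be sketched in base R by working on the column directly instead of on whole rows. This assumes data1 is still the original data frame from the question (not the list produced by split()), and that v1 is a character column; the name parts is only illustrative.

```r
data1 <- data.frame(v1 = c("test, test, bird", "bird, bird", "car"),
                    stringsAsFactors = FALSE)

# strsplit is vectorised: it returns one character vector per row
parts <- strsplit(data1$v1, split = ", ", fixed = TRUE)

# Keep the first occurrence of each entry and join the rest back together;
# vapply guarantees exactly one string per row
data1$v2 <- vapply(parts,
                   function(p) paste(unique(p), collapse = ", "),
                   character(1))
```

Because it touches only v1, this version is unaffected by any other columns (such as an ID column) that may have been added to the data frame.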
Source: https://stackoverflow.com/questions/27173948/remove-duplicated-string-in-a-row