Remove Duplicated String in a Row

Question

I have a data frame, data1, with a variable v1 that holds several comma-separated entries per row:

data1 <- data.frame(v1 = c("test, test, bird", "bird, bird", "car"))

Now I want to remove duplicated entries in each row. The result should look like this:

data1.final <- data.frame(v1 = c("test, bird", "bird", "car"))

I tried this:

data1$ID <- 1:nrow(data1)
data1$v1 <- as.character(data1$v1)

data1 <- split(data1, data1$ID)
reduce.words <- function(x) {
  d <- unlist(strsplit(x$v1, split=" "))
  d <- paste(d[-which(duplicated(d))], collapse = ' ')
  x$v1 <- d 
  return(x)
}
data1 <- lapply(data1, reduce.words)
data1 <- as.data.frame(do.call(rbind, data1))

However, this yields empty rows, except for the first one. Does anyone have an idea how to solve this problem?

Answer 1:

You seem to have a rather complicated workflow. What about just creating a simple function that works on a single row:

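reduce_row = function(i) {
  # split the row's string on ", " and take the resulting character vector
  parts = strsplit(i, split = ", ")[[1]]
  # keep the first occurrence of each entry and glue them back together
  paste(unique(parts), collapse = ", ")
}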

and then calling it on every row with apply:

data1$v2 = apply(data1, 1, reduce_row)

to get

R> data1
                v1         v2
1 test, test, bird test, bird
2       bird, bird       bird
3              car        car
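For what it's worth, here is a sketch of why the attempt in the question returns empty strings, plus a variant that works on the column directly instead of going through apply. The first part reproduces what happens for the second row: splitting on " " leaves the comma attached ("bird," and "bird" are different strings), so nothing is flagged as duplicated, which() returns an empty vector, and negative indexing with an empty vector selects nothing. The helper name dedupe_entries and its sep argument are made up for this illustration; stringsAsFactors = FALSE is set so that v1 is character on older R versions too.

```r
data1 <- data.frame(v1 = c("test, test, bird", "bird, bird", "car"),
                    stringsAsFactors = FALSE)

# What the original split produces for row 2: the comma stays attached
d <- unlist(strsplit("bird, bird", split = " "))   # "bird," "bird"

# No duplicates found, so which() is integer(0) and d[-integer(0)] is empty
broken <- d[-which(duplicated(d))]

# Logical indexing does not have this problem
kept <- d[!duplicated(d)]

# Hypothetical helper: split on the full separator, then collapse unique entries
dedupe_entries <- function(x, sep = ", ") {
  parts <- strsplit(x, split = sep, fixed = TRUE)
  vapply(parts, function(p) paste(unique(p), collapse = sep), character(1))
}

data1$v2 <- dedupe_entries(data1$v1)
data1
```

Because it takes the column itself, this version does not depend on how many other columns data1 happens to have.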

Answer 2:

Another option is to use cSplit from the splitstackshape package:

library(splitstackshape)
cSplit(cbind(data1, indx = 1:nrow(data1)), 'v1', ', ', 'long')[
  , toString(v1[!duplicated(v1)]), by = indx][, indx := NULL][]
#            V1
# 1: test, bird
# 2:       bird
# 3:        car

Or, as @Ananda Mahto mentioned in the comments:

unique(cSplit(as.data.table(data1, keep.rownames = TRUE),
              "v1", ",", "long"))[, toString(v1), by = rn]
#    rn         V1
# 1:  1 test, bird
# 2:  2       bird
# 3:  3        car
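If you would rather not add a package, the same "go long, drop duplicates, collapse back" idea can be sketched in base R. This is only an illustration of the approach, not what cSplit does internally; the names parts, long and result are made up here, and rn simply mirrors the row-id column used above.

```r
data1 <- data.frame(v1 = c("test, test, bird", "bird, bird", "car"),
                    stringsAsFactors = FALSE)

# One character vector of entries per original row
parts <- strsplit(data1$v1, split = ", ", fixed = TRUE)

# Long format: one line per entry, tagged with the row it came from
long <- data.frame(rn = rep(seq_along(parts), lengths(parts)),
                   v1 = unlist(parts),
                   stringsAsFactors = FALSE)

# Drop repeated (rn, v1) pairs, then paste the survivors back together per row
long <- unique(long)
result <- aggregate(v1 ~ rn, data = long, FUN = toString)
result
```

The intermediate long table is handy if you also want to count or filter entries before collapsing them again.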

Source: https://stackoverflow.com/questions/27173948/remove-duplicated-string-in-a-row

Tags

duplicates

collapse