How to remove duplicated values in uneven columns of a data.table?

Question

I want to remove duplicated values in each column of an uneven data.table. For instance, suppose the original data looks like this (the real data.table has many more columns and rows):

library(data.table)
dt <- data.table(A = c("5p", "3p", "3p", "6y", NA), B = c("1c", "4r", "1c", NA, NA), C = c("4f", "5", "5", "5", "4m"))
> dt
      A    B  C
1:   5p   1c 4f
2:   3p   4r  5
3:   3p   1c  5
4:   6y <NA>  5
5: <NA> <NA> 4m

After removing the duplicated values in each column, it should look like this:

A    B    C
5p   1c   4f
3p   4r   5
NA   NA   NA
6y   NA   NA
NA   NA   4m

I am trying a solution proposed in another thread that uses data.table. However, only the first duplicated value in each column gets replaced with NA, not the subsequent ones.

cols <- colnames(dt)
dt[, lapply(.SD, function(x) replace(x, anyDuplicated(x), NA)), .SDcols = cols]
# printed result of the call above (dt itself is unchanged, since nothing is assigned)
      A    B    C
1:   5p   1c   4f
2:   3p   4r    5
3: <NA> <NA> <NA>
4:   6y <NA>    5
5: <NA> <NA>   4m

How should I modify the code to get all duplicates replaced?

Answer 1:

You were very close. Instead of using anyDuplicated, I used duplicated like this:

dt[, lapply(.SD, function(x) ifelse(duplicated(x), NA, x))]

Try dt[, lapply(.SD, duplicated)] to get an idea of what the ifelse will do.
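
To see why the original attempt only blanked one value per column, here is a small base R sketch on a single vector (the values are column C from the question; the variable names x, first_only and all_dups are just illustrative). anyDuplicated returns a single index, the position of the first duplicate, whereas duplicated returns a logical vector that flags every repeat:

```r
# Column C from the question, as a plain character vector
x <- c("4f", "5", "5", "5", "4m")

# anyDuplicated() gives one integer: the index of the first duplicate
# (or 0 when there are no duplicates at all)
anyDuplicated(x)

# duplicated() gives one logical per element, TRUE for every repeat
duplicated(x)

# replace() with a single index only touches that one position
first_only <- replace(x, anyDuplicated(x), NA)

# replace() with the logical vector touches every repeat
all_dups <- replace(x, duplicated(x), NA)

print(first_only)
print(all_dups)
```

So in the question's code, replace received only the index 3 for each column, which is why row 3 became NA and the later repeats survived.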

Answer 2:

I believe this would be the proper data.table way of achieving this task:

cols <- colnames(dt)
dt[, (cols) := lapply(.SD, function(x) replace(x, duplicated(x), NA))]

      A    B    C
1:   5p   1c   4f
2:   3p   4r    5
3: <NA> <NA> <NA>
4:   6y <NA> <NA>
5: <NA> <NA>   4m

Note:

.SD defaults to all columns, so in this case there is no need to specify the .SDcols argument.
Using := updates the columns by reference, which avoids copying the whole data.table.
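
As a rough illustration of the second point, the sketch below contrasts the two styles on the data from the question. The helper name dedup and the variable res are made up for this example; everything else is taken from the answers above. Without :=, the call returns a new data.table and leaves dt alone; with :=, dt itself is modified in place:

```r
library(data.table)

dt <- data.table(A = c("5p", "3p", "3p", "6y", NA),
                 B = c("1c", "4r", "1c", NA, NA),
                 C = c("4f", "5", "5", "5", "4m"))

# Illustrative helper: blank out every repeated value in a vector
dedup <- function(x) replace(x, duplicated(x), NA)

# No assignment: the result is a new data.table, dt is not changed
res <- dt[, lapply(.SD, dedup)]
print(dt$C)   # still the original values at this point

# Assignment by reference: the columns of dt are overwritten in place
cols <- colnames(dt)
dt[, (cols) := lapply(.SD, dedup)]
print(dt)
```

If the original table is still needed afterwards, taking a copy with copy(dt) before the := step is one way to keep it.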

Source: https://stackoverflow.com/questions/59771098/how-to-remove-duplicated-values-in-uneven-columns-of-a-data-table

Tags

duplicates

data.table