Remove duplicate values across a few columns but keep rows

Posted by 陌路散爱 on 2021-02-19 05:14:05

Question


I have a dataframe that looks like this:

dat <- data.frame(id=1:6,
                  z_1=c(100,290,38,129,0,290),
                  z_2=c(20,0,0,0,0,290),
                  z_3=c(0,0,38,0,0,98),
                  z_4=c(0,0,38,127,38,78),
                  z_5=c(23,0,25,0,0,98),
                  z_6=c(100,0,25,127,0,9))

dat

  id z_1 z_2 z_3 z_4 z_5 z_6
1  1 100  20   0   0  23 100
2  2 290   0   0   0   0   0
3  3  38   0  38  38  25  25
4  4 129   0   0 127   0 127
5  5   0   0   0  38   0   0
6  6 290 290  98  78  98   9

I want to remove duplicate z_x values within each row, replacing any duplicates with either 0 or NA, while leaving the rows and columns intact (i.e. not dropping any). The 0s here do not count as duplicates; they are missing values. Duplicate values within a column are fine. My ideal output would look like this:

  id z_1 z_2 z_3 z_4 z_5 z_6
1  1 100  20   0   0  23   0
2  2 290   0   0   0   0   0
3  3  38   0   0   0  25   0
4  4 129   0   0 127   0   0
5  5   0   0   0  38   0   0
6  6 290   0  98  78   0   9

I don't really care what order the values appear in within the z_x columns, so it's fine if they get moved around. Is there an efficient way to do this, preferably with the tidyverse? I know I can pivot longer and drop duplicate rows, but my dataset is very large and I'm looking for a way to do this without pivoting.


Answer 1:


A base R way using apply:

cols <- grep('z_\\d+', names(dat))
dat[cols] <- t(apply(dat[cols], 1, function(x)  replace(x, duplicated(x), 0)))

#  id z_1 z_2 z_3 z_4 z_5 z_6
#1  1 100  20   0   0  23   0
#2  2 290   0   0   0   0   0
#3  3  38   0   0   0  25   0
#4  4 129   0   0 127   0   0
#5  5   0   0   0  38   0   0
#6  6 290   0  98  78   0   9
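
As a small sketch of what happens inside the apply() call, the snippet below runs the same replace()/duplicated() step on row 1 of the example data. It also shows one possible variant for the NA case mentioned in the question: because duplicated() flags repeated zeros as well, an NA replacement would presumably need to skip them. The names dup, zero_fill and na_fill are illustrative only and do not appear in the original answer.

```r
# Row 1 of the example data
x <- c(z_1 = 100, z_2 = 20, z_3 = 0, z_4 = 0, z_5 = 23, z_6 = 100)

# duplicated() marks every repeat of an earlier value, repeated zeros included
dup <- duplicated(x)

# Replacing with 0 leaves the repeated zeros unchanged
zero_fill <- replace(x, dup, 0)

# Assumed variant for NA: skip the zeros so that missing values stay 0
na_fill <- replace(x, dup & x != 0, NA)
```

With a replacement value of 0 the extra flags on the zeros are harmless, which is why the answer's one-liner does not need the x != 0 condition.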

A tidyverse way without reshaping uses pmap:

library(tidyverse)

dat %>%
  mutate(result = pmap(select(., matches('z_\\d+')), ~{
    x <- c(...)
    replace(x, duplicated(x), 0)
    })) %>%
  select(id, result) %>%
  unnest_wider(result)

Since tests performed by @thelatemail suggest that reshaping is a better option than handling the data row-wise, you might want to consider it:

dat %>%
  pivot_longer(cols = matches('z_\\d+')) %>%
  group_by(id) %>%
  mutate(value = replace(value, duplicated(value), 0)) %>%
  pivot_wider()



Answer 2:


This solution isn't tidyverse, but hopefully it is simple enough.

The duplicated() function does what you want. You can use the apply() function to feed duplicated() your data row by row.

# dat[-1] drops the id column so that an id is never compared with the z_ values
dat[-1][t(apply(dat[-1], MARGIN = 1, duplicated))] <- 0
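
A hedged illustration of why it can matter which columns are handed to apply(): if id takes part in the comparison, an id that happens to equal a z value marks that value as a duplicate. The toy data frame below is hypothetical and not taken from the question; all_cols and z_only are illustrative names.

```r
# Hypothetical two-row data frame in which an id happens to equal a z value
toy <- data.frame(id = c(38, 2), z_1 = c(38, 5), z_2 = c(7, 5))

# Checking every column: z_1 in row 1 is flagged because it repeats the id
all_cols <- t(apply(toy, MARGIN = 1, duplicated))

# Checking only the z_ columns: just the real duplicate in row 2 is flagged
z_cols <- grep("^z_\\d+$", names(toy))
z_only <- t(apply(toy[z_cols], MARGIN = 1, duplicated))
```

In the question's data no id collides with a z value in its own row, so both forms give the same result there.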


Source: https://stackoverflow.com/questions/66234422/remove-duplicate-values-across-a-few-columns-but-keep-rows
