Deleting reversed duplicates with R

野的像风 2020-11-27 22:32

I have a data frame in R that contains the gene ids of paralogous genes in Arabidopsis, looking something like this:

gene_x    gene_y
AT1            


        
3 answers
  •  臣服心动
    2020-11-27 22:58

    A dplyr possibility could be:

    mydf %>%
     group_by(grp = paste(pmax(gene_x, gene_y), pmin(gene_x, gene_y), sep = "_")) %>%
     slice(1) %>%
     ungroup() %>%
     select(-grp)
    
      gene_x gene_y
      <chr>  <chr> 
    1 AT1    AT2   
    2 AT1    AT3   
    3 AT3    AT4   
    

    Or:

    mydf %>%
     group_by(grp = paste(pmax(gene_x, gene_y), pmin(gene_x, gene_y), sep = "_")) %>%
     filter(row_number() == 1) %>%
     ungroup() %>%
     select(-grp)
    

    Or:

    mydf %>%
     group_by(grp = paste(pmax(gene_x, gene_y), pmin(gene_x, gene_y), sep = "_")) %>%
     distinct(grp, .keep_all = TRUE) %>%
     ungroup() %>%
     select(-grp)
    

    Or using dplyr and purrr:

    mydf %>%
     group_by(grp = paste(invoke(pmax, .), invoke(pmin, .), sep = "_")) %>%
     slice(1) %>%
     ungroup() %>%
     select(-grp)
    

    As of purrr 0.3.0, invoke() is retired; exec() should be used instead:

    mydf %>%
     group_by(grp = paste(exec(pmax, !!!.), exec(pmin, !!!.), sep = "_")) %>%
     slice(1) %>%
     ungroup() %>%
     select(-grp)
    

    Or:

    mydf %>%
     rowwise() %>%
     mutate(grp = paste(sort(c(gene_x, gene_y)), collapse = "_")) %>%
     group_by(grp) %>%
     slice(1) %>%
     ungroup() %>%
     select(-grp)
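
    All of the variants above rely on the same idea: build a key that is identical for (a, b) and (b, a), then keep one row per key. For reference, here is a minimal base R sketch of that idea, without dplyr. The sample rows are hypothetical (the table in the question is truncated), and the names key_vec, key_row and dedup are only illustrative:

    ```r
    # Hypothetical sample data: the question's table is truncated, so these
    # rows are an assumption chosen to include reversed pairs.
    mydf <- data.frame(
      gene_x = c("AT1", "AT1", "AT2", "AT3", "AT3", "AT4"),
      gene_y = c("AT2", "AT3", "AT1", "AT4", "AT1", "AT3"),
      stringsAsFactors = FALSE
    )

    # Vectorised key: pmin()/pmax() put the two ids in a fixed order per row.
    key_vec <- paste(pmin(mydf$gene_x, mydf$gene_y),
                     pmax(mydf$gene_x, mydf$gene_y),
                     sep = "_")

    # Row-by-row key: sort each pair, mirroring the rowwise() variant.
    key_row <- mapply(
      function(x, y) paste(sort(c(x, y)), collapse = "_"),
      mydf$gene_x, mydf$gene_y,
      USE.NAMES = FALSE
    )

    # Keep only the first row seen for each key.
    dedup <- mydf[!duplicated(key_vec), ]
    print(dedup)
    ```

    Both keys should agree on data like this; the vectorised one avoids a per-row function call, which probably matters on large tables. Note that pmin() and pmax() need character columns here, so factor columns would have to be converted with as.character() first.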
    
