Deleting reversed duplicates with R

野的像风 2020-11-27 22:32

I have a data frame in R that contains the gene ids of paralogous genes in Arabidopsis, looking something like this:

gene_x    gene_y
AT1            


        
3 Answers
  • 2020-11-27 22:58
    mydf <- read.table(text="gene_x    gene_y
    AT1       AT2
    AT3       AT4
    AT1       AT2
    AT1       AT3
    AT2       AT1", header=TRUE, stringsAsFactors=FALSE)
    

    Here's one strategy using apply, sort, paste, and duplicated:

    mydf[!duplicated(apply(mydf,1,function(x) paste(sort(x),collapse=''))),]
      gene_x gene_y
    1    AT1    AT2
    2    AT3    AT4
    4    AT1    AT3
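
    To see why this works, it can help to look at the intermediate key. The sketch below rebuilds the same example data with data.frame() so that it runs on its own; `key` is only an illustrative name. It also shows one caveat that may or may not matter for your ids: with collapse='' two different pairs can paste to the same string, so a separator that never occurs in the ids is the safer choice.

    ```r
    mydf <- data.frame(
      gene_x = c("AT1", "AT3", "AT1", "AT1", "AT2"),
      gene_y = c("AT2", "AT4", "AT2", "AT3", "AT1"),
      stringsAsFactors = FALSE
    )

    # Sort each row's two ids, then paste them into one order-independent key
    key <- apply(mydf, 1, function(x) paste(sort(x), collapse = ''))
    key
    #> [1] "AT1AT2" "AT3AT4" "AT1AT2" "AT1AT3" "AT1AT2"

    # duplicated() flags every repeat of a key after its first occurrence
    duplicated(key)
    #> [1] FALSE FALSE  TRUE FALSE  TRUE

    # Caveat: without a separator, distinct pairs can collide
    paste(sort(c("A", "BC")), collapse = '') == paste(sort(c("AB", "C")), collapse = '')
    #> [1] TRUE
    paste(sort(c("A", "BC")), collapse = '|') == paste(sort(c("AB", "C")), collapse = '|')
    #> [1] FALSE
    ```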
    

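    And here's a slightly different solution, which transposes the data frame so that each row becomes a list element, sorts each element, and runs duplicated on the resulting list: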

    mydf[!duplicated(lapply(as.data.frame(t(mydf), stringsAsFactors=FALSE), sort)),]
      gene_x gene_y
    1    AT1    AT2
    2    AT3    AT4
    4    AT1    AT3
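
    If the row-wise apply becomes slow on a large table, a vectorized variant might be worth trying. This is only a sketch, assuming both columns are character vectors (not factors); `lo`, `hi`, and `dedup` are illustrative names. It puts the smaller id of each pair in one vector and the larger in another, then lets duplicated() compare the two-column key row by row, so no pasted string is needed.

    ```r
    mydf <- data.frame(
      gene_x = c("AT1", "AT3", "AT1", "AT1", "AT2"),
      gene_y = c("AT2", "AT4", "AT2", "AT3", "AT1"),
      stringsAsFactors = FALSE
    )

    # pmin()/pmax() work element-wise on character vectors
    lo <- pmin(mydf$gene_x, mydf$gene_y)
    hi <- pmax(mydf$gene_x, mydf$gene_y)

    # duplicated() on a data frame compares whole rows of the (lo, hi) key
    dedup <- mydf[!duplicated(data.frame(lo, hi)), ]
    dedup
    #>   gene_x gene_y
    #> 1    AT1    AT2
    #> 2    AT3    AT4
    #> 4    AT1    AT3
    ```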
    
  • 2020-11-27 22:58

    One dplyr possibility:

    library(dplyr)
    
    mydf %>%
     group_by(grp = paste(pmax(gene_x, gene_y), pmin(gene_x, gene_y), sep = "_")) %>%
     slice(1) %>%
     ungroup() %>%
     select(-grp)
    
      gene_x gene_y
      <chr>  <chr> 
    1 AT1    AT2   
    2 AT1    AT3   
    3 AT3    AT4  
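
    Note that the rows come back in a different order than in the base R answer. A small base R sketch of the grouping key shows why (the data is rebuilt with data.frame() so the snippet runs on its own): the key puts the larger id first, and the grouped result is ordered by that key rather than by original row position.

    ```r
    mydf <- data.frame(
      gene_x = c("AT1", "AT3", "AT1", "AT1", "AT2"),
      gene_y = c("AT2", "AT4", "AT2", "AT3", "AT1"),
      stringsAsFactors = FALSE
    )

    # Larger id first, smaller id second: (AT1, AT2) and (AT2, AT1) share a key
    grp <- paste(pmax(mydf$gene_x, mydf$gene_y),
                 pmin(mydf$gene_x, mydf$gene_y), sep = "_")
    grp
    #> [1] "AT2_AT1" "AT4_AT3" "AT2_AT1" "AT3_AT1" "AT2_AT1"

    # The distinct keys in sorted order match the row order of the output above
    sort(unique(grp))
    #> [1] "AT2_AT1" "AT3_AT1" "AT4_AT3"
    ```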
    

    Or:

    mydf %>%
     group_by(grp = paste(pmax(gene_x, gene_y), pmin(gene_x, gene_y), sep = "_")) %>%
     filter(row_number() == 1) %>%
     ungroup() %>%
     select(-grp)
    

    Or:

    mydf %>%
     group_by(grp = paste(pmax(gene_x, gene_y), pmin(gene_x, gene_y), sep = "_")) %>%
     distinct(grp, .keep_all = TRUE) %>%
     ungroup() %>%
     select(-grp)
    

    Or using dplyr and purrr:

    library(purrr)
    
    mydf %>%
     group_by(grp = paste(invoke(pmax, .), invoke(pmin, .), sep = "_")) %>%
     slice(1) %>%
     ungroup() %>%
     select(-grp)
    

    As of purrr 0.3.0, invoke() is retired and exec() should be used instead:

    mydf %>%
     group_by(grp = paste(exec(pmax, !!!.), exec(pmin, !!!.), sep = "_")) %>%
     slice(1) %>%
     ungroup() %>%
     select(-grp)
    

    Or:

    mydf %>%
     rowwise() %>%
     mutate(grp = paste(sort(c(gene_x, gene_y)), collapse = "_")) %>%
     group_by(grp) %>%
     slice(1) %>%
     ungroup() %>%
     select(-grp)
    
  • 2020-11-27 23:12

    Another tidyverse-centric approach, this time using purrr:

    library(tidyverse)
    
    c_sort_collapse <- function(...){
      c(...) %>% 
        sort() %>% 
        str_c(collapse = ".")
    }
    
    mydf %>% 
      mutate(x_y = map2_chr(gene_x, gene_y, c_sort_collapse)) %>% 
      distinct(x_y, .keep_all = TRUE) %>% 
      select(-x_y)
    #>   gene_x gene_y
    #> 1    AT1    AT2
    #> 2    AT3    AT4
    #> 3    AT1    AT3
    