How to build a data matrix from a mixed and messy CSV file?


失恋的感觉 2021-01-28 01:11

I have a huge .csv file like this:

Transcript Id   Gene Id(name)   Mirna Name  miTG score
ENST00000286800 ENSG00000156273 (BACH1) hsa-let-7a-5p   1


      
      
        
2 answers

误落风尘 (original poster) 2021-01-28 01:56

Using this test data:

Lines <- " Transcript Id   Gene Id(name)   Mirna Name  miTG score
ENST00000286800 ENSG00000156273 (BACH1) hsa-let-7a-5p   1
UTR3    21:30717114-30717142    0.05994568  
UTR3    21:30717414-30717442    0.13591267  
ENST00000345080 ENSG00000187772 (LIN28B)    hsa-let-7a-5p   1
UTR3    6:105526681-105526709   0.133514751"


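Read it all in and set the names, nms, for the output. Then calculate the grouping vector, cs, using a cumulative sum: each ENST row has a non-NA value in the Mirna column and therefore starts a new group, while the UTR3 rows that follow it inherit the same group number. Within each group, the non-duplicated row is the ENST row and the duplicated rows are its UTR3 rows. Merge these two sets of rows by group and keep the row with the highest MRE_score in each group: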

DF <- read.table(text = Lines, header = TRUE, fill = TRUE, as.is = TRUE, 
         check.names = FALSE)
nms <- c("cs", names(DF)[1:5], "UTR3", "MRE_score") # out will have these names
DF$cs <- cumsum(!is.na(DF$Mirna)) # groups each ENST row with its UTR3 rows
dup <- duplicated(DF$cs) # FALSE for ENST rows and TRUE for UTR3 rows
both <- merge(DF[!dup, ], DF[dup, ], by = "cs")[c(1:6, 11:12)]  # merge ENST & UTR3 rows
names(both) <- nms
both$MRE_score <- as.numeric(both$MRE_score)
Rank <- function(x) rank(x, ties.method = "first")
out <- both[ave(-both$MRE_score, both$cs, FUN = Rank) == 1, -1] # only keep largest score


Here we get:

> out
       Transcript              Id     Gene      Id(name) Mirna                  UTR3 MRE_score
2 ENST00000286800 ENSG00000156273  (BACH1) hsa-let-7a-5p     1  21:30717414-30717442 0.1359127
3 ENST00000345080 ENSG00000187772 (LIN28B) hsa-let-7a-5p     1 6:105526681-105526709 0.1335148
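
The grouping step is easier to follow in isolation. The following is a minimal sketch on a made-up vector (the name mirna and its values are assumptions for illustration, not part of the data above); it shows how cumsum turns "header row followed by detail rows" into group numbers and how duplicated separates the two kinds of rows:

```r
# Hypothetical toy column: a non-NA value marks a header (ENST-like) row,
# NA marks a detail (UTR3-like) row that belongs to the header above it.
mirna <- c(1, NA, NA, 1, NA)

# Each non-NA value increments the running count, so every header row
# starts a new group and the NA rows below it share that group number.
cs <- cumsum(!is.na(mirna))
print(cs)

# The first row of each group is not a duplicate (header row);
# every later row in the same group is (detail rows).
dup <- duplicated(cs)
print(dup)
```

With this layout, indexing by !dup selects the header rows and indexing by dup selects the detail rows, which is what the merge by cs relies on.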


Note that the question refers to a CDS column, but it is neither described nor present in the example output, so we ignored it.
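
As an aside on the last line of the code, the selection of the largest score per group can be checked on toy values. This is only a sketch: score and grp are made-up vectors, while Rank is defined the same way as in the answer. Negating the scores makes the largest value rank first within each group, and ties.method = "first" guarantees exactly one row per group has rank 1 even when scores tie:

```r
# Hypothetical toy data: three detail rows in two groups.
score <- c(0.06, 0.14, 0.13)
grp   <- c(1, 1, 2)

# Same helper as in the answer: ties are broken by order of appearance.
Rank <- function(x) rank(x, ties.method = "first")

# ave() applies Rank within each group and returns the results in the
# original row order; ranking -score puts the largest score at rank 1.
r <- ave(-score, grp, FUN = Rank)

# Keep only the top-scoring row of each group.
keep <- r == 1
print(keep)
```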
    
             
                                                        
            
            
              
                