I have a huge .csv file like this:
Transcript Id Gene Id(name) Mirna Name miTG score
ENST00000286800 ENSG00000156273 (BACH1) hsa-let-7a-5p 1
Using this test data:
Lines <- " Transcript Id Gene Id(name) Mirna Name miTG score
ENST00000286800 ENSG00000156273 (BACH1) hsa-let-7a-5p 1
UTR3 21:30717114-30717142 0.05994568
UTR3 21:30717414-30717442 0.13591267
ENST00000345080 ENSG00000187772 (LIN28B) hsa-let-7a-5p 1
UTR3 6:105526681-105526709 0.133514751"
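Read it all in and set the names, nms, for the output. Then calculate the grouping vector, cs, using a cumulative sum: it increases by one at every ENST row, so each ENST row and the UTR3 rows that follow it share the same value. Non-duplicated values of cs mark the first row of each group (the ENST row) and duplicated values mark the following rows (the UTR3 rows). Merge these two sets of rows by group and keep the row with the highest MRE_score in each group: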
DF <- read.table(text = Lines, header = TRUE, fill = TRUE, as.is = TRUE,
check.names = FALSE)
nms <- c("cs", names(DF)[1:5], "UTR3", "MRE_score") # out will have these names
DF$cs <- cumsum(!is.na(DF$Mirna)) # groups each ENST row with its UTR3 rows
dup <- duplicated(DF$cs) # FALSE for ENST rows and TRUE for UTR3 rows
both <- merge(DF[!dup, ], DF[dup, ], by = "cs")[c(1:6, 11:12)] # merge ENST & UTR3 rows
names(both) <- nms
both$MRE_score <- as.numeric(both$MRE_score)
Rank <- function(x) rank(x, ties.method = "first")
out <- both[ave(-both$MRE_score, both$cs, FUN = Rank) == 1, -1] # only keep largest score
Here we get:
> out
       Transcript              Id     Gene      Id(name) Mirna                  UTR3 MRE_score
2 ENST00000286800 ENSG00000156273  (BACH1) hsa-let-7a-5p     1  21:30717414-30717442 0.1359127
3 ENST00000345080 ENSG00000187772 (LIN28B) hsa-let-7a-5p     1 6:105526681-105526709 0.1335148
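To see how the grouping vector behaves in isolation, here is a minimal sketch on a toy vector. The name mirna is a hypothetical stand-in for DF$Mirna, assumed to hold a value on ENST rows and NA on UTR3 rows, as in the test data above.

```r
# Hypothetical stand-in for DF$Mirna: a value on each ENST row, NA on each UTR3 row
mirna <- c(1, NA, NA, 1, NA)

cs  <- cumsum(!is.na(mirna))  # counter goes up by 1 at every ENST row
dup <- duplicated(cs)         # FALSE only on the first row of each group

print(cs)   # 1 1 1 2 2
print(dup)  # FALSE TRUE TRUE FALSE TRUE
```

Rows with dup equal to FALSE are the ENST rows and rows with dup equal to TRUE are the UTR3 rows that belong to them, which is what the merge by cs relies on.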
Note that the question refers to a CDS column, but it is not described and does not appear in the example output, so we ignored it.
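As a cross-check on the step that keeps the largest score per group, the following sketch uses a different technique from the ave/rank approach above: sort by group and descending score, then keep the first row of each group. The small data frame here is a hand-built stand-in for the merged result, reduced to three columns for illustration.

```r
# Hand-built stand-in for the merged data (columns reduced for illustration)
both <- data.frame(
  cs        = c(1, 1, 2),
  UTR3      = c("21:30717114-30717142", "21:30717414-30717442",
                "6:105526681-105526709"),
  MRE_score = c(0.05994568, 0.13591267, 0.133514751)
)

# Sort within each group by descending score, then keep the first row per group
o   <- both[order(both$cs, -both$MRE_score), ]
top <- o[!duplicated(o$cs), ]

print(top)
```

On this input the rows kept are the same UTR3 sites as in the output shown earlier. Like ties.method = "first", this keeps exactly one row per group when scores are tied.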