可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
I have a huge .csv
file like this :
Transcript Id Gene Id(name) Mirna Name miTG score ENST00000286800 ENSG00000156273 (BACH1) hsa-let-7a-5p 1 UTR3 21:30717114-30717142 0.05994568 UTR3 21:30717414-30717442 0.13591267 ENST00000345080 ENSG00000187772 (LIN28B) hsa-let-7a-5p 1 UTR3 6:105526681-105526709 0.133514751
and I want to build a matrix like this from it :
Transcript Id Gene Id(name) Mirna Name miTG score UTR3 MRE_score ENST00000286800 ENSG00000156273 (BACH1) hsa-let-7a-5p 1 21:30717414-30717442 0.13591267
I want to add three new columns into my new matrix called UTR3
, MRE_score
and CDS
.
For every Gene ID
(for example ENST00000286800
), there are several UTR3
in the original matrix (here two UTR3
's for ENST00000286800
, and one UTR3
for ENST00000345080
) we choose the UTR3
with the highest score in the third column. In the new matrix, the value of UTR3
for every Gene ID
will be the value of UTR3
in the second column of the original matrix.
Can any body help me to reshape this data and build my new matrix?
回答1:
You could try to structure the CSV using regular expressions:
textfile <- "ENST00000286800 ENSG00000156273 (BACH1) hsa-let-7a-5p 1 UTR3 21:30717114-30717142 0.05994568 UTR3 21:30717414-30717442 0.13591267 ENST00000345080 ENSG00000187772 (LIN28B) hsa-let-7a-5p 1 UTR3 6:105526681-105526709 0.133514751" txt <- readLines(textConnection(textfile)) sepr <- grepl("^ENST.*", txt) r <- rle(sepr) r <- r$lengths[!r$values] regex <- "(\\S+)\\s+(\\S+)\\s(\\([^)]+\\)\\s+\\S+)\\s+(\\d+)" m <- regexec(regex, txt[sepr]) m1 <- as.data.frame(t(sapply(regmatches(txt[sepr], m), "[", 2:5))) m1 <- m1[rep(1:nrow(m1), r),] regex <- "(\\S+)\\s+(\\S+)\\s+(\\S+)" m <- regexec(regex, txt[!sepr]) m2 <- as.data.frame(t(sapply(regmatches(txt[!sepr], m), "[", 2:4))) df <- cbind(m1, m2[,-1]) names(df) <- c("Transcript Id", "Gene Id(name)", "Mirna Name", "miTG score", "UTR3", "MRE_score" ) rownames(df) <- NULL df # Transcript Id Gene Id(name) Mirna Name miTG score UTR3 MRE_score # 1 ENST00000286800 ENSG00000156273 (BACH1) hsa-let-7a-5p 1 21:30717114-30717142 0.05994568 # 2 ENST00000286800 ENSG00000156273 (BACH1) hsa-let-7a-5p 1 21:30717414-30717442 0.13591267 # 3 ENST00000345080 ENSG00000187772 (LIN28B) hsa-let-7a-5p 1 6:105526681-105526709 0.133514751
回答2:
Using this test data:
Lines <- " Transcript Id Gene Id(name) Mirna Name miTG score ENST00000286800 ENSG00000156273 (BACH1) hsa-let-7a-5p 1 UTR3 21:30717114-30717142 0.05994568 UTR3 21:30717414-30717442 0.13591267 ENST00000345080 ENSG00000187772 (LIN28B) hsa-let-7a-5p 1 UTR3 6:105526681-105526709 0.133514751"
read it all in and set the names, nms
for the output. Then calculate the grouping vector, cs
, using a cumulative sum. non-duplicates are the first row of each group and duplicates are the following rows. Merge these two sets of rows by group and extract out the highest MRE_score
in each group:
DF <- read.table(text = Lines, header = TRUE, fill = TRUE, as.is = TRUE, check.names = FALSE) nms <- c("cs", names(DF)[1:5], "UTR3", "MRE_score") # out will have these names DF$cs <- cumsum(!is.na(DF$Mirna)) # groups each ENST row with its UTR3 rows dup <- duplicated(DF$cs) # FALSE for ENST rows and TRUE for UTR3 rows both <- merge(DF[!dup, ], DF[dup, ], by = "cs")[c(1:6, 11:12)] # merge ENST & UTR3 rows names(both) <- nms both$MRE_score <- as.numeric(both$MRE_score) Rank <- function(x) rank(x, ties.method = "first") out <- both[ave(-both$MRE_score, both$cs, FUN = Rank) == 1, -1] # only keep largest score
Here we get:
> out Transcript Id Gene Id(name) Mirna UTR3 MRE_score 2 ENST00000286800 ENSG00000156273 (BACH1) hsa-let-7a-5p 1 21:30717414-30717442 0.1359127 3 ENST00000345080 ENSG00000187772 (LIN28B) hsa-let-7a-5p 1 6:105526681-105526709 0.1335148
Note that the question refers to a CDS
column but what it is is not described nor does it appear in the example output so we ignored it.