
Question:

I have a huge .csv file like this:

Transcript Id   Gene Id(name)   Mirna Name  miTG score
ENST00000286800 ENSG00000156273 (BACH1) hsa-let-7a-5p   1
UTR3    21:30717114-30717142    0.05994568
UTR3    21:30717414-30717442    0.13591267
ENST00000345080 ENSG00000187772 (LIN28B)    hsa-let-7a-5p   1
UTR3    6:105526681-105526709   0.133514751

and I want to build a matrix like this from it:

Transcript Id    Gene Id(name)            Mirna Name     miTG score  UTR3                  MRE_score
ENST00000286800  ENSG00000156273 (BACH1)  hsa-let-7a-5p  1           21:30717414-30717442  0.13591267

I want to add three new columns to my new matrix, called UTR3, MRE_score and CDS.

For every transcript ID (for example ENST00000286800), the original file can contain several UTR3 rows (here two for ENST00000286800 and one for ENST00000345080). We choose the UTR3 row with the highest score, which is the value in its third column. In the new matrix, the UTR3 value for each transcript is the value from the second column of that row in the original file.

Can anybody help me reshape this data and build my new matrix?

Answer 1:

You could try to structure the CSV using regular expressions:

textfile <- "ENST00000286800 ENSG00000156273 (BACH1) hsa-let-7a-5p   1
UTR3    21:30717114-30717142    0.05994568
UTR3    21:30717414-30717442    0.13591267
ENST00000345080 ENSG00000187772 (LIN28B)    hsa-let-7a-5p   1
UTR3    6:105526681-105526709   0.133514751"
txt <- readLines(textConnection(textfile))

# TRUE for transcript (ENST) lines, FALSE for the UTR3 lines below them
sepr <- grepl("^ENST.*", txt)

# number of UTR3 lines that follow each transcript line
r <- rle(sepr)
r <- r$lengths[!r$values]

# split each transcript line into four fields and repeat it once per UTR3 line
regex <- "(\\S+)\\s+(\\S+)\\s(\\([^)]+\\)\\s+\\S+)\\s+(\\d+)"
m <- regexec(regex, txt[sepr])
m1 <- as.data.frame(t(sapply(regmatches(txt[sepr], m), "[", 2:5)))
m1 <- m1[rep(1:nrow(m1), r),]

# split each UTR3 line into label, location and score
regex <- "(\\S+)\\s+(\\S+)\\s+(\\S+)"
m <- regexec(regex, txt[!sepr])
m2 <- as.data.frame(t(sapply(regmatches(txt[!sepr], m), "[", 2:4)))

# drop the constant "UTR3" label and bind both parts side by side
df <- cbind(m1, m2[,-1])
names(df) <- c("Transcript Id", "Gene Id(name)", "Mirna Name", "miTG score", "UTR3", "MRE_score")
rownames(df) <- NULL
df
#     Transcript Id   Gene Id(name)                Mirna Name miTG score                  UTR3   MRE_score
# 1 ENST00000286800 ENSG00000156273     (BACH1) hsa-let-7a-5p          1  21:30717114-30717142  0.05994568
# 2 ENST00000286800 ENSG00000156273     (BACH1) hsa-let-7a-5p          1  21:30717414-30717442  0.13591267
# 3 ENST00000345080 ENSG00000187772 (LIN28B)    hsa-let-7a-5p          1 6:105526681-105526709 0.133514751
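The data frame above still holds every UTR3 row, whereas the question asks for only the highest-scoring one per transcript. A minimal sketch of that last step, assuming a data frame shaped like `df`; it is rebuilt by hand here (with only the columns needed) so the snippet runs on its own, and `best` and `keep` are illustrative names, not part of the original answer:

```r
# Hand-built stand-in for the `df` produced above (subset of its columns).
df <- data.frame(
  "Transcript Id" = c("ENST00000286800", "ENST00000286800", "ENST00000345080"),
  "UTR3"          = c("21:30717114-30717142", "21:30717414-30717442",
                      "6:105526681-105526709"),
  "MRE_score"     = c("0.05994568", "0.13591267", "0.133514751"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# The regex step yields text (or factors in older R versions), so go through
# as.character() before converting the score to a number.
df$MRE_score <- as.numeric(as.character(df$MRE_score))

# ave() returns, for every row, the maximum score within its transcript group.
best <- ave(df$MRE_score, df[["Transcript Id"]], FUN = max)

# Keep rows whose score equals their group maximum (ties keep all tied rows).
keep <- df[df$MRE_score == best, ]
rownames(keep) <- NULL
keep
```

If exactly one row per transcript is required even when scores tie, a rank with `ties.method = "first"` can be used in place of the equality test.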

Answer 2:

Using this test data:

Lines <- "
Transcript Id   Gene Id(name)   Mirna Name  miTG score
ENST00000286800 ENSG00000156273 (BACH1) hsa-let-7a-5p   1
UTR3    21:30717114-30717142    0.05994568
UTR3    21:30717414-30717442    0.13591267
ENST00000345080 ENSG00000187772 (LIN28B)    hsa-let-7a-5p   1
UTR3    6:105526681-105526709   0.133514751"

Read it all in and set the names, nms, for the output. Then calculate the grouping vector, cs, using a cumulative sum. Within each group, the non-duplicated row is the first (ENST) row and the duplicated rows are the UTR3 rows that follow it. Merge these two sets of rows by group and extract the row with the highest MRE_score in each group:

DF <- read.table(text = Lines, header = TRUE, fill = TRUE, as.is = TRUE,
          check.names = FALSE)
nms <- c("cs", names(DF)[1:5], "UTR3", "MRE_score") # out will have these names
DF$cs <- cumsum(!is.na(DF$Mirna)) # groups each ENST row with its UTR3 rows
dup <- duplicated(DF$cs) # FALSE for ENST rows and TRUE for UTR3 rows
both <- merge(DF[!dup, ], DF[dup, ], by = "cs")[c(1:6, 11:12)]  # merge ENST & UTR3 rows
names(both) <- nms
both$MRE_score <- as.numeric(both$MRE_score)
Rank <- function(x) rank(x, ties.method = "first")
out <- both[ave(-both$MRE_score, both$cs, FUN = Rank) == 1, -1] # only keep largest score

Here we get:

> out
       Transcript              Id     Gene      Id(name) Mirna                  UTR3 MRE_score
2 ENST00000286800 ENSG00000156273  (BACH1) hsa-let-7a-5p     1  21:30717414-30717442 0.1359127
3 ENST00000345080 ENSG00000187772 (LIN28B) hsa-let-7a-5p     1 6:105526681-105526709 0.1335148

Note that the question refers to a CDS column, but it is not described and does not appear in the example output, so we ignored it.
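The cumulative-sum grouping used in this answer can be seen in isolation on a toy vector. This is only an illustrative sketch: `is_header` is a made-up marker standing in for `!is.na(DF$Mirna)`, with TRUE on the lines that start a new record.

```r
# Toy marker vector: TRUE where a line starts a new record (an ENST line),
# FALSE for the UTR3 lines that belong to the record above it.
is_header <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

# Each TRUE increments the running total, so every row inherits the
# index of the most recent header row.
cs <- cumsum(is_header)
cs
# [1] 1 1 1 2 2

# The first row of each group is not a duplicate (the header row);
# the remaining rows of the group are flagged TRUE (the detail rows).
dup <- duplicated(cs)
dup
# [1] FALSE  TRUE  TRUE FALSE  TRUE
```

Splitting the rows on `dup` and merging the two halves on `cs` is what attaches each header row to its own detail rows.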

Source: How to build data matrix from mixed and messy CSV file?

Tags

matrix