Extract and paste together multiple columns of a data frame like object using a vector of column names

Posted by 淺唱寂寞╮ on 2019-12-13 09:34:04

Question


I have an object (variable rld) which looks a bit like a "data.frame" (see further down the post for details) in that it has columns that can be accessed using $ or [[]].

I have a vector groups containing names of some of its columns (3 in example below).

I generate strings based on combinations of elements in the columns as follows:

paste(rld[[groups[1]]], rld[[groups[2]]], rld[[groups[3]]], sep="-")

I would like to generalize this so that I don't need to know how many elements are in groups.

The following attempt fails:

> paste(rld[[groups]], collapse="-")
Error in normalizeDoubleBracketSubscript(i, x, exact = exact, error.if.nomatch = FALSE) : 
  attempt to extract more than one element

Here is how I would do it in functional style with a Python dictionary:

map("-".join, zip(*map(rld.get, groups)))

Is there a similar column-getter operator in R?


As suggested in the comments, here is the output of dput(rld): http://paste.ubuntu.com/23528168/ (I could not paste it directly, since it is huge.)

This was generated using the DESeq2 bioinformatics package and, more precisely, by doing something similar to what is described on page 28 of this document: https://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf.

DESeq2 can be installed from bioconductor as follows:

source("https://bioconductor.org/biocLite.R")
biocLite("DESeq2")

Reproducible example

One of the solutions worked when running in interactive mode, but failed when the code was put in a library function, with the following error:

Error in do.call(function(...) paste(..., sep = "-"), colData(rld)[groups]) : 
  second argument must be a list

After some tests, it appears that the problem doesn't occur if the function is in the main calling script, as follows:

library(DESeq2)
library(test.package)

lib_names <- c(
    "WT_1",
    "mut_1",
    "WT_2",
    "mut_2",
    "WT_3",
    "mut_3"
)
file_names <- paste(
    lib_names,
    "txt",
    sep="."
)

wt <- "WT"
mut <- "mut"
genotypes <- rep(c(wt, mut), times=3)
replicates <- c(rep("1", times=2), rep("2", times=2), rep("3", times=2))

sample_table = data.frame(
    lib = lib_names,
    file_name = file_names,
    genotype = genotypes,
    replicate = replicates
)

dds_raw <- DESeqDataSetFromHTSeqCount(
    sampleTable = sample_table,
    directory = ".",
    design = ~ genotype
    )

# Remove genes with too few read counts
dds <- dds_raw[ rowSums(counts(dds_raw)) > 1, ]
dds$group <- factor(dds$genotype)
design(dds) <- ~ replicate + group
dds <- DESeq(dds)

test_do_paste <- function(dds) {
    require(DESeq2)
    groups <- head(colnames(colData(dds)), -2)
    rld <- rlog(dds, blind=FALSE)
    stopifnot(all(groups %in% names(colData(rld))))
    combined_names <- do.call(
        function (...) paste(..., sep = "-"),
        colData(rld)[groups]
    )
    print(combined_names)
}

test_do_paste(dds)
# This fails (with the same function put in a package)
#test.package::test_do_paste(dds)

The error occurs when the function is packaged as in https://hilaryparker.com/2014/04/29/writing-an-r-package-from-scratch/

Data used in the example:

  • WT_1.txt

  • WT_2.txt

  • WT_3.txt

  • mut_1.txt

  • mut_2.txt

  • mut_3.txt

I posted this issue as a separate question: do.call error "second argument must be a list" with S4Vectors when the code is in a library

Although I have an answer to my initial question, I'm still interested in alternative solutions for the "column extraction using a vector of column names" issue.


Answer 1:


We may use either of the following:

do.call(function (...) paste(..., sep = "-"), rld[groups])
do.call(paste, c(rld[groups], sep = "-"))

We can consider a small, reproducible example:

rld <- mtcars[1:5, ]
groups <- names(mtcars)[c(1,3,5,6,8)]
do.call(paste, c(rld[groups], sep = "-"))
#[1] "21-160-3.9-2.62-0"     "21-160-3.9-2.875-0"    "22.8-108-3.85-2.32-1" 
#[4] "21.4-258-3.08-3.215-1" "18.7-360-3.15-3.44-0"
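
To see why this works: do.call() passes each element of the list as a separate argument, so the call above amounts to writing out paste(col1, col2, ..., sep = "-") by hand. Here is a minimal sketch on a made-up data frame (the name df and its values are assumptions, used only for illustration):

```r
# Toy data standing in for rld; the values are made up for illustration
df <- data.frame(
    genotype = c("WT", "mut", "WT"),
    replicate = c("1", "1", "2"),
    batch = c("a", "a", "b"),
    stringsAsFactors = FALSE
)
groups <- c("genotype", "replicate", "batch")

# Hand-written version from the question (needs to know length(groups))
by_hand <- paste(df[[groups[1]]], df[[groups[2]]], df[[groups[3]]], sep = "-")

# General version: the selected columns become the arguments of paste()
general <- do.call(paste, c(df[groups], sep = "-"))

print(general)
# Both forms give the same character vector
stopifnot(identical(by_hand, general))
```

One caveat: the column names are passed along as argument names, so a column called sep or collapse would clash with paste()'s own arguments. Wrapping the list in unname() avoids that.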

Note that it is your responsibility to ensure that all(groups %in% names(rld)) is TRUE; otherwise you get a "subscript out of bounds" or "undefined columns selected" error.
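
If you would rather have that check built in, something along these lines could work. This is only a sketch: paste_checked is a hypothetical helper name, not a function from any package, and it assumes that names(x) returns the column names, as it does for a data.frame.

```r
# Hypothetical helper: paste the selected columns together, failing early
# with a readable message if a requested name is missing
paste_checked <- function(x, groups, sep = "-") {
    missing_cols <- setdiff(groups, names(x))
    if (length(missing_cols) > 0) {
        stop("unknown column(s): ", paste(missing_cols, collapse = ", "))
    }
    # unname() so that column names are not mistaken for paste() arguments
    do.call(paste, c(unname(as.list(x[groups])), sep = sep))
}

df <- data.frame(a = 1:2, b = c("x", "y"), stringsAsFactors = FALSE)
print(paste_checked(df, c("a", "b")))

# A misspelled name is reported by the helper's own error message
res <- tryCatch(paste_checked(df, c("a", "bb")), error = conditionMessage)
print(res)
```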


(I am copying your comment as a follow-up)

It seems the methods you propose don't work directly on my object. However, the package I'm using provides a colData function that makes something more similar to a data.frame:

> class(colData(rld))
[1] "DataFrame"
attr(,"package")
[1] "S4Vectors"

do.call(function (...) paste(..., sep = "-"), colData(rld)[groups]) works, but do.call(paste, c(colData(rld)[groups], sep = "-")) fails with an error message I fail to understand (as too often with R...):

> do.call(paste, c(colData(rld)[groups], sep = "-"))
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘mcols’ for signature ‘"character"’
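
One way to sidestep this, sketched with base R only, is to fetch each column with [[ (which the question says rld supports) and collect the results in a plain list before handing them to paste(). This is the closest analogue of map(rld.get, groups) in the Python snippet. My guess is that the error comes from c() on an S4Vectors DataFrame not producing a plain list, but I have not verified that. Note that paste_by_name is a hypothetical helper name, and the plain list below only stands in for rld.

```r
# Hypothetical helper: works on anything that supports x[["name"]]
paste_by_name <- function(x, groups, sep = "-") {
    # One [[ call per name, like map(rld.get, groups) in Python
    cols <- lapply(groups, function(g) x[[g]])
    # cols is an unnamed plain list, which is what base do.call() expects
    do.call(paste, c(cols, sep = sep))
}

# Stand-in for rld: a plain list with made-up values
x <- list(
    genotype = c("WT", "mut", "WT", "mut"),
    replicate = c("1", "1", "2", "2"),
    other = c(10, 20, 30, 40)
)
groups <- c("genotype", "replicate")

combined <- paste_by_name(x, groups)
print(combined)
```

Because the list is built by lapply(), no c() or [ method of the original object is involved, and the number of names in groups does not need to be known in advance.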


Source: https://stackoverflow.com/questions/40789601/extract-and-paste-together-multiple-columns-of-a-data-frame-like-object-using-a
