Question
I'd like to collect the terms found in multiple columns of the annot data.frame.
Below is the first row of a toy dataset for annot.
colnames(annot)
# [1] "HUGO.Name" "Common.Name" "Gene.Class" "Cell.Type" "Annotation"
annot[1,]
# HUGO.Name Common.Name Gene.Class Cell.Type
# 1 CCL1 CCL1 Immune Response - Cell Type specific aDC
# Annotation
# 1 Cell Type specific, Chemokines and receptors, Inflammatory response
So far, I've been writing out the colnames one at a time, but I'd like to learn how to write a function to loop through all columns of annot (and, more generally, other data.frames).
This is my manual approach:
yA <- unique(str_trim(unlist(strsplit(annot[, "Annotation"], ","))))
yC <- unique(str_trim(unlist(strsplit(annot[, "Cell.Type"], ","))))
yA
# [1] "Cell Type specific" "Chemokines and receptors"
# [3] "Inflammatory response" "Cytokines and receptors"
# [5] "Chronic inflammatory response" "Th2 orientation"
# [7] "T-cell proliferation" "Defense response to virus"
# [9] "B-cell receptor signaling pathway" "CD molecules"
# [11] "Regulation of immune response" "Adaptive immune response"
# [13] "Antigen processing and presentation"
How can I construct a function "y" to simplify this process? I've tried the following:
y <- function (i,n) {unique(str_trim(unlist(strsplit(i[, as.name(n)], ","))))}
However, I get an error when I try to use this function.
yA <- y(annot, Annotation)
# Error in .subset(x, j) : invalid subscript type 'symbol'
# Called from: `[.data.frame`(i, , as.name(n))
What I intend is to use the output of yA and yC to make lists as follows:
# look up associated HUGO.Name per each term of yA
for (i in yA) {
eval(call("<-", as.name(i),
annot[grepl(i, annot[,"Annotation"], fixed =T), "HUGO.Name"]))
}
# make lists
nSannot_list<- mget(yA)
Answer 1:
Let's assume you're starting with something like this as your data.frame:
mydf <- data.frame(
v1 = c("A, B, B", "A, C,D"),
v2 = c("E, F", " G,H , E, I"),
v3 = c("J,K,L,M", "N, J, L, M, K"))
mydf
# v1 v2 v3
# 1 A, B, B E, F J,K,L,M
# 2 A, C,D G,H , E, I N, J, L, M, K
One way you can define your function would be like the following. I've stuck to base functions, but you can use "stringr" if you prefer.
myFun <- function(instring) {
if (!is.character(instring)) instring <- as.character(instring)
unique(trimws(unlist(strsplit(instring, ",", fixed = TRUE))))
}
The first line checks whether the input is a character vector and converts it if it is not. In R versions before 4.0.0, data.frame() and read.table() use stringsAsFactors = TRUE by default, so columns often arrive as factors and need that conversion first. The second line does the splitting and trimming. I've added fixed = TRUE so that the comma is matched literally rather than as a regular expression, which is more efficient.
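To illustrate why the conversion matters, here is a minimal sketch (the vectors `as_text` and `as_fact` are made up for illustration). It assumes only base R and repeats `myFun` from above so that it runs on its own.

```r
# Same definition as in the answer, repeated so the sketch is self-contained
myFun <- function(instring) {
  if (!is.character(instring)) instring <- as.character(instring)
  unique(trimws(unlist(strsplit(instring, ",", fixed = TRUE))))
}

as_text <- c("A, B, B", "A, C,D")   # illustrative character input
as_fact <- factor(as_text)          # the same values stored as a factor

# strsplit() only accepts character input, so calling it directly on a
# factor raises an error; capture the message instead of stopping
split_error <- tryCatch(
  strsplit(as_fact, ",", fixed = TRUE),
  error = function(e) conditionMessage(e)
)

from_text <- myFun(as_text)
from_fact <- myFun(as_fact)   # converted to character inside myFun

from_text
# [1] "A" "B" "C" "D"
identical(from_text, from_fact)
# [1] TRUE
```

Because the check lives inside the function, the same call works whether or not the column was read in as a factor.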
Once you have such a function, you can easily apply it using apply (for a data.frame or a matrix, either by row or by column) or using lapply (for a list or a data.frame, which would be by column).
## If `mydf` is a data.frame, and you want to go by columns
lapply(mydf, myFun)
# $v1
# [1] "A" "B" "C" "D"
#
# $v2
# [1] "E" "F" "G" "H" "I"
#
# $v3
# [1] "J" "K" "L" "M" "N"
## `apply` can be used too. Second argument specifies whether by row or column
apply(mydf, 1, myFun)
apply(mydf, 2, myFun)
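As a rough sketch of what the two apply calls return on this toy data: apply first converts the data.frame to a matrix, then hands each row or column to the function. The names `by_col` and `by_row` are only for illustration, and `mydf` and `myFun` are repeated so the example stands alone.

```r
mydf <- data.frame(
  v1 = c("A, B, B", "A, C,D"),
  v2 = c("E, F", " G,H , E, I"),
  v3 = c("J,K,L,M", "N, J, L, M, K"),
  stringsAsFactors = FALSE)

myFun <- function(instring) {
  if (!is.character(instring)) instring <- as.character(instring)
  unique(trimws(unlist(strsplit(instring, ",", fixed = TRUE))))
}

# By column: one set of unique terms per column, as with lapply()
by_col <- apply(mydf, 2, myFun)
by_col[["v2"]]
# [1] "E" "F" "G" "H" "I"

# By row: the terms from all columns of a row are pooled together
by_row <- apply(mydf, 1, myFun)
by_row[[1]]
# [1] "A" "B" "E" "F" "J" "K" "L" "M"
```

Both calls return a list here because the results differ in length. For collecting terms per column, lapply is usually the more direct choice since it skips the matrix conversion.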
If, on the other hand, you are looking to create a function that accepts the input dataset name and the (bare, unquoted) column, you can write your function like this:
myOtherFun <- function(indf, col) {
col <- deparse(substitute(col))
unique(trimws(unlist(strsplit(as.character(indf[, col]), ",", fixed = TRUE))))
}
The first line captures the bare column name as a character string so that it can be used in the typical my_data[, "col_wanted"] form.
Here's the function in use:
myOtherFun(mydf, v2)
# [1] "E" "F" "G" "H" "I"
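If quoting the column name is acceptable, a simpler variant is possible. The sketch below is one possible approach, not part of the original answer: `split_terms`, `annot_toy`, and `genes_by_term` are hypothetical names, and the gene names other than CCL1 are placeholders. It passes the column as a string, which avoids the 'symbol' subscript error from the question, and builds the term-to-gene list with lapply rather than eval and mget.

```r
# Hypothetical stand-in for `annot`; GENE_B and GENE_C are placeholder names
annot_toy <- data.frame(
  HUGO.Name  = c("CCL1", "GENE_B", "GENE_C"),
  Annotation = c("Chemokines and receptors, Inflammatory response",
                 "Chemokines and receptors",
                 "CD molecules, B-cell receptor signaling pathway"),
  stringsAsFactors = FALSE
)

# `col` is a character string; `[[` extracts that column as a plain vector
split_terms <- function(df, col) {
  unique(trimws(unlist(strsplit(as.character(df[[col]]), ",", fixed = TRUE))))
}

terms <- split_terms(annot_toy, "Annotation")

# For each term, collect the HUGO names of the rows whose annotation contains it
genes_by_term <- lapply(terms, function(term) {
  annot_toy$HUGO.Name[grepl(term, annot_toy$Annotation, fixed = TRUE)]
})
names(genes_by_term) <- terms

genes_by_term[["Chemokines and receptors"]]
# [1] "CCL1"   "GENE_B"
```

Keeping the results in a named list means no variables are created in the workspace, so terms containing spaces or punctuation need no special handling.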
Source: https://stackoverflow.com/questions/34262313/constructing-a-function-using-colnames-as-variables