fit model to multiple groupings or subsets and extract original factor columns for data frame output

生来就可爱ヽ(ⅴ<●) 提交于 2019-12-11 11:13:53

问题


I want to fit models and pull out specific parameters split by grouping factors (fac1 and fac2 below) or subsets. My problem is that when sapply outputs the correct parameters, I'm stuck with a list where the elements are named as combinations. What I want to get is a data.frame where I have a column for each factor with the appropriate label. I want to do this in base R.

Notice, the answer needs to be general and not for the specific names used in this case. The answer shouldn't be hindered if factor names include 'periods.' I'm eventually making something to use with any data, so this answer needs to do so, and also with any number of factors. I am actually using a custom function on a much larger data set but this example represents my issue.

Following is reproducible code:

#create data
fac1 <- c(rep("A", 10), rep("B",10))
fac2 <- rep(c(rep("X", 5), rep("Y",5)),2)
x <- rep(1:5,4)
set.seed(1337)
y <- rep(seq(2, 10, 2), 4) * runif(20, .8, 1.2)

xy <- data.frame(x,y) #bind parameters for regression

factors <- list(fac1, fac2) #split by 2 factors

sapply(split(xy, factors), function(c) coef(lm(c$y~c$x))[2]) 
#run regression by these 4 groups, pull out slope

The output is:

A.X.c$x  B.X.c$x  A.Y.c$x  B.Y.c$x 
1.861290 2.131431 1.590733 1.746169

What I want is:

fac1 fac2 slope
A    X    1.861290 
B    X    2.131431 
A    Y    1.590733 
B    Y    1.746169

The following code might be made to be more general to accomplish this, but I'm worried about cases where expand.grid makes all possible combinations but the user has missing combinations in their data, and also whether the order will stay the same. Does expand.grid use a similar method as however split subsets the data that determines the order of the returned values?

slopes <- sapply(split(xy, factors), function(c) coef(lm(c$y~c$x))[2]) 

dataframeplz <- as.data.frame(expand.grid(unique(fac1), unique(fac2))) 

dataframeplz$slope <- slopes

dataframeplz

Here is the plyr solution if that helps. It's so easy but not base R. Anyone know where in Hadley's code this magic happens? Githubbers?

library("plyr")
neatdata <- data.frame(fac1,fac2,x,y)
ddply(neatdata, c("fac1", "fac2"), function(c) coef(lm(c$y~c$x))[2])

回答1:


For base R, aggregate is the user friendly function for such situations.

aggregate(cbind(slope=1:nrow(xy))~fac1+fac2,FUN=function(r) coef(lm(y~x,data=xy[r,]))[2])
  fac1 fac2    slope
1    A    X 1.861290
2    B    X 2.131431
3    A    Y 1.590733
4    B    Y 1.746169

This could also be done with by in a fashion a bit more similar to your original.

setNames(as.data.frame.table(
  by(xy,list(fac1,fac2),FUN=function(c) coef(lm(c$y~c$x))[2])),
  c("fac1","fac2","slope"))



回答2:


I used base R and I focused on your specific example. This process has limitations as it handles column names as strings and keeps the appropriate info you need.

#create data
fac1 <- c(rep("A", 10), rep("B",10))
fac2 <- rep(c(rep("X", 5), rep("Y",5)),2)
x <- rep(1:5,4)
set.seed(1337)
y <- rep(seq(2, 10, 2), 4) * runif(20, .8, 1.2)

xy <- data.frame(x,y) #bind parameters for regression

factors <- list(fac1, fac2) #split by 2 factors

dt_res = sapply(split(xy, factors), function(c) coef(lm(c$y~c$x))[2]) #run regression by these 4 groups, pull out slope

dt_res

# A.X.c$x  B.X.c$x  A.Y.c$x  B.Y.c$x 
# 1.861290 2.131431 1.590733 1.746169


dt_res = data.frame(dt_res)
dt_res = data.frame(names=rownames(dt_res),   # save the names as a column
                    slope=dt_res$dt_res,
                    row.names = NULL)

dt_res$names = gsub(".c[$]x","",dt_res$names)  # delete the unecessary characters from names
dt_res$fac1 = substr(dt_res$names,1,1)       # pick first character
dt_res$fac2 = substr(dt_res$names,3,3)       # pick 3rd character
dt_res[,c("fac1","fac2","slope")]

#    fac1 fac2    slope
# 1    A    X 1.861290
# 2    B    X 2.131431
# 3    A    Y 1.590733
# 4    B    Y 1.746169

I've updated it to something more general:

  #create data
fac1 <- c(rep("A", 10), rep("B",10))
fac2 <- rep(c(rep("X", 5), rep("Y",5)),2)
x <- rep(1:5,4)
set.seed(1337)
y <- rep(seq(2, 10, 2), 4) * runif(20, .8, 1.2)

xy <- data.frame(x,y) #bind parameters for regression

factors <- list(fac1, fac2) #split by 2 factors

res = sapply(split(xy, factors), function(c) coef(lm(c$y~c$x))[2]) #run regression by these 4 groups, pull out slope

# split names by . (that will be the split symbol always)
    names = strsplit(names(split(xy, factors)), split ="[.]")

# create empty vectors to store names
fac1 = vector()
fac2 = vector()

for (i in 1:length(names)){

# iterate through the list of names and keep values from the corresponding position
  fac1 = c(fac1, names[[i]][1])
  fac2 = c(fac2, names[[i]][2])
}


data.frame(fac1,
           fac2,
           slope = res,
           row.names = NULL)



回答3:


A. Webb's answer is more elegant, but this lapply/arbitrary function/do.call/rbind workflow has been my last resort for this kind of thing for years:

# Move the factors inside your data frame, so they'll be available after the split()
xy <- data.frame(x, y, fac1, fac2)

# Iterate over the split
reglist <- lapply(split(xy, factors), FUN = function(group) {

    # Get the current factor levels
    group_levels <- unique(group[c("fac1", "fac2")])

    # Mash it all into a data.frame
    data.frame(group_levels, slope = coef(lm(y ~ x, data = group))[2])

})

# Collapse the list into a data.frame
do.call("rbind", reglist)


来源:https://stackoverflow.com/questions/32119184/fit-model-to-multiple-groupings-or-subsets-and-extract-original-factor-columns-f

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!