linear-regression

Place results of predict() in a for loop inside a list

可紊 submitted on 2020-01-05 05:43:12
Question: Let us say I want to run a linear regression model on the mtcars dataset several times, each time on a different sample. The idea is to store, for each iteration of a for loop, the results of predict() from the regression fitted on that iteration's sample. A small example for one run follows:

## Perform model once on a Sample and use model on full dataset:
Sample_Size <- 10
Sample <- mtcars[sample(nrow(mtcars), Sample_Size), ]
Model <- lm(formula = mpg ~ wt, data = Sample)

Looping through many multiple regressions

点点圈 submitted on 2020-01-05 03:51:11
Question: I am trying to run the code from this post: "looping with iterations over two lists of variables for a multiple regression in R", with modified variable and data frame names, because it seems to do exactly what I want and uses a very similar dataset. However, it keeps giving me an error and I don't know why, so I would really appreciate it if someone could help me understand the error or the line of code that causes it, so I can try to figure out what's wrong.

for(i in 1:n) { vars = names

How to make group_by and lm fast?

限于喜欢 submitted on 2020-01-04 21:33:34
Question: This is a sample.

df <- tibble(
  subject = rep(letters[1:7], c(5, 6, 7, 5, 2, 5, 2)),
  day = c(3:7, 2:7, 1:7, 3:7, 6:7, 3:7, 6:7),
  x1 = runif(32),
  x2 = rpois(32, 3),
  x3 = rnorm(32),
  x4 = rnorm(32, 1, 5))

df %>%
  group_by(subject) %>%
  summarise(
    coef_x1 = lm(x1 ~ day)$coefficients[2],
    coef_x2 = lm(x2 ~ day)$coefficients[2],
    coef_x3 = lm(x3 ~ day)$coefficients[2],
    coef_x4 = lm(x4 ~ day)$coefficients[2])

This data is small, so performance is not a problem. But my data is very large, approximately 1,000

How to drop insignificant categorical interaction terms Python StatsModel

别说谁变了你拦得住时间么 submitted on 2020-01-04 12:56:32
Question: In statsmodels it's easy to add an interaction term. However, not all of the interactions are significant. My question is how to drop those that are insignificant, for example airport at Kootenay.

# -*- coding: utf-8 -*-
import pandas as pd
import statsmodels.formula.api as sm

if __name__ == "__main__":
    # Read data
    census_subdivision_without_lower_mainland_and_van_island = pd.read_csv('../data/augmented/census_subdivision_without_lower_mainland_and_van_island.csv')
    # Fit all data
    fit = sm.ols

Broken stick (or piecewise) regression with 2 breakpoints

孤街醉人 submitted on 2020-01-04 05:47:08
Question: I want to estimate two breakpoints of a function with the following data:

df = data.frame (x = 1:180, y = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 2, 2, 4, 2, 2, 3, 2, 1, 2,0, 1, 0, 1, 4, 0, 1, 2, 3, 1, 1, 1, 0, 2, 0, 3, 2, 1, 1, 1, 1, 5, 4, 2, 1, 0, 2, 1, 1, 2, 0, 0, 2, 2, 1, 1, 1, 0, 0, 0, 0, 2, 3, 0, 3, 2, 0, 0, 0, 0, 0, 0, 0,0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

looping with iterations over two lists of variables for a multiple regression in R

一笑奈何 submitted on 2020-01-04 02:20:16
Question: I want to write a loop in R to run multiple regressions with one dependent variable and two lists of independent variables (all continuous). The model is additive, and the loop should iterate through the two lists in parallel: it takes the first column from the first list plus the first column from the second list, then the second column from each list, and so on. The problem is that I can't get it to iterate through the lists properly; instead my loop runs more

Simple linear regression using pandas dataframe

僤鯓⒐⒋嵵緔 submitted on 2020-01-03 11:39:07
Question: I'm looking to check trends for a number of entities (SysNr). I have data spanning 3 years (2014, 2015, 2016). I'm looking at a large number of variables, but will limit this question to one ('res_f_r'). My DataFrame looks something like this:

d = [
    {'RegnskabsAar': 2014, 'SysNr': 1, 'res_f_r': 350000},
    {'RegnskabsAar': 2015, 'SysNr': 1, 'res_f_r': 400000},
    {'RegnskabsAar': 2016, 'SysNr': 1, 'res_f_r': 450000},
    {'RegnskabsAar': 2014, 'SysNr': 2, 'res_f_r': 350000},
    {'RegnskabsAar': 2015, 'SysNr':
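One way to read "checking trends" here is to fit a straight line of 'res_f_r' against 'RegnskabsAar' separately for each 'SysNr' and compare the slopes. The sketch below is only an illustration under that assumption: it uses the four complete rows visible in the excerpt, plain Python instead of pandas, and two hypothetical helper names (ols_slope, slopes_by_entity) that do not come from the original post. An entity with fewer than two observations gets None rather than a slope.

```python
from collections import defaultdict

# The four complete rows visible in the excerpt; the rest of the data is cut off.
rows = [
    {'RegnskabsAar': 2014, 'SysNr': 1, 'res_f_r': 350000},
    {'RegnskabsAar': 2015, 'SysNr': 1, 'res_f_r': 400000},
    {'RegnskabsAar': 2016, 'SysNr': 1, 'res_f_r': 450000},
    {'RegnskabsAar': 2014, 'SysNr': 2, 'res_f_r': 350000},
]


def ols_slope(xs, ys):
    """Least-squares slope of ys on xs, or None if it is not defined."""
    n = len(xs)
    if n < 2:
        return None  # a single point has no trend
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return None  # all x values identical: slope undefined
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def slopes_by_entity(records, key='SysNr', x='RegnskabsAar', y='res_f_r'):
    """Group records by `key` and return {entity: slope of y on x}."""
    groups = defaultdict(lambda: ([], []))
    for rec in records:
        xs, ys = groups[rec[key]]
        xs.append(rec[x])
        ys.append(rec[y])
    return {entity: ols_slope(xs, ys) for entity, (xs, ys) in groups.items()}


print(slopes_by_entity(rows))  # {1: 50000.0, 2: None}
```

With only three yearly points per entity the slope is a very rough trend indicator. The same per-group computation could be applied to a DataFrame after grouping by 'SysNr', which is left out here to keep the sketch dependency-free.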

iterating over formulas in purrr

妖精的绣舞 submitted on 2020-01-03 05:06:46
Question: I have a bunch of formulas, as strings, that I'd like to use one at a time in a glm, preferably using tidyverse functions. Here's where I am now.

library(tidyverse)
library(broom)
mtcars %>% dplyr::select(mpg:qsec) %>% colnames -> targcols
paste('vs ~ ', targcols) -> formulas
formulas
#> 'vs ~ mpg' 'vs ~ cyl' 'vs ~ disp' 'vs ~ hp' 'vs ~ drat' 'vs ~ wt' 'vs ~ qsec'

I can run a generalized linear model with any one of these formulas as

glm(as.formula(formulas[1]), family = 'binomial', data =

how to merge two linear regression prediction models (each per data frame's subset) into one column of the data frame

六眼飞鱼酱① submitted on 2020-01-03 04:25:27
Question: I would like to build 2 linear regression models, each based on a different subset of the dataset, and then have one column that contains the predicted values for each subset. Here is an example of my data frame:

dat <- read.table(text = "
cats birds wolfs snakes
0 3 8 7
1 3 8 7
1 1 2 3
0 1 2 3
0 1 2 3
1 6 1 1
0 6 1 1
1 6 1 1
", header = TRUE)

First I have built two models:

# one is for wolfs ~ snakes where cats=0
f0<-lm(wolfs~snakes,data=dat,subset=dat$cats==0)
#the second model is for wolfs ~ snakes

Method to find “cleanest” subset of data i.e. subset with lowest variability

大憨熊 submitted on 2020-01-03 02:08:27
Question: I am trying to find a trend in several datasets. The trends involve finding the best-fit line, but I imagine the procedure would not be too different for any other model (just possibly more time-consuming). There are 3 conceivable scenarios:

All good data: all the data fits a single trend with low variability.
All bad data: all or most of the data exhibits tremendous variability and the entire dataset must be discarded.
Partial good data: some of the data may be good while