regression

Why is the built-in lm function so slow in R?

…衆ロ難τιáo~ submitted on 2019-11-26 14:22:39
Question: I always thought that the lm function was extremely fast in R, but as this example suggests, the closed-form solution computed with the solve function is much faster. data<-data.frame(y=rnorm(1000),x1=rnorm(1000),x2=rnorm(1000)) X = cbind(1,data$x1,data$x2) library(microbenchmark) microbenchmark( solve(t(X) %*% X, t(X) %*% data$y), lm(y ~ .,data=data)) Can someone tell me whether this toy example is simply a bad example, or whether lm is actually slow? EDIT: As suggested by Dirk
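
The gap is mostly interface overhead rather than the maths: lm() parses the formula, builds a model frame and fits via a QR decomposition, which is more robust than the normal equations but slower. A minimal sketch, using the same simulated data as above, comparing lm() with the lower-level lm.fit() and .lm.fit() and with normal-equation solves:

# Sketch: lm() vs. lower-level fitting routines on the same data.
set.seed(1)
data <- data.frame(y = rnorm(1000), x1 = rnorm(1000), x2 = rnorm(1000))
X <- cbind(1, data$x1, data$x2)

library(microbenchmark)
microbenchmark(
  lm        = lm(y ~ ., data = data),                    # full formula/model-frame interface
  lm.fit    = lm.fit(X, data$y),                         # skips formula handling, still QR
  .lm.fit   = .lm.fit(X, data$y),                        # bare-bones C-level fit
  solve     = solve(t(X) %*% X, t(X) %*% data$y),        # normal equations, as in the question
  crossprod = solve(crossprod(X), crossprod(X, data$y))  # normal equations without explicit transpose
)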

Adding a regression line to a ggplot

跟風遠走 submitted on 2019-11-26 12:07:50
Question: I'm trying hard to add a regression line to a ggplot. I first tried with abline but I didn't manage to make it work. Then I tried this... data = data.frame(x.plot=rep(seq(1,5),10),y.plot=rnorm(50)) ggplot(data,aes(x.plot,y.plot))+stat_summary(fun.data=mean_cl_normal) + geom_smooth(method='lm',formula=data$y.plot~data$x.plot) But that is not working either. Answer 1: In general, to provide your own formula you should use arguments x and y that will correspond to the values you provided in ggplot() -
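
The usual fix is that geom_smooth()'s formula argument must be written in terms of the aesthetics x and y, not the original column names. A minimal sketch along those lines (mean_cl_normal additionally requires the Hmisc package):

library(ggplot2)
set.seed(1)
data <- data.frame(x.plot = rep(seq(1, 5), 10), y.plot = rnorm(50))

# The formula refers to the mapped aesthetics, so it is y ~ x,
# regardless of what the columns are called in the data frame.
ggplot(data, aes(x = x.plot, y = y.plot)) +
  stat_summary(fun.data = mean_cl_normal) +   # needs Hmisc installed
  geom_smooth(method = "lm", formula = y ~ x)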

Specifying a formula for glm in R without explicitly declaring each covariate

蓝咒 submitted on 2019-11-26 12:01:43
Question: I would like to force specific variables into glm regressions without fully specifying each one. My real data set has ~200 variables. I haven't been able to find examples of this in my online searching so far. For example (with just 3 variables): n=200 set.seed(39) samp = data.frame(W1 = runif(n, min = 0, max = 1), W2=runif(n, min = 0, max = 5)) samp = transform(samp, # add A A = rbinom(n, 1, 1/(1+exp(-(W1^2-4*W1+1))))) samp = transform(samp, # add Y Y = rbinom(n, 1,1/(1+exp(-(A-sin(W1^2)
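
One common pattern is to build the formula programmatically from a character vector of column names, e.g. with reformulate(). A short sketch using a simplified version of the data above; the chosen subset of predictor names is hypothetical:

set.seed(39)
n <- 200
samp <- data.frame(W1 = runif(n, 0, 1), W2 = runif(n, 0, 5))
samp$A <- rbinom(n, 1, 1 / (1 + exp(-(samp$W1^2 - 4 * samp$W1 + 1))))
samp$Y <- rbinom(n, 1, 1 / (1 + exp(-(samp$A - sin(samp$W1^2)))))  # simplified stand-in for the truncated line above

# Option 1: use everything else as predictors.
fit1 <- glm(Y ~ ., data = samp, family = binomial)

# Option 2: build the formula from a vector of names -- useful with ~200 columns.
predictors <- c("A", "W1", "W2")                   # hypothetical subset
form <- reformulate(predictors, response = "Y")    # Y ~ A + W1 + W2
fit2 <- glm(form, data = samp, family = binomial)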

Java-R integration?

﹥>﹥吖頭↗ submitted on 2019-11-26 11:57:30
Question: I have a Java app which needs to perform partial least squares regression (PLSR). It would appear there are no Java implementations of PLSR out there. Weka may have had something like it at some point, but it is no longer in the API. On the other hand, I have found a good R implementation, which has an added bonus: it was used by the people whose results I want to replicate, which means there is less chance that things will go wrong because of differences in the way PLSR is implemented. The
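
One common bridge is Rserve, which lets a Java app push data into a running R session and call the R implementation directly. A minimal sketch, assuming an Rserve server is running locally (in R: library(Rserve); Rserve()) and that the pls package is installed; the package choice, data and variable names are assumptions, since the question does not name the R implementation:

import org.rosuda.REngine.REXP;
import org.rosuda.REngine.Rserve.RConnection;

public class PlsrFromJava {
    public static void main(String[] args) throws Exception {
        RConnection c = new RConnection();
        try {
            // Push predictors and response into the R session.
            c.assign("y",  new double[]{1.0, 2.1, 2.9, 4.2, 5.1, 5.8});
            c.assign("x1", new double[]{1, 2, 3, 4, 5, 6});
            c.assign("x2", new double[]{2, 1, 4, 3, 6, 5});

            // Fit PLSR with the pls package and pull back the fitted values.
            c.voidEval("library(pls)");
            c.voidEval("X <- cbind(x1, x2)");
            c.voidEval("fit <- plsr(y ~ X, ncomp = 2)");
            REXP fitted = c.eval("as.numeric(fitted(fit)[, , 2])");
            for (double v : fitted.asDoubles()) {
                System.out.println(v);
            }
        } finally {
            c.close();
        }
    }
}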

Stepwise regression using p-values to drop variables with nonsignificant p-values

牧云@^-^@ submitted on 2019-11-26 11:54:20
Question: I want to perform a stepwise linear regression using p-values as the selection criterion, e.g. at each step dropping the variable with the highest (i.e. least significant) p-value, and stopping when all remaining p-values fall below some threshold alpha. I am fully aware that I should use the AIC (e.g. the step or stepAIC commands) or some other criterion instead, but my boss has no grasp of statistics and insists on using p-values. If necessary, I could program my own routine, but I am
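
A backward-elimination loop on p-values is straightforward to write by hand. A minimal sketch, assuming purely numeric predictors (factor terms would need their term labels handled separately) and with the usual caveat that AIC-based step() is statistically preferable:

# Drop the least significant predictor until all remaining p-values are <= alpha.
backward_p <- function(form, data, alpha = 0.05) {
  fit <- lm(form, data = data)
  repeat {
    coefs <- summary(fit)$coefficients
    coefs <- coefs[rownames(coefs) != "(Intercept)", , drop = FALSE]
    if (nrow(coefs) == 0 || max(coefs[, 4]) <= alpha) break
    worst <- rownames(coefs)[which.max(coefs[, 4])]          # highest p-value
    fit <- update(fit, as.formula(paste(". ~ . -", worst)))  # refit without it
  }
  fit
}

set.seed(1)
dat <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
summary(backward_p(y ~ ., dat, alpha = 0.05))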

How does predict.lm() compute confidence interval and prediction interval?

微笑、不失礼 submitted on 2019-11-26 11:22:40
I ran a regression: CopierDataRegression <- lm(V1~V2, data=CopierData1) and my task was to obtain a 90% confidence interval for the mean response given V2=6 and a 90% prediction interval when V2=6. I used the following code: X6 <- data.frame(V2=6) predict(CopierDataRegression, X6, se.fit=TRUE, interval="confidence", level=0.90) predict(CopierDataRegression, X6, se.fit=TRUE, interval="prediction", level=0.90) and I got (87.3, 91.9) and (74.5, 104.8), which seems correct since the PI should be wider. The output for both also included se.fit = 1.39, which was the same. I don't understand what
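
The reason se.fit is identical in both calls is that it is the standard error of the estimated mean response; the prediction interval simply adds the residual variance on top of it. A sketch reproducing both intervals by hand, with simulated data standing in for CopierData1 (not available here):

# Simulated stand-in for CopierData1.
set.seed(1)
d <- data.frame(V2 = runif(45, 1, 10))
d$V1 <- 10 + 15 * d$V2 + rnorm(45, sd = 6)
fit <- lm(V1 ~ V2, data = d)

newdata <- data.frame(V2 = 6)
p <- predict(fit, newdata, se.fit = TRUE)

tcrit <- qt(0.95, df = p$df)                                              # two-sided 90% interval
ci <- p$fit + c(-1, 1) * tcrit * p$se.fit                                 # confidence: uncertainty in the mean
pi <- p$fit + c(-1, 1) * tcrit * sqrt(p$se.fit^2 + p$residual.scale^2)    # prediction: plus residual variance

# These match the built-in versions:
predict(fit, newdata, interval = "confidence", level = 0.90)
predict(fit, newdata, interval = "prediction", level = 0.90)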

lme4::lmer reports “fixed-effect model matrix is rank deficient”: do I need a fix, and if so, how?

拜拜、爱过 submitted on 2019-11-26 10:59:35
Question: I am trying to run a mixed-effects model that predicts F2_difference with the rest of the columns as predictors, but I get an error message that says fixed-effect model matrix is rank deficient so dropping 7 columns / coefficients. From this link, Fixed-effects model is rank deficient, I think I should use findLinearCombos from the R package caret. However, when I try findLinearCombos(data.df), it gives me the error message Error in qr.default(object) : NA/NaN/Inf in foreign function call
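
findLinearCombos() expects a purely numeric matrix with no NA/NaN/Inf values, so one workaround is to run it on the fixed-effect design matrix after dropping incomplete rows. A sketch, with data.df and the formula as placeholders for the actual data:

library(caret)

# Build a numeric design matrix: factors become dummy columns, the response is excluded,
# and rows with missing values are dropped (they cause the NA/NaN/Inf error).
X <- model.matrix(~ . - F2_difference, data = na.omit(data.df))

combos <- findLinearCombos(X)
combos$remove                 # indices of linearly dependent columns
colnames(X)[combos$remove]    # the columns/levels behind the coefficients lmer dropped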

scikit-learn cross validation, negative values with mean squared error

不羁的心 submitted on 2019-11-26 10:33:06
Question: When I use the following code with a data matrix X of size (952, 144) and an output vector y of size (952), the mean_squared_error metric returns negative values, which is unexpected. Do you have any idea why? from sklearn.svm import SVR from sklearn import cross_validation as CV reg = SVR(C=1., epsilon=0.1, kernel='rbf') scores = CV.cross_val_score(reg, X, y, cv=10, scoring='mean_squared_error') All values in scores are then negative. Answer 1: Trying to close this out, so am providing the answer that
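
The sign is by design: the unified scoring API always maximizes, so error metrics are reported negated ("greater is better"). A sketch using the current API names (model_selection and 'neg_mean_squared_error' have since replaced the module and scorer used above) and synthetic data in place of the (952, 144) matrix:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

# Synthetic stand-in for the question's data.
X, y = make_regression(n_samples=952, n_features=144, noise=10.0, random_state=0)

reg = SVR(C=1.0, epsilon=0.1, kernel='rbf')
scores = cross_val_score(reg, X, y, cv=10, scoring='neg_mean_squared_error')

mse = -scores   # negate to recover the actual mean squared errors
print(mse.mean(), mse.std())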

Fitting data with numpy

。_饼干妹妹 submitted on 2019-11-26 09:18:29
Question: Let me start by saying that what I get may not be what I expect, and perhaps you can help me here. I have the following data: >>> x array([ 3.08, 3.1 , 3.12, 3.14, 3.16, 3.18, 3.2 , 3.22, 3.24, 3.26, 3.28, 3.3 , 3.32, 3.34, 3.36, 3.38, 3.4 , 3.42, 3.44, 3.46, 3.48, 3.5 , 3.52, 3.54, 3.56, 3.58, 3.6 , 3.62, 3.64, 3.66, 3.68]) >>> y array([ 0.000857, 0.001182, 0.001619, 0.002113, 0.002702, 0.003351, 0.004062, 0.004754, 0.00546 , 0.006183, 0.006816, 0.007362, 0.007844, 0.008207, 0.008474, 0
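
For smooth data like this, a least-squares polynomial fit with numpy.polyfit is the usual starting point. A sketch using the first few points from the question; the degree is an arbitrary choice here and should be checked against the residuals on the full data set:

import numpy as np

x = np.array([3.08, 3.10, 3.12, 3.14, 3.16, 3.18, 3.20, 3.22, 3.24, 3.26])
y = np.array([0.000857, 0.001182, 0.001619, 0.002113, 0.002702,
              0.003351, 0.004062, 0.004754, 0.00546, 0.006183])

coeffs = np.polyfit(x, y, deg=3)   # least-squares fit; highest-degree coefficient first
y_hat = np.polyval(coeffs, x)      # evaluate the fitted polynomial at the data points

print(coeffs)
print(np.max(np.abs(y - y_hat)))   # worst-case residual as a quick fit check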

Find p-value (significance) in scikit-learn LinearRegression

那年仲夏 submitted on 2019-11-26 08:41:07
Question: How can I find the p-value (significance) of each coefficient? lm = sklearn.linear_model.LinearRegression() lm.fit(x,y) Answer 1: This is kind of overkill, but let's give it a go. First let's use statsmodels to find out what the p-values should be: import pandas as pd import numpy as np from sklearn import datasets, linear_model from sklearn.linear_model import LinearRegression import statsmodels.api as sm from scipy import stats diabetes = datasets.load_diabetes() X = diabetes.data y = diabetes
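
scikit-learn's LinearRegression does not expose p-values, so the usual workaround, which the answer above begins to show, is to refit the same model with statsmodels and read them off the OLS summary. A completed sketch of that idea:

import statsmodels.api as sm
from sklearn import datasets

diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target

X_const = sm.add_constant(X)      # statsmodels needs an explicit intercept column
ols = sm.OLS(y, X_const).fit()

print(ols.summary())              # full table: coefficients, t values, p-values
print(ols.pvalues)                # p-values as an array (intercept first)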