regression

Why is the built-in lm function so slow in R?

…衆ロ難τιáo~ submitted on 2019-11-26 14:22:39
Question: I always thought that the lm function was extremely fast in R, but as this example suggests, the closed-form solution computed with the solve function is much faster. data<-data.frame(y=rnorm(1000),x1=rnorm(1000),x2=rnorm(1000)) X = cbind(1,data$x1,data$x2) library(microbenchmark) microbenchmark( solve(t(X) %*% X, t(X) %*% data$y), lm(y ~ .,data=data)) Can someone tell me whether this toy example is simply a bad example, or whether lm is actually slow? EDIT: As suggested by Dirk
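
The gap is mostly interface overhead rather than the maths: lm() parses the formula, builds a model frame and fits via a QR decomposition, which is more robust than the normal equations but slower. A minimal sketch, using the same simulated data as above, comparing lm() with the lower-level lm.fit() and .lm.fit() and with normal-equation solves:

# Sketch: lm() vs. lower-level fitting routines on the same data.
set.seed(1)
data <- data.frame(y = rnorm(1000), x1 = rnorm(1000), x2 = rnorm(1000))
X <- cbind(1, data$x1, data$x2)

library(microbenchmark)
microbenchmark(
  lm        = lm(y ~ ., data = data),                    # full formula/model-frame interface
  lm.fit    = lm.fit(X, data$y),                         # skips formula handling, still QR
  .lm.fit   = .lm.fit(X, data$y),                        # bare-bones C-level fit
  solve     = solve(t(X) %*% X, t(X) %*% data$y),        # normal equations, as in the question
  crossprod = solve(crossprod(X), crossprod(X, data$y))  # normal equations without explicit transpose
)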

Adding a regression line to a ggplot

跟風遠走 submitted on 2019-11-26 12:07:50
Question: I'm trying hard to add a regression line to a ggplot. I first tried with abline but I didn't manage to make it work. Then I tried this... data = data.frame(x.plot=rep(seq(1,5),10),y.plot=rnorm(50)) ggplot(data,aes(x.plot,y.plot))+stat_summary(fun.data=mean_cl_normal) + geom_smooth(method='lm',formula=data$y.plot~data$x.plot) But that is not working either. Answer 1: In general, to provide your own formula you should use arguments x and y that will correspond to the values you provided in ggplot() -
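
The usual fix is that geom_smooth()'s formula argument must be written in terms of the aesthetics x and y, not the original column names. A minimal sketch along those lines (mean_cl_normal additionally requires the Hmisc package):

library(ggplot2)
set.seed(1)
data <- data.frame(x.plot = rep(seq(1, 5), 10), y.plot = rnorm(50))

# The formula refers to the mapped aesthetics, so it is y ~ x,
# regardless of what the columns are called in the data frame.
ggplot(data, aes(x = x.plot, y = y.plot)) +
  stat_summary(fun.data = mean_cl_normal) +   # needs Hmisc installed
  geom_smooth(method = "lm", formula = y ~ x)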

Specifying a formula for glm in R without explicitly declaring each covariate

蓝咒 submitted on 2019-11-26 12:01:43
Question: I would like to force specific variables into glm regressions without fully specifying each one. My real data set has ~200 variables. I haven't been able to find examples of this in my online searching so far. For example (with just 3 variables): n=200 set.seed(39) samp = data.frame(W1 = runif(n, min = 0, max = 1), W2=runif(n, min = 0, max = 5)) samp = transform(samp, # add A A = rbinom(n, 1, 1/(1+exp(-(W1^2-4*W1+1))))) samp = transform(samp, # add Y Y = rbinom(n, 1,1/(1+exp(-(A-sin(W1^2)
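
One common pattern is to build the formula programmatically from a character vector of column names, e.g. with reformulate(). A short sketch using a simplified version of the data above; the chosen subset of predictor names is hypothetical:

set.seed(39)
n <- 200
samp <- data.frame(W1 = runif(n, 0, 1), W2 = runif(n, 0, 5))
samp$A <- rbinom(n, 1, 1 / (1 + exp(-(samp$W1^2 - 4 * samp$W1 + 1))))
samp$Y <- rbinom(n, 1, 1 / (1 + exp(-(samp$A - sin(samp$W1^2)))))  # simplified stand-in for the truncated line above

# Option 1: use everything else as predictors.
fit1 <- glm(Y ~ ., data = samp, family = binomial)

# Option 2: build the formula from a vector of names -- useful with ~200 columns.
predictors <- c("A", "W1", "W2")                   # hypothetical subset
form <- reformulate(predictors, response = "Y")    # Y ~ A + W1 + W2
fit2 <- glm(form, data = samp, family = binomial)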

Java-R integration?

﹥>﹥吖頭↗ submitted on 2019-11-26 11:57:30
Question: I have a Java app which needs to perform partial least squares regression (PLSR). It would appear there are no Java implementations of PLSR out there. Weka may have had something like it at some point, but it is no longer in the API. On the other hand, I have found a good R implementation, which has an added bonus: it was used by the people whose results I want to replicate, which means there is less chance that things will go wrong because of differences in the way PLSR is implemented. The
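
One common bridge is Rserve, which lets a Java app push data into a running R session and call the R implementation directly. A minimal sketch, assuming an Rserve server is running locally (in R: library(Rserve); Rserve()) and that the pls package is installed; the package choice, data and variable names are assumptions, since the question does not name the R implementation:

import org.rosuda.REngine.REXP;
import org.rosuda.REngine.Rserve.RConnection;

public class PlsrFromJava {
    public static void main(String[] args) throws Exception {
        RConnection c = new RConnection();
        try {
            // Push predictors and response into the R session.
            c.assign("y",  new double[]{1.0, 2.1, 2.9, 4.2, 5.1, 5.8});
            c.assign("x1", new double[]{1, 2, 3, 4, 5, 6});
            c.assign("x2", new double[]{2, 1, 4, 3, 6, 5});

            // Fit PLSR with the pls package and pull back the fitted values.
            c.voidEval("library(pls)");
            c.voidEval("X <- cbind(x1, x2)");
            c.voidEval("fit <- plsr(y ~ X, ncomp = 2)");
            REXP fitted = c.eval("as.numeric(fitted(fit)[, , 2])");
            for (double v : fitted.asDoubles()) {
                System.out.println(v);
            }
        } finally {
            c.close();
        }
    }
}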

Stepwise regression using p-values to drop variables with nonsignificant p-values

牧云@^-^@ submitted on 2019-11-26 11:54:20
Question: I want to perform a stepwise linear regression using p-values as the selection criterion, e.g. at each step dropping the variable with the highest (i.e. least significant) p-value, and stopping when all remaining p-values fall below some threshold alpha. I am fully aware that I should use the AIC (e.g. the step or stepAIC commands) or some other criterion instead, but my boss has no grasp of statistics and insists on using p-values. If necessary, I could program my own routine, but I am
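
A backward-elimination loop on p-values is straightforward to write by hand. A minimal sketch, assuming purely numeric predictors (factor terms would need their term labels handled separately) and with the usual caveat that AIC-based step() is statistically preferable:

# Drop the least significant predictor until all remaining p-values are <= alpha.
backward_p <- function(form, data, alpha = 0.05) {
  fit <- lm(form, data = data)
  repeat {
    coefs <- summary(fit)$coefficients
    coefs <- coefs[rownames(coefs) != "(Intercept)", , drop = FALSE]
    if (nrow(coefs) == 0 || max(coefs[, 4]) <= alpha) break
    worst <- rownames(coefs)[which.max(coefs[, 4])]          # highest p-value
    fit <- update(fit, as.formula(paste(". ~ . -", worst)))  # refit without it
  }
  fit
}

set.seed(1)
dat <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
summary(backward_p(y ~ ., dat, alpha = 0.05))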

How does predict.lm() compute confidence interval and prediction interval?

微笑、不失礼 submitted on 2019-11-26 11:22:40
I ran a regression: CopierDataRegression <- lm(V1~V2, data=CopierData1) and my task was to obtain a 90% confidence interval for the mean response given V2=6 and a 90% prediction interval when V2=6. I used the following code: X6 <- data.frame(V2=6) predict(CopierDataRegression, X6, se.fit=TRUE, interval="confidence", level=0.90) predict(CopierDataRegression, X6, se.fit=TRUE, interval="prediction", level=0.90) and I got (87.3, 91.9) and (74.5, 104.8), which seems correct since the PI should be wider. The output for both also included se.fit = 1.39, which was the same. I don't understand what
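
The reason se.fit is identical in both calls is that it is the standard error of the estimated mean response; the prediction interval simply adds the residual variance on top of it. A sketch reproducing both intervals by hand, with simulated data standing in for CopierData1 (not available here):

# Simulated stand-in for CopierData1.
set.seed(1)
d <- data.frame(V2 = runif(45, 1, 10))
d$V1 <- 10 + 15 * d$V2 + rnorm(45, sd = 6)
fit <- lm(V1 ~ V2, data = d)

newdata <- data.frame(V2 = 6)
p <- predict(fit, newdata, se.fit = TRUE)

tcrit <- qt(0.95, df = p$df)                                              # two-sided 90% interval
ci <- p$fit + c(-1, 1) * tcrit * p$se.fit                                 # confidence: uncertainty in the mean
pi <- p$fit + c(-1, 1) * tcrit * sqrt(p$se.fit^2 + p$residual.scale^2)    # prediction: plus residual variance

# These match the built-in versions:
predict(fit, newdata, interval = "confidence", level = 0.90)
predict(fit, newdata, interval = "prediction", level = 0.90)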

lme4::lmer reports “fixed-effect model matrix is rank deficient”: do I need a fix, and if so, how?

拜拜、爱过 submitted on 2019-11-26 10:59:35
Question: I am trying to run a mixed-effects model that predicts F2_difference with the rest of the columns as predictors, but I get an error message that says fixed-effect model matrix is rank deficient so dropping 7 columns / coefficients. From this link, Fixed-effects model is rank deficient, I think I should use findLinearCombos from the R package caret. However, when I try findLinearCombos(data.df), it gives me the error message Error in qr.default(object) : NA/NaN/Inf in foreign function call
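
findLinearCombos() expects a purely numeric matrix with no NA/NaN/Inf values, so one workaround is to run it on the fixed-effect design matrix after dropping incomplete rows. A sketch, with data.df and the formula as placeholders for the actual data:

library(caret)

# Build a numeric design matrix: factors become dummy columns, the response is excluded,
# and rows with missing values are dropped (they cause the NA/NaN/Inf error).
X <- model.matrix(~ . - F2_difference, data = na.omit(data.df))

combos <- findLinearCombos(X)
combos$remove                 # indices of linearly dependent columns
colnames(X)[combos$remove]    # the columns/levels behind the coefficients lmer dropped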

scikit-learn cross validation, negative values with mean squared error

不羁的心 submitted on 2019-11-26 10:33:06
Question: When I use the following code with a data matrix X of size (952, 144) and an output vector y of size (952), the mean_squared_error metric returns negative values, which is unexpected. Do you have any idea why? from sklearn.svm import SVR from sklearn import cross_validation as CV reg = SVR(C=1., epsilon=0.1, kernel='rbf') scores = CV.cross_val_score(reg, X, y, cv=10, scoring='mean_squared_error') All values in scores are then negative. Answer 1: Trying to close this out, so am providing the answer that
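
The sign is by design: the unified scoring API always maximizes, so error metrics are reported negated ("greater is better"). A sketch using the current API names (model_selection and 'neg_mean_squared_error' have since replaced the module and scorer used above) and synthetic data in place of the (952, 144) matrix:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

# Synthetic stand-in for the question's data.
X, y = make_regression(n_samples=952, n_features=144, noise=10.0, random_state=0)

reg = SVR(C=1.0, epsilon=0.1, kernel='rbf')
scores = cross_val_score(reg, X, y, cv=10, scoring='neg_mean_squared_error')

mse = -scores   # negate to recover the actual mean squared errors
print(mse.mean(), mse.std())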

Fitting data with numpy

。_饼干妹妹 submitted on 2019-11-26 09:18:29
Question: Let me start by saying that what I get may not be what I expect, and perhaps you can help me here. I have the following data: >>> x array([ 3.08, 3.1 , 3.12, 3.14, 3.16, 3.18, 3.2 , 3.22, 3.24, 3.26, 3.28, 3.3 , 3.32, 3.34, 3.36, 3.38, 3.4 , 3.42, 3.44, 3.46, 3.48, 3.5 , 3.52, 3.54, 3.56, 3.58, 3.6 , 3.62, 3.64, 3.66, 3.68]) >>> y array([ 0.000857, 0.001182, 0.001619, 0.002113, 0.002702, 0.003351, 0.004062, 0.004754, 0.00546 , 0.006183, 0.006816, 0.007362, 0.007844, 0.008207, 0.008474, 0
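
For smooth data like this, a least-squares polynomial fit with numpy.polyfit is the usual starting point. A sketch using the first few points from the question; the degree is an arbitrary choice here and should be checked against the residuals on the full data set:

import numpy as np

x = np.array([3.08, 3.10, 3.12, 3.14, 3.16, 3.18, 3.20, 3.22, 3.24, 3.26])
y = np.array([0.000857, 0.001182, 0.001619, 0.002113, 0.002702,
              0.003351, 0.004062, 0.004754, 0.00546, 0.006183])

coeffs = np.polyfit(x, y, deg=3)   # least-squares fit; highest-degree coefficient first
y_hat = np.polyval(coeffs, x)      # evaluate the fitted polynomial at the data points

print(coeffs)
print(np.max(np.abs(y - y_hat)))   # worst-case residual as a quick fit check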

Find p-value (significance) in scikit-learn LinearRegression

那年仲夏 submitted on 2019-11-26 08:41:07
Question: How can I find the p-value (significance) of each coefficient? lm = sklearn.linear_model.LinearRegression() lm.fit(x,y) Answer 1: This is kind of overkill, but let's give it a go. First let's use statsmodels to find out what the p-values should be: import pandas as pd import numpy as np from sklearn import datasets, linear_model from sklearn.linear_model import LinearRegression import statsmodels.api as sm from scipy import stats diabetes = datasets.load_diabetes() X = diabetes.data y = diabetes
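
scikit-learn's LinearRegression does not expose p-values, so the usual workaround, which the answer above begins to show, is to refit the same model with statsmodels and read them off the OLS summary. A completed sketch of that idea:

import statsmodels.api as sm
from sklearn import datasets

diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target

X_const = sm.add_constant(X)      # statsmodels needs an explicit intercept column
ols = sm.OLS(y, X_const).fit()

print(ols.summary())              # full table: coefficients, t values, p-values
print(ols.pvalues)                # p-values as an array (intercept first)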