linear-regression

plot.lm(): extracting numbers labelled in the diagnostic Q-Q plot

前提是你 submitted on 2019-12-06 19:34:48

Question: For the simple example below, you can see that certain points are labelled in the ensuing diagnostic plots. How can I extract the row numbers identified in these plots, especially in the Normal Q-Q plot?

set.seed(2016)
maya <- data.frame(rnorm(100))
names(maya)[1] <- "a"
maya$b <- rnorm(100)
mara <- lm(b ~ a, data = maya)
plot(mara)

I tried using str(mara) to see if I could find a list there, but I can't see any of the numbers from the Normal Q-Q plot. Thoughts?

Answer 1: I have edited your …
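As a sketch of one approach (this is not the truncated answer above): plot.lm() labels the id.n points with the largest absolute standardized residuals, with id.n = 3 by default, so the row numbers flagged in the Normal Q-Q plot can be recovered directly from the fitted model:

# plot.lm() flags the id.n points (default 3) with the largest
# |standardized residual|; recover those row numbers directly.
rs <- rstandard(mara)
flagged <- head(order(abs(rs), decreasing = TRUE), 3)
flagged       # row numbers labelled in the Normal Q-Q plot
rs[flagged]   # their standardized residuals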

Predicting values using an OLS model with statsmodels

让人想犯罪 __ submitted on 2019-12-06 19:34:45

Question: I fitted a model using OLS (multiple linear regression). I split my data in half into training and test sets, and I would like to predict values for the second half of the labels.

model = OLS(labels[:half], data[:half])
predictions = model.predict(data[half:])

The problem is that I get an error:

File "/usr/local/lib/python2.7/dist-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/regression/linear_model.py", line 281, in predict
    return np.dot(exog, params)
ValueError: matrices …
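A minimal sketch of the usual fix (the synthetic data and labels below are hypothetical stand-ins for the question's arrays): on the model object itself, predict() takes the parameter vector as its first argument, which is why passing the test rows produces a shape mismatch in np.dot. Fitting first and predicting from the results object avoids this:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)                     # hypothetical stand-in data
data = sm.add_constant(rng.normal(size=(100, 3)))
labels = data @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=100)
half = len(labels) // 2

# Fit on the first half; the results object carries the estimated
# parameters, so predict() only needs the new exog rows.
results = sm.OLS(labels[:half], data[:half]).fit()
predictions = results.predict(data[half:])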

Statsmodels.formula.api OLS does not show statistical values of intercept

我的未来我决定 submitted on 2019-12-06 15:44:21

I am running the following source code:

import statsmodels.formula.api as sm

# Add one column of ones for the intercept term
X = np.append(arr=np.ones((50, 1)).astype(int), values=X, axis=1)

regressor_OLS = sm.OLS(endog=y, exog=X).fit()
print(regressor_OLS.summary())

where X is a 50x5 numpy array (before adding the intercept term) that looks like this:

[[0 1 165349.20 136897.80 471784.10]
 [0 0 162597.70 151377.59 443898.53] ...]

and y is a 50x1 numpy array with float values for the dependent variable. The first two columns are for a dummy variable with three different values. The rest of …
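Independent of the truncated diagnosis, the idiomatic way to add the intercept in statsmodels is add_constant(), which gives the column the label const so its row is easy to spot in the summary table. A minimal self-contained sketch with hypothetical stand-in data:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)                       # hypothetical stand-in data
X = rng.normal(size=(50, 5))
y = X @ np.array([3.0, 0.0, 2.0, 1.0, -1.0]) + rng.normal(size=50)

X_const = sm.add_constant(X)                         # prepends a labelled 'const' column
regressor_OLS = sm.OLS(endog=y, exog=X_const).fit()
print(regressor_OLS.summary())                       # the 'const' row holds the intercept stats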

Out of memory when using `outer` in solving my big normal equation for least squares estimation

∥☆過路亽.° submitted on 2019-12-06 14:48:37

Consider the following example in R:

x1 <- rnorm(100000)
x2 <- rnorm(100000)
g <- cbind(x1, x2, x1^2, x2^2)
gg <- t(g) %*% g
gginv <- solve(gg)
bigmatrix <- outer(x1, x2, "<=")
Gw <- t(g) %*% bigmatrix
beta <- gginv %*% Gw
w1 <- bigmatrix - g %*% beta

If I try to run this on my computer, it throws a memory error (because bigmatrix is too big). Do you know how I can achieve the same result without running into this problem? This is a least squares problem with 100,000 responses. Your bigmatrix is the response (matrix), beta is the coefficient (matrix), while w1 is the residual (matrix…
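One standard workaround, as a sketch (it assumes only per-column summaries of the residuals are needed downstream, here the residual sum of squares; the block size is arbitrary): process the 100,000 response columns in blocks so the full 100,000 x 100,000 matrix never exists in memory.

# Reuses g and gginv from the question; only a block of 'bigmatrix'
# columns is materialized at a time.
n <- length(x1)
block <- 1000L
rss <- numeric(n)
for (start in seq(1L, n, by = block)) {
  idx  <- start:min(start + block - 1L, n)
  bm   <- outer(x1, x2[idx], "<=") + 0   # n x |idx| slice, coerced to numeric
  beta <- gginv %*% (t(g) %*% bm)        # coefficients for this block
  w1   <- bm - g %*% beta                # residuals for this block
  rss[idx] <- colSums(w1^2)
}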

Method to find “cleanest” subset of data i.e. subset with lowest variability

这一生的挚爱 submitted on 2019-12-06 14:35:29

I am trying to find a trend in several datasets. The trend involves finding the best-fit line, though I imagine the procedure would not be too different for any other model (just possibly more time consuming). There are three conceivable scenarios:

1. All good data, where all the data fits a single trend with low variability.
2. All bad data, where all or most of the data exhibits tremendous variability and the entire dataset must be discarded.
3. Partially good data, where some of the data may be good while the rest needs to be discarded. If the net percentage of data with extreme variability is too high, then …
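For the third scenario, a RANSAC-style consensus search is one standard technique. This sketch is an illustration, not the truncated entry's answer; the two-point sample, iteration count, and inlier tolerance are arbitrary choices:

# Repeatedly fit a line through a random pair of points and keep the
# largest set of points that lie within 'tol' of that line.
ransac_line <- function(x, y, n_iter = 200L, tol = sd(y) / 2) {
  best <- integer(0)
  for (i in seq_len(n_iter)) {
    s <- sample(length(x), 2L)
    if (x[s[1]] == x[s[2]]) next                       # vertical pair, skip
    b <- (y[s[2]] - y[s[1]]) / (x[s[2]] - x[s[1]])     # slope through the pair
    a <- y[s[1]] - b * x[s[1]]                         # intercept
    inliers <- which(abs(y - (a + b * x)) < tol)
    if (length(inliers) > length(best)) best <- inliers
  }
  best                                                 # indices of the "cleanest" subset
}

The returned indices can then be refit with lm() on the inliers, and the inlier fraction used to decide among the three scenarios.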

Doing linear prediction with R: How to access the predicted parameter(s)?

可紊 submitted on 2019-12-06 13:30:42

I am new to R and I am trying to do linear prediction. Here is some simple data:

test.frame <- data.frame(year = 8:11, value = c(12050, 15292, 23907, 33991))

Say I want to predict the value for year = 12. This is what I am doing (experimenting with different commands):

lma <- lm(test.frame$value ~ test.frame$year)  # let's get a linear fit
summary(lma)                                   # let's see some parameters
attributes(lma)                                # let's see what parameters we can call
lma$coefficients                               # I get the intercept and gradient
predict(lm(test.frame$value ~ test.frame$year))
newyear <- 12                                  # new value for year
predict.lm(lma, newyear)                       # predicted…
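The usual fix, as a short sketch (this is the standard predict() idiom, not the entry's truncated continuation): fit with a data argument so the model knows its variable names, then pass newdata as a data frame with a matching year column.

# Fitting with 'data =' lets predict() match newdata columns by name.
lma <- lm(value ~ year, data = test.frame)
predict(lma, newdata = data.frame(year = 12))  # prediction for year 12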

Differences in Linear Regression in R and Python [closed]

╄→尐↘猪︶ㄣ submitted on 2019-12-06 11:06:10

Closed. This question is off-topic and is not currently accepting answers. Closed 2 years ago.

I was trying to match the linear regression results from R with those from Python, matching the coefficients for each independent variable. Below is the code; the data is uploaded here:

https://www.dropbox.com/s/oowe4irm9332s78/X.csv?dl=0
https://www.dropbox.com/s/79scp54unzlbwyk/Y.csv?dl=0

R code:

# define pathname = " "
X <- read.csv(file.path(pathname, "X.csv"), stringsAsFactors = F)
Y <- read.csv(file.path(pathname, "Y.csv"…
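A frequent source of mismatch in such comparisons (an assumption on my part, since the entry is cut off before the Python code appears): R's lm() includes an intercept automatically, while statsmodels' OLS does not, so the constant must be added explicitly on the Python side. A sketch with hypothetical stand-ins for the two CSV files:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)        # hypothetical stand-ins for X.csv / Y.csv
X = rng.normal(size=(100, 3))
Y = X @ np.array([1.5, -2.0, 0.7]) + 3.0 + rng.normal(size=100)

# lm(Y ~ ., data) adds an intercept by itself; match it explicitly here.
fit = sm.OLS(Y, sm.add_constant(X)).fit()
print(fit.params)                     # first entry corresponds to lm()'s (Intercept)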

Get all models from leaps regsubsets

我的未来我决定 submitted on 2019-12-06 10:42:59

Question: I used regsubsets to search for models. Is it possible to automatically create all the lm fits from the list of parameter selections?

library(leaps)
leaps <- regsubsets(y ~ x1 + x2 + x3, data, nbest = 1, method = "exhaustive")
summary(leaps)$which

  (Intercept)    x1    x2   x3
1        TRUE FALSE FALSE TRUE
2        TRUE FALSE  TRUE TRUE
3        TRUE  TRUE  TRUE TRUE

Now I would manually do model_1 <- lm(y ~ x3) and so on. How can this be automated to have them in a list?

Answer 1: I don't know why you want a list of all models. summary and …
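One way to automate this, as a sketch (assuming data contains y, x1, x2, x3 as in the question): turn each row of the which matrix into a formula with reformulate() and fit it with lm().

# Build one formula per row of summary(leaps)$which and fit each model.
which_mat <- summary(leaps)$which
models <- lapply(seq_len(nrow(which_mat)), function(i) {
  vars <- setdiff(colnames(which_mat)[which_mat[i, ]], "(Intercept)")
  lm(reformulate(vars, response = "y"), data = data)
})
# models[[1]] is lm(y ~ x3), models[[3]] is lm(y ~ x1 + x2 + x3), etc.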

Spark - create RDD of (label, features) pairs from CSV file

泄露秘密 submitted on 2019-12-06 09:18:29

I have a CSV file and want to perform a simple LinearRegressionWithSGD on the data. Sample data follows (the file has 99 rows in total, including the label row), and the objective is to predict the y_3 variable:

y_3,x_6,x_7,x_73_1,x_73_2,x_73_3,x_8
2995.3846153846152,17.0,1800.0,0.0,1.0,0.0,12.0
2236.304347826087,17.0,1432.0,1.0,0.0,0.0,12.0
2001.9512195121952,35.0,1432.0,0.0,1.0,0.0,5.0
992.4324324324324,17.0,1430.0,1.0,0.0,0.0,12.0
4386.666666666667,26.0,1430.0,0.0,0.0,1.0,25.0
1335.9036144578313,17.0,1432.0,0.0,1.0,0.0,5.0
1097.560975609756,17.0,1100.0,0.0,1.0,0.0,5.0
3526.6666666666665,26…
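A minimal PySpark sketch of the parsing step (it assumes a live SparkContext sc, and the file path is hypothetical; the first CSV column is the label, the rest are the features):

from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

raw = sc.textFile("data.csv")                      # hypothetical path
header = raw.first()
points = (raw.filter(lambda line: line != header)  # drop the label row
             .map(lambda line: [float(v) for v in line.split(",")])
             .map(lambda vals: LabeledPoint(vals[0], vals[1:])))

model = LinearRegressionWithSGD.train(points, iterations=100)

In practice LinearRegressionWithSGD is sensitive to feature scaling, so the step size and/or scaled features usually need attention with data on this scale.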

Applying lm() and predict() to multiple columns in a data frame

倾然丶 夕夏残阳落幕 submitted on 2019-12-06 09:08:05

Question: I have an example dataset below.

train <- data.frame(x1 = c(4, 5, 6, 4, 3, 5),
                    x2 = c(4, 2, 4, 0, 5, 4),
                    x3 = c(1, 1, 1, 0, 0, 1),
                    x4 = c(1, 0, 1, 1, 0, 0),
                    x5 = c(0, 0, 0, 1, 1, 1))

Suppose I want to create separate models for columns x3, x4, x5 based on columns x1 and x2. For example:

lm1 <- lm(x3 ~ x1 + x2)
lm2 <- lm(x4 ~ x1 + x2)
lm3 <- lm(x5 ~ x1 + x2)

I then want to take these models, apply them to a testing set using predict, and create a matrix that has each model's outcome as a column.

test <- data.frame…
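A sketch of one way to do this (the test values below are hypothetical, since the entry's test frame is cut off): fit one model per response column with reformulate(), then column-bind the predictions.

# Fit one lm per response, then collect predictions as matrix columns.
responses <- c("x3", "x4", "x5")
models <- lapply(responses, function(r)
  lm(reformulate(c("x1", "x2"), response = r), data = train))

test <- data.frame(x1 = c(2, 6), x2 = c(3, 1))   # hypothetical test set
preds <- sapply(models, predict, newdata = test) # one column per model
colnames(preds) <- responses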