linear-regression

R-squared on test data

╄→гoц情女王★ submitted on 2019-12-03 06:31:06
I fit a linear regression model on 75% of my data set, which includes ~11,000 observations and 143 variables:

    gl.fit <- lm(y[1:ceiling(length(y)*(3/4))] ~ ., data = x[1:ceiling(length(y)*(3/4)),])  # 3/4 for training

and I got an R^2 of 0.43. I then predicted on my test data using the remaining quarter:

    ytest <- y[(ceiling(length(y)*(3/4))+1):length(y)]
    x.test <- cbind(1, x[(ceiling(length(y)*(3/4))+1):length(y),])  # the rest for test
    yhat <- as.matrix(x.test) %*% gl.fit$coefficients  # calculate the predicted values

I would now like to calculate the R^2 value on my test data. Is there an easy way to do this?
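For reference, the test-set R^2 is just 1 - SS_res/SS_tot computed on the held-out data (here using the test-set mean, one common convention). A minimal sketch of the formula in Python/NumPy, with made-up ytest and yhat arrays standing in for the R objects above:

    import numpy as np

    ytest = np.array([3.1, 2.7, 4.5, 3.9, 5.0])    # held-out targets (made up)
    yhat  = np.array([2.9, 3.0, 4.2, 4.1, 4.7])    # model predictions (made up)

    ss_res = np.sum((ytest - yhat) ** 2)            # residual sum of squares
    ss_tot = np.sum((ytest - ytest.mean()) ** 2)    # total sum of squares
    print(1 - ss_res / ss_tot)                      # test-set R^2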

What is the most accurate method in python for computing the minimum norm solution or the solution obtained from the pseudo-inverse?

寵の児 submitted on 2019-12-03 05:17:54
My goal is to solve Kc = y with the pseudo-inverse (i.e. the minimum-norm solution): c = K^{+} y, so that the model is (hopefully) a high-degree polynomial f(x) = sum_i c_i x^i. I am especially interested in the underdetermined case, where we have more polynomial features than data points (few equations, too many variables/unknowns): columns = deg + 1 > N = rows. Note that K is the Vandermonde matrix of polynomial features. I was initially using the Python function np.linalg.pinv, but then I noticed something funky was going on, as I noted here: Why do different methods for solving Xc=y in python give different ...
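A minimal sketch (my own toy setup, not the poster's data) of the underdetermined case described above, comparing np.linalg.pinv with np.linalg.lstsq; both go through the SVD and return the minimum-norm solution, and differences usually come down to conditioning and where small singular values are cut off:

    import numpy as np

    rng = np.random.default_rng(0)
    N, deg = 5, 9                                # fewer data points than features
    x = rng.uniform(-1, 1, size=N)
    y = np.sin(np.pi * x)                        # toy target, just for illustration

    K = np.vander(x, deg + 1, increasing=True)   # N x (deg+1) Vandermonde matrix

    c_pinv = np.linalg.pinv(K) @ y                    # pseudo-inverse solution
    c_lstsq, *_ = np.linalg.lstsq(K, y, rcond=None)   # SVD-based least squares

    print(np.linalg.norm(c_pinv - c_lstsq))      # should be tiny for this small example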

Python pandas linear regression groupby

跟風遠走 submitted on 2019-12-03 04:33:01
Question: I am trying to run a linear regression on a pandas DataFrame, grouped by a column. This is the DataFrame df:

    group  date        value
    A      01-02-2016  16
    A      01-03-2016  15
    A      01-04-2016  14
    A      01-05-2016  17
    A      01-06-2016  19
    A      01-07-2016  20
    B      01-02-2016  16
    B      01-03-2016  13
    B      01-04-2016  13
    C      01-02-2016  16
    C      01-03-2016  16

    # import standard packages
    import pandas as pd
    import numpy as np
    # import ML packages
    from sklearn.linear_model import LinearRegression
    # First, let's group the data by group
    df_group = df.groupby(
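A minimal sketch of one way to finish this, assuming the goal is a per-group slope of value against time; the groupby/apply pattern, the np.polyfit call, and the %m-%d-%Y date format are my assumptions, not the poster's code:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "group": list("AAAAAABBBCC"),
        "date": ["01-02-2016", "01-03-2016", "01-04-2016", "01-05-2016", "01-06-2016",
                 "01-07-2016", "01-02-2016", "01-03-2016", "01-04-2016", "01-02-2016",
                 "01-03-2016"],
        "value": [16, 15, 14, 17, 19, 20, 16, 13, 13, 16, 16],
    })
    df["date"] = pd.to_datetime(df["date"], format="%m-%d-%Y")  # assumed date format

    def slope(g):
        # regress value on days elapsed since the group's first date
        t = (g["date"] - g["date"].min()).dt.days.to_numpy()
        return np.polyfit(t, g["value"].to_numpy(), 1)[0]       # slope of a degree-1 fit

    print(df.groupby("group")[["date", "value"]].apply(slope))  # one slope per group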

Time series prediction using R

不羁岁月 submitted on 2019-12-03 03:18:44
I have the following R code:

    library(forecast)
    value <- c(1.2, 1.7, 1.6, 1.2, 1.6, 1.3, 1.5, 1.9, 5.4, 4.2, 5.5, 6, 5.6, 6.2, 6.8, 7.1, 7.1, 5.8, 0, 5.2, 4.6, 3.6, 3, 3.8, 3.1, 3.4, 2, 3.1, 3.2, 1.6, 0.6, 3.3, 4.9, 6.5, 5.3, 3.5, 5.3, 7.2, 7.4, 7.3, 7.2, 4, 6.1, 4.3, 4, 2.4, 0.4, 2.4)
    sensor <- ts(value, frequency = 24)
    fit <- auto.arima(sensor)
    LH.pred <- predict(fit, n.ahead = 24)
    plot(sensor, ylim = c(0, 10), xlim = c(0, 5), type = "o", lwd = 1)
    lines(LH.pred$pred, col = "red", type = "o", lwd = 1)
    grid()

The resulting graph is [plot not shown], but I am not satisfied with the prediction. Is there any way to make the prediction look ...

Gradient descent algorithm won't converge

◇◆丶佛笑我妖孽 submitted on 2019-12-03 03:08:56
I'm trying to write out a bit of code for the gradient descent algorithm explained in the Stanford Machine Learning lecture (lecture 2, at around 25:00). Below is the implementation I used at first; I think it's properly copied over from the lecture, but it doesn't converge when I add large numbers (> 8) to the training set. I'm inputting a number X, and the point (X, X) is added to the training set, so at the moment I'm only trying to get it to converge to y = ax + b, where a = 1 = theta[1] and b = 0 = theta[0]. The training set is the arrays x and y, where (x[i], y[i]) is a point.

    void train()
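For comparison, a minimal Python sketch of a batch gradient-descent update for the same two-parameter fit (written with the mean rather than the sum over examples, and not the poster's C code). The usual reason such code diverges once large values enter the training set is that the fixed learning rate becomes too large for the scale of x:

    import numpy as np

    def train(x, y, alpha=0.01, iters=10000):
        theta0, theta1 = 0.0, 0.0
        for _ in range(iters):
            err = theta0 + theta1 * x - y       # residuals for all points
            theta0 -= alpha * err.mean()        # gradient step for the intercept
            theta1 -= alpha * (err * x).mean()  # gradient step for the slope
        return theta0, theta1

    x = np.array([1.0, 2.0, 3.0, 10.0])         # includes a "large" point
    y = x.copy()                                # points (X, X), so the target is y = x
    print(train(x, y, alpha=0.01))              # converges to roughly (0, 1)
    print(train(x, y, alpha=0.5, iters=20))     # blows up: alpha too big for this scale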

Linear Regression :: Normalization (Vs) Standardization

时光总嘲笑我的痴心妄想 submitted on 2019-12-03 02:12:36
Question: I am using linear regression to predict data, but I get totally contrasting results when I normalize versus standardize the variables.

    Normalization:            (x - xmin) / (xmax - xmin)
    Z-score standardization:  (x - xmean) / xstd

a) When should I normalize versus standardize? b) How does normalization affect linear regression? c) Is it okay if I don't normalize all the attributes/labels in the linear regression? Thanks, Santosh

Answer 1: Note that the results might not necessarily be so different. You might ...
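A minimal sketch (on made-up data) of both rescalings with scikit-learn, plus a check that plain, unregularized OLS gives the same predictions either way, with only the coefficients changing scale; the choice matters more once regularization or gradient-descent training enters the picture:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(50, 10, size=(100, 3))            # features on an arbitrary scale
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

    X_minmax = MinMaxScaler().fit_transform(X)       # (x - xmin) / (xmax - xmin)
    X_zscore = StandardScaler().fit_transform(X)     # (x - xmean) / xstd

    for name, Xs in [("raw", X), ("min-max", X_minmax), ("z-score", X_zscore)]:
        pred = LinearRegression().fit(Xs, y).predict(Xs)
        print(name, np.round(pred[:3], 3))           # identical predictions each time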

why gradient descent when we can solve linear regression analytically

北城以北 submitted on 2019-12-03 01:50:38
Question: What is the benefit of using gradient descent in the linear regression setting? It looks like we can solve the problem (finding theta_0..theta_n that minimize the cost function) analytically, so why would we still want to use gradient descent to do the same thing? Thanks.

Answer 1: When you use the normal equations to solve the cost function analytically, you have to compute

    theta = (X^T X)^{-1} X^T y

where X is your matrix of input observations and y your output vector. The problem with this operation is the time complexity of ...
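A minimal NumPy sketch of that analytic solution on synthetic data; solving the n x n normal-equation system costs roughly O(n^3) in the number of features n, which is the bottleneck the answer is pointing at and the reason gradient descent wins when n is very large:

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 1000, 5
    X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])   # prepend intercept column
    true_theta = rng.normal(size=n + 1)
    y = X @ true_theta + 0.1 * rng.normal(size=m)

    # Solve (X^T X) theta = X^T y directly, without forming an explicit inverse.
    theta = np.linalg.solve(X.T @ X, X.T @ y)
    print(np.round(theta - true_theta, 3))                      # close to zero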

OLS using statsmodels.formula.api versus statsmodels.api

我的梦境 submitted on 2019-12-03 00:19:39
Can anyone explain to me the difference between ols in statsmodels.formula.api versus ols in statsmodels.api? Using the Advertising data from the ISLR text, I ran an OLS with both and got different results. I then compared with scikit-learn's LinearRegression.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    import statsmodels.api as sm
    from sklearn.linear_model import LinearRegression

    df = pd.read_csv("C:\...\Advertising.csv")
    x1 = df.loc[:,['TV']]
    y1 = df.loc[:,['Sales']]

    print "Statsmodel.Formula.Api Method"
    model1 = smf.ols(formula='Sales ~ TV', data=df).fit()
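The usual cause of the discrepancy is the intercept: smf.ols adds one automatically when given a formula, whereas sm.OLS does not, so the constant has to be added explicitly for the two to agree. A hedged sketch on a small stand-in DataFrame (not the actual Advertising.csv):

    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    df = pd.DataFrame({"TV": [230.1, 44.5, 17.2, 151.5, 180.8],
                       "Sales": [22.1, 10.4, 9.3, 18.5, 12.9]})

    model_formula = smf.ols("Sales ~ TV", data=df).fit()               # intercept added for you
    model_api = sm.OLS(df["Sales"], sm.add_constant(df["TV"])).fit()   # add it yourself

    print(model_formula.params)
    print(model_api.params)   # with add_constant, the two parameter sets now agree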

Linear Regression on Pandas DataFrame using Sklearn (IndexError: tuple index out of range)

时光总嘲笑我的痴心妄想 submitted on 2019-12-02 22:29:02
I'm new to Python and trying to perform linear regression using sklearn on a pandas DataFrame. This is what I did:

    data = pd.read_csv('xxxx.csv')

After that I got a DataFrame with two columns; let's call them 'c1' and 'c2'. Now I want to do linear regression on the set of (c1, c2), so I entered

    X = data['c1'].values
    Y = data['c2'].values
    linear_model.LinearRegression().fit(X, Y)

which resulted in the following error:

    IndexError: tuple index out of range

What's wrong here? Also, I'd like to know how to visualize the result and how to make predictions based on the result. I've searched and browsed a large number of sites, but ...
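A minimal sketch of the usual fix, on made-up data standing in for 'xxxx.csv': scikit-learn expects X with shape (n_samples, n_features), so the single column has to be selected as a 2-D array; after that, fitting, inspecting the coefficients, and predicting all work:

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    data = pd.DataFrame({"c1": [1.0, 2.0, 3.0, 4.0], "c2": [2.1, 3.9, 6.2, 8.1]})

    X = data[["c1"]].values                  # shape (n, 1) instead of (n,)
    Y = data["c2"].values

    model = LinearRegression().fit(X, Y)
    print(model.intercept_, model.coef_)     # the fitted line
    print(model.predict([[5.0]]))            # prediction for a new c1 value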

Conditionally colour data points outside of confidence bands in R

久未见 submitted on 2019-12-02 21:14:51
I need to colour data points that fall outside of the confidence bands on the plot below differently from those within the bands. Should I add a separate column to my dataset to record whether the data points are within the confidence bands? Can you provide an example please?

Example dataset:

    ## Dataset from http://www.apsnet.org/education/advancedplantpath/topics/RModules/doc1/04_Linear_regression.html
    ## Disease severity as a function of temperature
    # Response variable, disease severity
    diseasesev <- c(1.9, 3.1, 3.3, 4.8, 5.3, 6.1, 6.4, 7.6, 9.8, 12.4)
    # Predictor variable, (Centigrade) temperature
    temperature<