linear-regression

What is the red solid line in the “residuals vs leverage” plot produced by `plot.lm()`?

╄→尐↘猪︶ㄣ submitted on 2019-12-07 21:18:24
Question:

```r
fit <- lm(dist ~ speed, cars)
plot(fit, which = 5)
```

What does the solid red line in the middle of the plot mean? I think it is not about Cook's distance.

Answer 1: It is the LOESS regression line (with `span = 2/3` and `degree = 2`), obtained by smoothing standardised residuals against leverage. Internally in `plot.lm()`, the variable `xx` holds leverage, while `rsp` holds the Pearson residuals (i.e., standardised residuals). The scatter plot as well as the red solid line is then drawn via:

```r
graphics::panel.smooth(xx, rsp)
```

Here is
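For readers who want to reproduce that curve outside R, here is a minimal Python sketch (statsmodels; the data and all variable names are illustrative) that recomputes the same ingredients, leverage and standardized residuals, and smooths one against the other:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.nonparametric.smoothers_lowess import lowess

# Illustrative stand-in for the cars regression
rng = np.random.default_rng(1)
x = rng.uniform(size=50)
y = 2 + 3 * x + rng.normal(size=50)

fit = sm.OLS(y, sm.add_constant(x)).fit()
influence = fit.get_influence()
leverage = influence.hat_matrix_diag               # x-axis of plot number 5
std_resid = influence.resid_studentized_internal   # standardized residuals

# The red line: a locally weighted regression of the standardized
# residuals on leverage, with the 2/3 span mentioned in the answer
smoothed = lowess(std_resid, leverage, frac=2/3)
```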

Use a function with a linear regression model

廉价感情. submitted on 2019-12-07 19:03:07
Question: I can run multiple linear regressions, and in each model estimate coefficients by removing one observation from the data.frame, like this:

```r
library(plyr)
as.data.frame(laply(1:nrow(mtcars), function(x) coef(lm(mpg ~ hp + wt, mtcars[-x,]))))
```

```
  (Intercept)          hp        wt
1    37.48509 -0.03207047 -3.918260
2    37.33931 -0.03219086 -3.877571
3    37.56512 -0.03216482 -3.939386
4    37.22292 -0.03171010 -3.880721
5    37.22437 -0.03185754 -3.876831
6    37.23686 -0.03340464 -3.781698
7    37.21965 -0.03030994 -3.927877
8    37
```
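For comparison, a minimal NumPy sketch of the same leave-one-out refitting (`loo_coefs` is an illustrative name, not a library function, and `X` is assumed to already contain an intercept column):

```python
import numpy as np

def loo_coefs(X, y):
    """Refit OLS n times, each time with one observation dropped."""
    n = len(y)
    coefs = np.empty((n, X.shape[1]))
    for i in range(n):
        keep = np.arange(n) != i  # boolean mask excluding row i
        coefs[i], *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    return coefs

# Illustrative usage: intercept + two predictors, 32 observations
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(32), rng.normal(size=(32, 2))])
y = X @ np.array([37.0, -0.03, -3.9]) + rng.normal(size=32)
print(loo_coefs(X, y)[:3])  # one coefficient row per left-out observation
```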

Python Parallel Computing - Scoop

主宰稳场 submitted on 2019-12-07 12:32:09
Question: I am trying to get familiar with the Scoop library (documentation here: https://media.readthedocs.org/pdf/scoop/0.7/scoop.pdf) to learn how to perform statistical computations in parallel, using in particular the `futures.map` function. As a first step, I would like to run a simple linear regression and assess the difference in performance between serial and parallel computation, using 10,000,000 data points (4 features, 1 target variable) randomly generated from a normal distribution.
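A hedged sketch of what the parallel side of that comparison could look like with Scoop's `futures.map` (the chunking, seeds, and true coefficients are all illustrative; Scoop scripts are launched with `python -m scoop script.py`). Each worker generates one chunk of data and returns its contribution to the normal equations, so only small 4x4 cross-products travel between processes rather than the data itself:

```python
import numpy as np
from scoop import futures

def partial_normal_eqs(seed):
    """Build one data chunk and return its X'X and X'y contributions."""
    rng = np.random.RandomState(seed)
    X = rng.normal(size=(1_000_000, 4))          # 4 features
    beta_true = np.array([1.0, -2.0, 0.5, 3.0])  # illustrative coefficients
    y = X @ beta_true + rng.normal(size=len(X))
    return X.T @ X, X.T @ y

if __name__ == "__main__":
    # 10 chunks x 1,000,000 rows = the 10,000,000 points in the question
    parts = list(futures.map(partial_normal_eqs, range(10)))
    XtX = sum(p[0] for p in parts)
    Xty = sum(p[1] for p in parts)
    print(np.linalg.solve(XtX, Xty))  # OLS estimates from the reduced sums
```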

How to compute minimal but fast linear regressions on each column of a response matrix?

巧了我就是萌 submitted on 2019-12-07 08:53:38
Question: I want to compute ordinary least squares (OLS) estimates in R without using `lm`, for several reasons. First, `lm` also computes lots of things I don't need (such as the fitted values), and data size is an issue in my case. Second, I want to be able to implement OLS myself in R before doing it in another language (e.g., in C with the GSL). As you may know, the model is Y = Xb + E, with E ~ N(0, sigma^2). As detailed below, b is a vector with 2 parameters, the mean (b0) and
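Since the closed-form estimator is b = (X'X)^(-1) X'Y, a minimal NumPy sketch (illustrative; the poster ultimately wants R and C/GSL, but the algebra is identical) shows how little is actually needed, and solving against a whole response matrix Y fits every column at once:

```python
import numpy as np

def ols_coefs(X, Y):
    """Closed-form OLS: solve (X'X) b = X'Y.

    X: (n, p) design matrix, including a column of ones for the intercept.
    Y: (n, m) response matrix; each column gets its own fit.
    Returns a (p, m) array of coefficients, one column per response.
    """
    # solve() is cheaper and more stable than explicitly inverting X'X
    return np.linalg.solve(X.T @ X, X.T @ Y)

# Illustrative usage: intercept + one predictor, three response columns
n = 100
X = np.column_stack([np.ones(n), np.random.normal(size=n)])
Y = np.random.normal(size=(n, 3))
print(ols_coefs(X, Y).shape)  # (2, 3)
```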

Multiple Linear Regression with specific constraint on each coefficients on Python

耗尽温柔 submitted on 2019-12-07 08:32:42
Question: I am currently running a multiple linear regression on a dataset. At first, I didn't realize I needed to put constraints on my weights; in fact, I need specific positive and negative weights. To be more precise, I am building a scoring system, which is why some of my variables should have a positive or a negative impact on the score. Yet, when running my model, the results do not match my expectations: some of my 'positive' variables get negative coefficients and vice versa.
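One way to impose per-coefficient sign constraints is a bounded least-squares solver such as SciPy's `lsq_linear`; here is a hedged sketch (the data and which signs go where are illustrative). Note that `lsq_linear` fits no intercept, so append a ones column with free bounds if one is needed:

```python
import numpy as np
from scipy.optimize import lsq_linear

# Illustrative data: first two coefficients must be >= 0, last two <= 0
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 4))
b = A @ np.array([1.5, 0.8, -2.0, -0.3]) + rng.normal(size=200)

lower = [0, 0, -np.inf, -np.inf]  # per-coefficient lower bounds
upper = [np.inf, np.inf, 0, 0]    # per-coefficient upper bounds
res = lsq_linear(A, b, bounds=(lower, upper))
print(res.x)  # constrained least-squares coefficients
```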

sklearn's PLSRegression: “ValueError: array must not contain infs or NaNs”

醉酒当歌 submitted on 2019-12-07 04:21:58
Question: When using `sklearn.cross_decomposition.PLSRegression`:

```python
import numpy as np
import sklearn.cross_decomposition

pls2 = sklearn.cross_decomposition.PLSRegression()
xx = np.random.random((5, 5))
yy = np.zeros((5, 5))
yy[0, :] = [0, 1, 0, 0, 0]
yy[1, :] = [0, 0, 0, 1, 0]
yy[2, :] = [0, 0, 0, 0, 1]
# yy[3, :] = [1, 0, 0, 0, 0]  # Uncommenting this line solves the issue
pls2.fit(xx, yy)
```

I get:

```
C:\Anaconda\lib\site-packages\sklearn\cross_decomposition\pls_.py:44: RuntimeWarning: invalid value encountered in divide
  x_weights
```
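The commented-out line is the hint: without it, two rows of `yy` are entirely zero, which plausibly leaves the iterative PLS weight updates dividing by a zero norm, hence the warning and the NaNs. A hedged pre-flight check (an illustrative helper, not part of sklearn):

```python
import numpy as np

def check_pls_inputs(X, Y, tol=1e-12):
    """Heuristic check for structure that tends to break PLS fits."""
    for name, M in (("X", np.asarray(X)), ("Y", np.asarray(Y))):
        if not np.all(np.isfinite(M)):
            raise ValueError(f"{name} contains infs or NaNs")
        if np.any(np.abs(M).sum(axis=1) < tol):
            print(f"warning: {name} has an all-zero row")
        if M.std(axis=0).min() < tol:
            print(f"warning: {name} has a zero-variance column")
```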

Multiple Linear Regression in Power BI

南笙酒味 submitted on 2019-12-07 04:13:18
Question: Suppose I have a set of returns and I want to compute its beta values versus different market indices. Let's use the following data in a table named Returns for the sake of having a concrete example:

```
Date         Equity   Duration   Credit   Manager
--------------------------------------------------
01/31/2017   2.907%    0.226%    1.240%    1.78%
02/28/2017   2.513%    0.493%    1.120%    3.88%
03/31/2017   1.346%   -0.046%   -0.250%    0.13%
04/30/2017   1.612%    0.695%    0.620%    1.04%
05/31/2017   2.209%    0.653%    0.480%    1.40%
06/30/2017   0
```
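Power BI aside, the betas can be sanity-checked with a short pandas/NumPy sketch, shown here as a hedged illustration using the five complete rows above (converted from percentages to decimals) and regressing Manager on the three indices jointly:

```python
import numpy as np
import pandas as pd

returns = pd.DataFrame({
    "Equity":   [0.02907, 0.02513, 0.01346, 0.01612, 0.02209],
    "Duration": [0.00226, 0.00493, -0.00046, 0.00695, 0.00653],
    "Credit":   [0.01240, 0.01120, -0.00250, 0.00620, 0.00480],
    "Manager":  [0.0178, 0.0388, 0.0013, 0.0104, 0.0140],
})

# Design matrix: intercept plus the three index return columns
X = np.column_stack([np.ones(len(returns)),
                     returns[["Equity", "Duration", "Credit"]].to_numpy()])
y = returns["Manager"].to_numpy()
alpha, *betas = np.linalg.lstsq(X, y, rcond=None)[0]
print(alpha, betas)  # intercept and one beta per index
```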

Shaping data for linear regression with TFlearn

爱⌒轻易说出口 submitted on 2019-12-07 00:09:26
I'm trying to expand the tflearn example for linear regression by increasing the number of columns to 21.

```python
from trafficdata import X, Y
import tflearn

print(X.shape)  # (1054, 21)
print(Y.shape)  # (1054,)

# Linear Regression graph
input_ = tflearn.input_data(shape=[None, 21])
linear = tflearn.single_unit(input_)
regression = tflearn.regression(linear, optimizer='sgd', loss='mean_square',
                                metric='R2', learning_rate=0.01)
m = tflearn.DNN(regression)
m.fit(X, Y, n_epoch=1000, show_metric=True, snapshot_epoch=False)

print("\nRegression result:")
print("Y = " + str(m.get_weights(linear.W)) + "*X + " +
```
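One hedged reading of the problem: `tflearn.single_unit` models y = w*x + b for a single feature, so a 21-column input needs one weight per column instead. A sketch under that assumption, replacing it with a one-unit dense layer:

```python
import tflearn

input_ = tflearn.input_data(shape=[None, 21])
# One linear unit over 21 inputs: weight matrix of shape (21, 1) plus a bias
linear = tflearn.fully_connected(input_, 1, activation='linear')
regression = tflearn.regression(linear, optimizer='sgd',
                                loss='mean_square', metric='R2',
                                learning_rate=0.01)
m = tflearn.DNN(regression)
# After m.fit(X, Y, ...): m.get_weights(linear.W) -> (21, 1) weights,
# m.get_weights(linear.b) -> the bias term
```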

OLS with pandas: datetime index as predictor

拥有回忆 submitted on 2019-12-06 21:48:10
Question: I would like to use the pandas OLS function to fit a trendline to my data Series. Does anyone know how to use the datetime index from the pandas Series as the predictor in the OLS? For example, let's say I have a simple time series:

```
>>> ts
2001-12-31    19.828763
2002-12-31    20.112191
2003-12-31    19.509116
2004-12-31    19.913656
2005-12-31    19.701649
2006-12-31    20.022819
2007-12-31    20.103024
2008-12-31    20.132712
2009-12-31    19.850609
2010-12-31    19.290640
2011-12-31    19.936210
2012-12-31    19.664813
Freq: A
```
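The `pandas.ols` function has since been removed from pandas, so here is a hedged sketch with statsmodels instead; either way, the key step is converting the datetime index into a plain numeric predictor (the year here, though ordinal dates work too). The series itself is reconstructed illustratively:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Illustrative series shaped like the one above
idx = pd.date_range("2001-12-31", periods=12, freq="A")
ts = pd.Series(np.random.normal(20, 0.3, 12), index=idx)

# A datetime index is not directly usable as a regressor; make it numeric
x = sm.add_constant(np.asarray(ts.index.year, dtype=float))
trend = sm.OLS(ts.values, x).fit()
print(trend.params)  # [intercept, slope] of the fitted trendline
```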

How to get the P Value in a Variable from OLSResults in Python?

二次信任 submitted on 2019-12-06 20:31:11
Question: The OLSResults of

```python
df2 = pd.read_csv("MultipleRegression.csv")
X = df2[['Distance', 'CarrierNum', 'Day', 'DayOfBooking']]
Y = df2['Price']
X = add_constant(X)

fit = sm.OLS(Y, X).fit()
print(fit.summary())
```

shows the p-values of each attribute to only 3 decimal places. I need to extract the p-value for each attribute, like `Distance`, `CarrierNum`, etc., and print it in scientific notation. I can extract the coefficients using `fit.params[0]`, `fit.params[1]`, etc.; I need the same for all their p-values.
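Continuing from the question's snippet: statsmodels' `OLSResults` exposes p-values directly through the `pvalues` attribute, which, with a DataFrame design matrix, is a pandas Series keyed by column name, so each value can be pulled out and printed in scientific notation:

```python
# fit is the OLSResults object from the snippet above
print(fit.pvalues["Distance"])        # a single attribute's p-value
for name, p in fit.pvalues.items():   # all of them, in scientific notation
    print(f"{name}: {p:.3e}")
```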