linear-regression

How to check for correlation among continuous and categorical variables in Python?

感情迁移 submitted on 2019-12-03 14:47:20
I have a dataset including categorical (binary) variables and continuous variables. I'm trying to apply a linear regression model to predict a continuous variable. Can someone please let me know how to check for correlation between the categorical variables and the continuous target variable? Current code:

import pandas as pd
df_hosp = pd.read_csv(r'C:\Users\LAPPY-2\Desktop\LengthOfStay.csv')
data = df_hosp[['lengthofstay', 'male', 'female', 'dialysisrenalendstage', 'asthma',
                'irondef', 'pneum', 'substancedependence',
                'psychologicaldisordermajor', 'depress', 'psychother',
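One common way to check this (a sketch of my own, not part of the question): for a binary predictor against a continuous target, the point-biserial correlation, which is Pearson's r with a 0/1 coding, can be computed with scipy. The column names below are taken from the snippet above; the file path and the subset of columns are assumptions.

import pandas as pd
from scipy import stats

df_hosp = pd.read_csv('LengthOfStay.csv')  # path assumed
binary_cols = ['male', 'female', 'dialysisrenalendstage', 'asthma', 'irondef']  # subset assumed
for col in binary_cols:
    # Point-biserial correlation of each binary column with the continuous target
    r, p = stats.pointbiserialr(df_hosp[col], df_hosp['lengthofstay'])
    print(col, round(r, 3), round(p, 4))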

Gradient descent and normal equation method for solving linear regression give different solutions

China☆狼群 submitted on 2019-12-03 14:42:50
I'm working on a machine learning problem and want to use linear regression as the learning algorithm. I have implemented two different methods to find the parameters theta of the linear regression model: gradient (steepest) descent and the normal equation. On the same data they should both give approximately the same theta vector. However, they do not. Both theta vectors are very similar on all elements but the first one, that is, the one multiplied by the column of 1's added to the data. Here is what the thetas look like (first column is the output of gradient descent, second the output of the normal equation): Grad desc Norm
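One plausible explanation (my assumption, since the asker's code and data are not shown) is that gradient descent has not fully converged, or that feature scaling was applied to the inputs but not handled consistently for the intercept term; with a correct implementation and enough iterations the two methods agree. A minimal NumPy sketch comparing them on synthetic data:

import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=(100, 2))]   # first column of 1's for the intercept
theta_true = np.array([3.0, 1.5, -2.0])
y = X @ theta_true + rng.normal(scale=0.1, size=100)

# Normal equation: theta = (X'X)^-1 X'y, solved without forming an explicit inverse
theta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent on the same design matrix
theta_gd = np.zeros(3)
alpha = 0.1
for _ in range(5000):
    theta_gd -= alpha * X.T @ (X @ theta_gd - y) / len(y)

print(theta_ne)
print(theta_gd)   # both should be close to theta_true, including the intercept element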

How does plot.lm() determine outliers for residual vs fitted plot?

浪尽此生 submitted on 2019-12-03 13:01:41
How does plot.lm() determine what points are outliers (that is, what points to label) for the residuals vs fitted plot? The only thing I found in the documentation is this: Details: sub.caption—by default the function call—is shown as a subtitle (under the x-axis title) on each plot when plots are on separate pages, or as a subtitle in the outer margin (if any) when there are multiple plots per page. The ‘Scale-Location’ plot, also called ‘Spread-Location’ or ‘S-L’ plot, takes the square root of the absolute residuals in order to diminish skewness (sqrt(|E|) is much less skewed than |E| for

Gradient descent algorithm won't converge

谁说胖子不能爱 submitted on 2019-12-03 12:45:28
Question: I'm trying to write out a bit of code for the gradient descent algorithm explained in the Stanford Machine Learning lecture (lecture 2, at around 25:00). Below is the implementation I used at first, and I think it's properly copied over from the lecture, but it doesn't converge when I add large numbers (>8) to the training set. I'm inputting a number X, and the point (X, X) is added to the training set, so at the moment I'm only trying to get it to converge to y = ax + b where a = 1 = theta[1]
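The usual culprit in this situation (an assumption on my part; the asker's implementation is not reproduced here) is a fixed learning rate that is too large for unscaled inputs, so each update overshoots and the parameters blow up as the training values grow. A small self-contained sketch of the effect:

import numpy as np

def gradient_descent(x, y, alpha, iters):
    # Batch gradient descent for y = theta[0] + theta[1] * x
    theta = np.zeros(2)
    A = np.c_[np.ones(len(x)), x]              # prepend the column of 1's
    for _ in range(iters):
        theta -= alpha * A.T @ (A @ theta - y) / len(y)
    return theta

x = np.array([1.0, 2.0, 5.0, 9.0, 12.0])       # points (X, X) including some larger X values
y = x.copy()                                    # target line is y = x

print(gradient_descent(x, y, alpha=0.1, iters=50))      # blows up: step size too large for these magnitudes
print(gradient_descent(x, y, alpha=0.01, iters=10000))  # approaches [0, 1], i.e. y = x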

support vector machines - a simple explanation?

南楼画角 submitted on 2019-12-03 10:38:44
So, I'm trying to understand how the SVM algorithm works, but I just cannot figure out how you transform some datasets into points in an n-dimensional space that have a mathematical meaning, so that the points can be separated by a hyperplane and classified. There's an example here: they are trying to classify pictures of tigers and elephants, and they say "We digitize them into 100x100 pixel images, so we have x in n-dimensional plane, where n=10,000", but my question is how do they transform the matrices that actually represent just some color codes into points that have a mathematical meaning
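A short illustration of the "image as a point" idea (my own sketch, not from the linked example): each 100x100 grayscale image is flattened into a single vector of 10,000 pixel intensities, and that vector is the point in 10,000-dimensional space that a linear SVM tries to separate from the points of the other class with a hyperplane.

import numpy as np

# Two made-up 100x100 grayscale images; the values are just pixel intensities.
tiger = np.random.rand(100, 100)
elephant = np.random.rand(100, 100)

# Flattening turns each image into one point in a 10,000-dimensional space.
x_tiger = tiger.reshape(-1)        # shape (10000,)
x_elephant = elephant.reshape(-1)  # shape (10000,)

# A linear SVM looks for weights w and bias b such that sign(w @ x + b)
# puts tiger vectors on one side of the hyperplane and elephant vectors on the other.
print(x_tiger.shape, x_elephant.shape)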

In the LinearRegression method in sklearn, what exactly is the fit_intercept parameter doing?

戏子无情 submitted on 2019-12-03 10:25:47
In the sklearn.linear_model.LinearRegression method, there is a parameter fit_intercept that can be set to True or False. I am wondering: if we set it to True, does it add an additional intercept column of all 1's to the dataset? If I already have a dataset with a column of 1's, does fit_intercept=False account for that, or does it force a zero-intercept model? Update: It seems people do not get my question. The question is basically: what if I already had a column of 1's in my dataset of predictors (the 1's are for the intercept)? Then, 1) if I use fit_intercept=False,
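A small experiment (my own sketch with made-up data) that makes the behaviour visible. As far as I know, fit_intercept=True does not literally append a column of 1's; it centers the data and recovers the intercept afterwards, but the fitted line is the same. With fit_intercept=False and an explicit column of 1's, the coefficient on that column takes over the role of the intercept and intercept_ stays at 0.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 1))
y = 2.0 + 3.0 * x[:, 0] + rng.normal(scale=0.1, size=50)

# Intercept handled by sklearn
m1 = LinearRegression(fit_intercept=True).fit(x, y)
print(m1.intercept_, m1.coef_)     # roughly 2.0 and [3.0]

# Intercept supplied as an explicit column of 1's
X1 = np.c_[np.ones(len(x)), x]
m2 = LinearRegression(fit_intercept=False).fit(X1, y)
print(m2.intercept_, m2.coef_)     # 0.0 and roughly [2.0, 3.0]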

How to plot the linear regression in R?

不问归期 submitted on 2019-12-03 09:49:49
Question: I want to work through the following case of linear regression in R:

year <- rep(2008:2010, each = 4)
quarter <- rep(1:4, 3)
cpi <- c(162.2, 164.6, 166.5, 166.0, 166.4, 167.0, 168.6, 169.5, 170.0, 172.0, 173.3, 174.0)
plot(cpi, xaxt = "n", ylab = "CPI", xlab = "")
axis(1, labels = paste(year, quarter, sep = "C"), at = 1:12, las = 3)
fit <- lm(cpi ~ year + quarter)

I want to plot the line that shows the linear regression of the data that I process. I have tried:

abline(fit)
abline(fit$coefficients[[1]], c(fit$coefficients[[2]], fit$coefficients[[3

Plot linear model in 3d with Matplotlib

北慕城南 submitted on 2019-12-03 08:31:16
I'm trying to create a 3D plot of a linear model fit for a data set. I was able to do this relatively easily in R, but I'm really struggling to do the same in Python. Here is what I've done in R: Here's what I've done in Python:

from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

csv = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
model = sm.ols(formula='Sales ~ TV + Radio', data=csv)
fit = model.fit()
fit.summary()
fig = plt.figure()
ax = fig.add_subplot(111,
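One minimal way to finish this kind of plot (my own sketch; it uses synthetic stand-in data with the column names TV, Radio, Sales assumed from the snippet above, and plain least squares instead of statsmodels) is to evaluate the fitted plane on a grid, draw it with plot_surface, and add the observations with scatter:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401, registers the 3d projection on older matplotlib

# Synthetic stand-in for the Advertising data
rng = np.random.default_rng(0)
tv = rng.uniform(0, 300, 200)
radio = rng.uniform(0, 50, 200)
sales = 3 + 0.05 * tv + 0.2 * radio + rng.normal(scale=1.5, size=200)

# Fit the plane sales ~ tv + radio by least squares
A = np.c_[np.ones_like(tv), tv, radio]
b0, b1, b2 = np.linalg.lstsq(A, sales, rcond=None)[0]

# Evaluate the fitted plane on a grid and draw it beneath the scatter points
tv_grid, radio_grid = np.meshgrid(np.linspace(0, 300, 30), np.linspace(0, 50, 30))
sales_grid = b0 + b1 * tv_grid + b2 * radio_grid

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(tv_grid, radio_grid, sales_grid, alpha=0.3)
ax.scatter(tv, radio, sales, color='r', s=10)
ax.set_xlabel('TV')
ax.set_ylabel('Radio')
ax.set_zlabel('Sales')
plt.show()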

How to obtain RMSE out of lm result?

不问归期 submitted on 2019-12-03 07:38:24
Question: I know there is a small difference between $sigma and the concept of root mean squared error. So, I am wondering: what is the easiest way to obtain the RMSE out of the lm function in R?

res <- lm(randomData$price ~ randomData$carat + randomData$cut + randomData$color +
          randomData$clarity + randomData$depth + randomData$table +
          randomData$x + randomData$y + randomData$z)

length(coefficients(res)) shows 24 coefficients, and I cannot build my model manually anymore. So, how can I evaluate the RMSE based on

Converting Numpy Lstsq residual value to R^2

流过昼夜 submitted on 2019-12-03 07:17:43
I am performing a least squares regression as below (univariate). I would like to express the significance of the result in terms of R^2. NumPy returns the unscaled residual; what would be a sensible way of normalizing this?

from numpy import hstack, ones, log10
from numpy.linalg import lstsq

field_clean, back_clean = rid_zeros(backscatter, field_data)
num_vals = len(field_clean)
x = field_clean[:, row:row+1]
y = 10 * log10(back_clean)
A = hstack([x, ones((num_vals, 1))])
soln = lstsq(A, y)
m, c = soln[0]
residues = soln[1]
print(residues)

See http://en.wikipedia.org/wiki/Coefficient_of_determination. Your R^2 value = 1 - residual / sum((y - y.mean())**2)
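A small self-contained check of that formula (my own example; the variable names differ from the snippet above): np.linalg.lstsq returns the sum of squared residuals, and dividing it by the total sum of squares gives 1 - R^2.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.5 * x + 1.0 + rng.normal(scale=1.0, size=50)

A = np.c_[x, np.ones_like(x)]
coeffs, residual, rank, sv = np.linalg.lstsq(A, y, rcond=None)

ss_res = residual[0]                      # sum of squared residuals returned by lstsq
ss_tot = np.sum((y - y.mean()) ** 2)      # total sum of squares
r_squared = 1 - ss_res / ss_tot
print(r_squared)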