linear-regression

Python pandas has no attribute ols - Error (rolling OLS)

亡梦爱人 submitted on 2019-12-05 01:26:48
Question: For my evaluation, I wanted to run a rolling 1000-window OLS regression estimation on the dataset found at this URL: https://drive.google.com/open?id=0B2Iv8dfU4fTUa3dPYW5tejA0bzg using the following Python script.

    # /usr/bin/python -tt
    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
    from statsmodels.formula.api import ols

    df = pd.read_csv('estimated.csv', names=('x','y'))
    model = pd.stats.ols.MovingOLS(y=df.Y, x=df[['y']], window_type='rolling', window=1000, intercept…
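pd.stats.ols.MovingOLS was removed from pandas, which is what the "no attribute ols" error reports; the usual modern replacement is RollingOLS from statsmodels. Below is a minimal sketch, assuming statsmodels 0.10+ and the 'x'/'y' column names from the read_csv call above; it regresses y on x (note the question's df.Y would fail anyway, since the column is lower-case 'y').

```python
# A sketch using statsmodels' RollingOLS in place of the removed
# pd.stats.ols.MovingOLS; assumes statsmodels >= 0.10.
import pandas as pd
import statsmodels.api as sm
from statsmodels.regression.rolling import RollingOLS

df = pd.read_csv('estimated.csv', names=('x', 'y'))
X = sm.add_constant(df[['x']])                  # intercept column plus x
results = RollingOLS(df['y'], X, window=1000).fit()
print(results.params.tail())                    # rolling intercept and slope
```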

AnalysisException: u\"cannot resolve 'name' given input columns: [ list] in sqlContext in spark

岁酱吖の submitted on 2019-12-05 01:11:33
I tried a simple example like:

    data = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load("/databricks-datasets/samples/population-vs-price/data_geo.csv")
    data.cache()  # Cache data for faster reuse
    data = data.dropna()  # drop rows with missing values
    data = data.select("2014 Population estimate", "2015 median sales price").map(lambda r: LabeledPoint(r[1], [r[0]])).toDF()

It works well, but when I try something very similar like:

    data = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load('/mnt/%s…
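A hedged note on the likely cause: a "cannot resolve" AnalysisException usually means the referenced column name does not match what Spark actually inferred, which is easy to hit with headers containing spaces or a header row that was not parsed. A minimal diagnostic sketch; the renamed column names are placeholders of my choosing.

```python
# Inspect the inferred schema before selecting columns, then rename
# space-containing headers so they are easier to reference safely.
data = (sqlContext.read.format("csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .load("/databricks-datasets/samples/population-vs-price/data_geo.csv"))

data.printSchema()   # verify the exact column names Spark inferred

data = (data.withColumnRenamed("2014 Population estimate", "population_2014")
            .withColumnRenamed("2015 median sales price", "price_2015"))
```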

Predicting values using an OLS model with statsmodels

断了今生、忘了曾经 submitted on 2019-12-05 00:52:17
I calculated a model using OLS (multiple linear regression). I divided my data into train and test sets (half each), and now I would like to predict values for the second half of the labels.

    model = OLS(labels[:half], data[:half])
    predictions = model.predict(data[half:])

The problem is that I get an error:

    File "/usr/local/lib/python2.7/dist-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/regression/linear_model.py", line 281, in predict
        return np.dot(exog, params)
    ValueError: matrices are not aligned

I have the following array shapes: data.shape: (426, 215), labels.shape: (426,). If I…
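A minimal sketch of the usual fix: call fit() first and predict from the fitted results object, keeping the prediction exog the same width as the training exog. The arrays below are random stand-ins with the question's shapes.

```python
import numpy as np
import statsmodels.api as sm

# Stand-ins shaped like the question's arrays: data (426, 215), labels (426,).
rng = np.random.default_rng(0)
data = rng.normal(size=(426, 215))
labels = data @ rng.normal(size=215) + rng.normal(size=426)
half = len(labels) // 2

results = sm.OLS(labels[:half], data[:half]).fit()   # fit first
predictions = results.predict(data[half:])           # then predict from results
print(predictions.shape)                             # (213,)
```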

plot.lm(): extracting numbers labelled in the diagnostic Q-Q plot

这一生的挚爱 submitted on 2019-12-04 23:32:38
For the simple example below, you can see that certain points are identified in the ensuing plots. How can I extract the row numbers identified in these plots, especially the Normal Q-Q plot?

    set.seed(2016)
    maya <- data.frame(rnorm(100))
    names(maya)[1] <- "a"
    maya$b <- rnorm(100)
    mara <- lm(b ~ a, data = maya)
    plot(mara)

I tried using str(mara) to see if I could find a list there, but I can't see any of the numbers from the Normal Q-Q plot. Thoughts?

Answer: I have edited your question to use set.seed(2016) for reproducibility. To answer your question, I need to explain how to produce…
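The question itself is about R, but the underlying mechanism is easy to state: plot.lm labels the id.n most extreme points (three by default), i.e. the observations with the largest absolute standardized residuals. A sketch of the same computation in Python, as an illustration only:

```python
# Reproduce the idea behind plot.lm's labels: find the rows with the
# largest |standardized residuals| of a simple regression.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2016)
a = rng.normal(size=100)
b = rng.normal(size=100)

fit = sm.OLS(b, sm.add_constant(a)).fit()
std_resid = fit.get_influence().resid_studentized_internal
labelled = np.argsort(np.abs(std_resid))[-3:]   # the 3 rows plot.lm would flag
print(labelled)
```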

Covariance matrix from np.polyfit() has negative diagonal?

陌路散爱 submitted on 2019-12-04 22:25:45
Problem: the cov=True option of np.polyfit() produces a diagonal with nonsensical negative values. UPDATE: after playing with this some more, I am really starting to suspect a bug in numpy. Is that possible? Deleting any pair of 13 values from the dataset will fix the problem. I am using np.polyfit() to calculate the slope and intercept coefficients of a dataset. A plot of the values produces a very (but not perfectly) linear graph. I am attempting to get the standard deviations of these coefficients with np.sqrt(np.diag(cov)); however, this throws an error because the diagonal…
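Without the dataset it is hard to be definitive, but a negative diagonal usually signals a numerically ill-conditioned fit, or in some numpy versions the small-sample scaling factor applied to the covariance, rather than a genuinely negative variance. A sketch of one standard mitigation, centring and scaling x before fitting, using made-up data:

```python
import numpy as np

# Hypothetical, badly scaled data: a huge offset in x makes the normal
# equations behind polyfit ill-conditioned, which can corrupt cov.
rng = np.random.default_rng(0)
x = np.linspace(1e6, 1e6 + 100.0, 50)
y = 2.0 * x + 3.0 + rng.normal(size=x.size)

xs = (x - x.mean()) / x.std()                 # centre and scale x
coeffs, cov = np.polyfit(xs, y, 1, cov=True)
print(np.sqrt(np.diag(cov)))                  # sensible, positive diagonal
```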

R - Linear Regression - Control for a variable

冷暖自知 submitted on 2019-12-04 21:16:05
I have a computer science background and I am trying to teach myself data science by solving the problems available on the internet. I have a smallish data set with 3 variables: race, gender, and annual income. There are about 10,000 sample observations. I am trying to predict income from race and gender. I have divided the data into 2 parts, one for each gender, and now I am trying to create 2 regression models. Is this possible in R? Can someone provide example syntax?

Answer: You don't specify how your data are stored or how the variable race is recorded (is it a factor?). [If you're just fitting…
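The question asks for R, but the split-by-group pattern itself is language-agnostic; for illustration in this digest's dominant language, a Python sketch with a hypothetical DataFrame ('race', 'gender', 'income' columns of my choosing):

```python
# One income-on-race OLS per gender group, via groupby.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "gender": ["m", "f", "m", "f", "m", "f", "m", "f"],
    "race":   ["a", "b", "b", "a", "a", "b", "b", "a"],
    "income": [40.0, 42.0, 38.0, 45.0, 41.0, 43.0, 39.0, 44.0],
})

models = {g: smf.ols("income ~ C(race)", data=sub).fit()
          for g, sub in df.groupby("gender")}
print(models["m"].params)   # coefficients for the male-only model
```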

How does plot.lm() determine outliers for residual vs fitted plot?

限于喜欢 submitted on 2019-12-04 19:50:57
Question: How does plot.lm() determine which points are outliers (that is, which points to label) in the residuals vs. fitted plot? The only thing I found in the documentation is this:

    Details
    sub.caption—by default the function call—is shown as a subtitle (under the x-axis title) on each plot when plots are on separate pages, or as a subtitle in the outer margin (if any) when there are multiple plots per page. The ‘Scale-Location’ plot, also called ‘Spread-Location’ or ‘S-L’ plot, takes the square root of…

How to extract the regression coefficient from statsmodels.api?

故事扮演 submitted on 2019-12-04 19:07:19
Question:

    result = sm.OLS(gold_lookback, silver_lookback).fit()

After I get the result, how can I get the coefficient and the constant? In other words, if y = ax + c, how do I get the values of a and c?

Answer 1: You can use the params property of a fitted model to get the coefficients. For example, the following code:

    import statsmodels.api as sm
    import numpy as np

    np.random.seed(1)
    X = sm.add_constant(np.arange(100))
    y = np.dot(X, [1, 2]) + np.random.normal(size=100)
    result = sm.OLS(y, X).fit()
    print(result…
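The answer's snippet is cut off above; here is a self-contained sketch of the same idea, reading the intercept and slope out of result.params (with the constant added first, the intercept comes first):

```python
import numpy as np
import statsmodels.api as sm

np.random.seed(1)
X = sm.add_constant(np.arange(100))            # column of ones, then x
y = np.dot(X, [1, 2]) + np.random.normal(size=100)

result = sm.OLS(y, X).fit()
c, a = result.params                           # intercept first, then slope
print(a, c)                                    # a ≈ 2, c ≈ 1
```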

How can I obtain segmented linear regressions with a priori breakpoints?

好久不见. submitted on 2019-12-04 18:02:19
I need to explain this in excruciating detail because I don't have the statistics background to explain it more succinctly. I'm asking here on SO because I am looking for a Python solution, but I might go to stats.SE if that is more appropriate. I have downhole well data; it might look a bit like this:

    Rt        T
    0.0000    15.0000
    4.0054    15.4523
    25.1858   16.0761
    27.9998   16.2013
    35.7259   16.5914
    39.0769   16.8777
    45.1805   17.3545
    45.6717   17.3877
    48.3419   17.5307
    51.5661   17.7079
    64.1578   18.4177
    66.8280   18.5750
    111.1613  19.8261
    114.2518  19.9731
    121.8681  20.4074
    146.0591  21.2622
    148.8134  21.4117
    164.6219  22.1776
    176.5220  23…
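Since the question explicitly asks for Python, a minimal sketch of the simplest reading of "a priori breakpoints": split Rt at the known break values and fit an independent least-squares line per segment with np.polyfit. The breakpoints below are hypothetical, and the data are abbreviated from the table above.

```python
import numpy as np

# Abbreviated (Rt, T) pairs from the question's table.
rt = np.array([0.0, 4.0054, 25.1858, 27.9998, 35.7259, 39.0769, 45.1805,
               45.6717, 48.3419, 51.5661, 64.1578, 66.8280, 111.1613,
               114.2518, 121.8681, 146.0591, 148.8134, 164.6219])
t = np.array([15.0, 15.4523, 16.0761, 16.2013, 16.5914, 16.8777, 17.3545,
              17.3877, 17.5307, 17.7079, 18.4177, 18.5750, 19.8261,
              19.9731, 20.4074, 21.2622, 21.4117, 22.1776])

breakpoints = [50.0, 120.0]                    # hypothetical a priori breaks
edges = [rt.min()] + breakpoints + [rt.max()]

# One independent line per segment; boundary points fall in both segments.
for lo, hi in zip(edges[:-1], edges[1:]):
    mask = (rt >= lo) & (rt <= hi)
    slope, intercept = np.polyfit(rt[mask], t[mask], 1)
    print(f"Rt in [{lo:.1f}, {hi:.1f}]: T ~ {slope:.4f}*Rt + {intercept:.4f}")
```

One caveat worth noting: independently fitted segments need not meet at the breakpoints; a continuous piecewise fit requires a constrained least-squares setup instead.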

Applying lm() and predict() to multiple columns in a data frame

淺唱寂寞╮ submitted on 2019-12-04 17:26:19
I have an example dataset below.

    train <- data.frame(x1 = c(4,5,6,4,3,5),
                        x2 = c(4,2,4,0,5,4),
                        x3 = c(1,1,1,0,0,1),
                        x4 = c(1,0,1,1,0,0),
                        x5 = c(0,0,0,1,1,1))

Suppose I want to create separate models for columns x3, x4, x5 based on columns x1 and x2. For example:

    lm1 <- lm(x3 ~ x1 + x2)
    lm2 <- lm(x4 ~ x1 + x2)
    lm3 <- lm(x5 ~ x1 + x2)

I then want to take these models, apply them to a testing set using predict, and create a matrix that has each model's outcome as a column.

    test <- data.frame(x1 = c(4,3,2,1,5,6),
                       x2 = c(4,2,1,6,8,5))
    p1 <- predict(lm1, newdata = test)
    p2 <- predict(lm2,…
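The question is about R, but the fit-per-response-column pattern translates directly; for illustration in this digest's dominant language, a Python sketch using the question's own train and test values:

```python
# Fit one linear model per response column, then collect the
# test-set predictions as columns of a single matrix.
import pandas as pd
import statsmodels.api as sm

train = pd.DataFrame({"x1": [4, 5, 6, 4, 3, 5], "x2": [4, 2, 4, 0, 5, 4],
                      "x3": [1, 1, 1, 0, 0, 1], "x4": [1, 0, 1, 1, 0, 0],
                      "x5": [0, 0, 0, 1, 1, 1]})
test = pd.DataFrame({"x1": [4, 3, 2, 1, 5, 6], "x2": [4, 2, 1, 6, 8, 5]})

X_train = sm.add_constant(train[["x1", "x2"]])
X_test = sm.add_constant(test[["x1", "x2"]])

preds = {y: sm.OLS(train[y], X_train).fit().predict(X_test)
         for y in ["x3", "x4", "x5"]}
print(pd.DataFrame(preds))   # 6 rows, one prediction column per model
```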