regression

Difference between the interaction : and * term for formulas in StatsModels OLS regression

落花浮王杯 · submitted 2019-12-03 15:21:25
Hi, I'm learning statsmodels and can't figure out the difference between the : and * interaction terms in formulas for statsmodels OLS regression. Could you please give me a hint? Thank you! The documentation: http://statsmodels.sourceforge.net/devel/example_formulas.html

Answer (Yaron): ":" gives a regression without the terms themselves, just the interaction you specified, while "*" gives a regression with the terms themselves plus the interaction. For example, with GLMmodel = glm("y ~ a:b", data = df) you'll have only one independent variable, namely the interaction of a and b.
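For reference, a minimal sketch (not from the original thread) showing the two expansions with the statsmodels formula API; the data frame and the columns a, b, y are made up for illustration:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=100), "b": rng.normal(size=100)})
df["y"] = 1 + 2 * df["a"] - df["b"] + 0.5 * df["a"] * df["b"] + rng.normal(size=100)

m_colon = smf.ols("y ~ a:b", data=df).fit()   # interaction only
m_star = smf.ols("y ~ a*b", data=df).fit()    # main effects + interaction

print(m_colon.params.index.tolist())  # ['Intercept', 'a:b']
print(m_star.params.index.tolist())   # ['Intercept', 'a', 'b', 'a:b']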

B Spline confusion

女生的网名这么多〃 · submitted 2019-12-03 14:35:21
I realise that there are posts on the topic of B-splines on this board, but those have actually made me more confused, so I thought someone might be able to help. I have simulated data with x-values ranging from 0 to 1. I'd like to fit a cubic spline (degree = 3) to my data, with knots at 0, 0.1, 0.2, ..., 0.9, 1. I'd also like to use the B-spline basis and OLS for parameter estimation (I'm not looking for penalised splines). I think I need the bs function from the splines package, but I'm not quite sure, and I also don't know what exactly to feed it. I'd also like to plot the resulting spline.
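The question asks about R's splines::bs, but the same construction can be sketched in Python, since patsy's bs() mirrors it closely; a hedged sketch with simulated stand-in data:

import numpy as np
import statsmodels.api as sm
from patsy import dmatrix

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.size)

# Cubic B-spline basis: interior knots at 0.1, ..., 0.9,
# boundary knots at 0 and 1 via lower_bound/upper_bound.
knots = tuple(np.round(np.arange(0.1, 1.0, 0.1), 1))
basis = dmatrix("bs(x, knots=knots, degree=3, lower_bound=0, upper_bound=1)",
                {"x": x, "knots": knots}, return_type="dataframe")

fit = sm.OLS(y, basis).fit()   # plain OLS on the basis; no penalty
y_hat = fit.fittedvalues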

Performing lm() and segmented() on multiple columns in R

不羁岁月 · submitted 2019-12-03 14:15:01
Question: I am trying to perform lm() and segmented() in R using the same independent variable (x) and multiple dependent response variables (Curve1, Curve2, etc.), one at a time. I wish to extract the estimated breakpoint and the model coefficients for each response variable. An excerpt of my data is below.

    x          Curve1    Curve2    Curve3
1   -0.236422  98.8169   95.6828   101.7910
2   -0.198083  98.3260   95.4185   101.5170
3   -0.121406  97.3442   94.8899   100.9690
4    0.875399  84.5815   88.0176   93.8424
5    0.913738  84.1139   87.7533
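The question is about R's segmented package, but the looping pattern itself is easy to sketch; below is a hedged Python analogue on toy data, using a crude grid search for a single breakpoint via a hinge term (not the iterative algorithm segmented() actually uses):

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = np.linspace(-0.3, 1.0, 60)
df = pd.DataFrame({"x": x})
for name, bp in [("Curve1", 0.2), ("Curve2", 0.4), ("Curve3", 0.6)]:
    df[name] = 100 - 5 * x - 10 * np.clip(x - bp, 0, None) + rng.normal(scale=0.3, size=x.size)

def fit_segmented(x, y, candidates):
    # For each candidate breakpoint c, fit y ~ x + (x - c)_+ and keep
    # the c with the smallest sum of squared residuals.
    best = None
    for c in candidates:
        X = sm.add_constant(np.column_stack([x, np.clip(x - c, 0, None)]))
        res = sm.OLS(y, X).fit()
        if best is None or res.ssr < best[0]:
            best = (res.ssr, c, res.params)
    return best[1], best[2]

candidates = np.linspace(x.min() + 0.05, x.max() - 0.05, 100)
for col in ["Curve1", "Curve2", "Curve3"]:
    bp, coefs = fit_segmented(df["x"].values, df[col].values, candidates)
    print(col, "breakpoint:", round(bp, 3), "coefficients:", np.round(coefs, 3))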

Gaussian Process scikit-learn - Exception

孤街浪徒 · submitted 2019-12-03 13:34:16
I want to use Gaussian processes to solve a regression task. My data are as follows: each X vector has a length of 37, and each Y vector has a length of 8. I'm using the sklearn package in Python, but trying to use Gaussian processes raises an exception:

from sklearn import gaussian_process
print "x :", x__
print "y :", y__
gp = gaussian_process.GaussianProcess(theta0=1e-2, thetaL=1e-4, thetaU=1e-1)
gp.fit(x__, y__)

x : [[ 136. 137. 137. 132. 130. 130. 132. 133. 134. 135. 135. 134. 134. 1139. 1019. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 70. 24. 55. 0. 9. 0. 0.]
 [ 136. 137. 137. 132. 130
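Note that the GaussianProcess class used above is the old sklearn API (deprecated in 0.18 and later removed). A hedged sketch with the current GaussianProcessRegressor, which accepts a 2-D target; the random arrays stand in for the question's data and only reproduce its shapes:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 37))   # 50 samples, 37 features each, as in the question
Y = rng.normal(size=(50, 8))    # 8 target values per sample

kernel = ConstantKernel(1.0) * RBF(length_scale=np.ones(37))
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(X, Y)

Y_pred = gp.predict(X[:5])
print(Y_pred.shape)             # (5, 8)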

Calculate confidence band of least-square fit

ⅰ亾dé卋堺 · submitted 2019-12-03 13:29:43
Question: I have been fighting with this for days now: how do I calculate the (95%) confidence band of a fit? Fitting curves to data is the everyday job of every physicist, so I think this should be implemented somewhere, but I can't find an implementation, nor do I know how to do this mathematically. The only thing I found is seaborn, which does a nice job for linear least squares.

import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
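For the linear case, one standard route is the textbook pointwise band; a hedged numpy/scipy sketch (the variable names are mine, and for a nonlinear fit you would instead propagate the parameter covariance returned by scipy.optimize.curve_fit):

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = np.linspace(0, 10, 30)
y = 2 + 1.5 * x + rng.normal(scale=2.0, size=x.size)

b, a = np.polyfit(x, y, 1)                 # slope, intercept
n = x.size
resid = y - (a + b * x)
s = np.sqrt(np.sum(resid**2) / (n - 2))    # residual standard error
t = stats.t.ppf(0.975, n - 2)              # two-sided 95% t quantile

x_grid = np.linspace(x.min(), x.max(), 200)
# Pointwise 95% confidence band for the mean response:
# half-width = t * s * sqrt(1/n + (x0 - xbar)^2 / Sxx)
half = t * s * np.sqrt(1.0 / n + (x_grid - x.mean())**2 / np.sum((x - x.mean())**2))
band_lo = (a + b * x_grid) - half
band_hi = (a + b * x_grid) + half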

Panel data regression: Robust standard errors

安稳与你 · submitted 2019-12-03 13:27:45
Question: My problem is this: I get NA where I should get values when computing robust standard errors. I am trying to run a fixed-effects panel regression with cluster-robust standard errors. For this, I follow Arai (2011), who on p. 3 follows Stock and Watson (2006) (later published in Econometrica, for those who have access). I would like to correct the degrees of freedom by the factor (M/(M-1))*((N-1)/(N-K)) against downward bias, as my number of clusters is finite and my data are unbalanced. Similar
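The correction factor itself is just a scalar applied to the cluster-robust variance matrix; a small Python sketch of that arithmetic (the estimation in the question is done in R, and the values of M, N, K below are illustrative):

def small_sample_correction(M, N, K):
    # (M/(M-1)) * ((N-1)/(N-K)): M clusters, N observations, K regressors.
    return (M / (M - 1.0)) * ((N - 1.0) / (N - K))

c = small_sample_correction(50, 1000, 10)
print(c)                            # ~1.0297
# vcov_corrected = c * vcov_cluster  (scale the robust variance matrix)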

How does plot.lm() determine outliers for residual vs fitted plot?

浪尽此生 · submitted 2019-12-03 13:01:41
How does plot.lm() determine which points are outliers (that is, which points to label) in the residuals-vs-fitted plot? The only thing I found in the documentation is this:

Details: sub.caption (by default the function call) is shown as a subtitle (under the x-axis title) on each plot when plots are on separate pages, or as a subtitle in the outer margin (if any) when there are multiple plots per page. The 'Scale-Location' plot, also called 'Spread-Location' or 'S-L' plot, takes the square root of the absolute residuals in order to diminish skewness (sqrt(|E|) is much less skewed than |E| for Gaussian zero-mean E).
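On the question itself: plot.lm() labels the id.n most extreme points (3 by default), and in the residuals-vs-fitted panel "most extreme" means the largest absolute residuals, so the labels are not the result of any formal outlier test. A hedged matplotlib sketch of that labelling rule on toy data:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
x = np.linspace(0, 1, 50)
y = 2 * x + rng.normal(scale=0.2, size=x.size)
y[7] += 1.5                                  # plant one obvious outlier

b, a = np.polyfit(x, y, 1)
fitted = a + b * x
resid = y - fitted

id_n = 3                                     # plot.lm's default id.n
extreme = np.argsort(np.abs(resid))[-id_n:]  # indices of the largest |residuals|

plt.scatter(fitted, resid)
plt.axhline(0, linestyle="--")
for i in extreme:
    plt.annotate(str(i + 1), (fitted[i], resid[i]))  # 1-based labels, R style
plt.xlabel("Fitted values"); plt.ylabel("Residuals")
plt.show()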

Neural Network Ordinal Classification for Age

旧时模样 · submitted 2019-12-03 12:54:48
I have created a simple neural network (Python, Theano) to estimate a person's age based on their spending history at a selection of different stores. Unfortunately, it is not particularly accurate. The accuracy might be hurt by the fact that the network has no knowledge of ordinality: for the network, there is no relationship between the age classifications. It currently selects the age with the highest probability from the softmax output layer. I have considered changing the output classification to an average of the weighted probability for each age. E.g., given age probabilities: (Age
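The weighted-average idea described in the question is just the expected value of age under the softmax distribution; a minimal numpy sketch with hypothetical age bins and probabilities:

import numpy as np

ages = np.array([18, 25, 35, 45, 55, 65])                 # hypothetical class centres
probs = np.array([0.05, 0.30, 0.40, 0.15, 0.07, 0.03])    # softmax output, sums to 1

argmax_age = ages[np.argmax(probs)]        # current approach: most probable class (35)
expected_age = float(np.dot(ages, probs))  # proposed: probability-weighted mean (~34.95)
print(argmax_age, expected_age)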

loess predict with new x values

戏子无情 · submitted 2019-12-03 12:12:36
Question: I am attempting to understand how the predict.loess function is able to compute new predicted values (y_hat) at points x that do not exist in the original data. For example (this is a simple example, and I realise loess is obviously not needed here, but it illustrates the point):

x <- 1:10
y <- x^2
mdl <- loess(y ~ x)
predict(mdl, 1.5)
[1] 2.25

loess regression works by fitting polynomials locally at each x, and thus it creates a predicted y_hat at each x. However, because there
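Conceptually, predict.loess re-fits a weighted local polynomial around the new point; a simplified, hedged numpy illustration of that idea (real loess picks the neighbourhood from the span and handles several details this sketch ignores):

import numpy as np

def loess_predict_at(x, y, x0, frac=0.75, degree=2):
    # Take the frac*n nearest points to x0, weight them with the
    # tricube kernel, fit a weighted polynomial, evaluate it at x0.
    k = max(degree + 1, int(np.ceil(frac * len(x))))
    d = np.abs(x - x0)
    idx = np.argsort(d)[:k]                  # k nearest neighbours of x0
    h = d[idx].max()                         # bandwidth = furthest neighbour
    w = (1 - (d[idx] / h) ** 3) ** 3         # tricube weights
    coefs = np.polyfit(x[idx], y[idx], degree, w=np.sqrt(w))
    return np.polyval(coefs, x0)

x = np.arange(1, 11, dtype=float)
y = x ** 2
print(loess_predict_at(x, y, 1.5))           # 2.25, matching predict(mdl, 1.5)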

Loss suddenly increases with Adam Optimizer in Tensorflow

 ̄綄美尐妖づ · submitted 2019-12-03 11:15:30
I am using a CNN for a regression task. I use TensorFlow, and the optimizer is Adam. The network seems to converge perfectly well until, at one point, the loss suddenly increases along with the validation error. Here are the loss plots for the labels and the weights, shown separately (the optimizer is run on their sum). I use L2 loss for weight regularization and also for the labels. I apply some randomness to the training data. I am currently trying RMSProp to see if the behaviour changes, but it takes at least 8 h to reproduce the error. I would like to understand how this can happen. I hope you can help.
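Not an answer from the thread, but one commonly discussed cause of such spikes is Adam's epsilon: when the second-moment estimate for some weights decays toward zero, the effective step size can blow up, and raising epsilon or clipping gradients is a frequently tried mitigation. A hedged TF1-era sketch, where the tiny linear model is only a placeholder for the questioner's CNN:

import tensorflow as tf

# Stand-in model; `loss` is whatever the real network minimises.
x = tf.placeholder(tf.float32, [None, 10])
w = tf.Variable(tf.zeros([10, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))

# Raise epsilon from its 1e-8 default to damp huge effective steps.
optimizer = tf.train.AdamOptimizer(learning_rate=1e-4, epsilon=1e-4)

# Alternative mitigation: clip gradients before applying them.
grads_and_vars = optimizer.compute_gradients(loss)
clipped = [(tf.clip_by_value(g, -1.0, 1.0), v)
           for g, v in grads_and_vars if g is not None]
train_op = optimizer.apply_gradients(clipped)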