Jittering effects on LOESS

Question

I previously asked a question about LOESS errors and warnings: "LOESS warnings/errors related to span in R". The issue concerned warnings like these, which occurred while running a LOESS regression on my data set.

Warning messages:
1: In simpleLoess(y, x, w, span, degree = degree, parametric = parametric,  :
  pseudoinverse used at -2703.9
2: In simpleLoess(y, x, w, span, degree = degree, parametric = parametric,  :
  neighborhood radius 796.09
3: In simpleLoess(y, x, w, span, degree = degree, parametric = parametric,  :
  reciprocal condition number  0
4: In simpleLoess(y, x, w, span, degree = degree, parametric = parametric,  :
  There are other near singularities as well. 6.1623e+005

That question was answered, and the recommendation was to add some jitter so that the loess algorithm no longer runs into numerical difficulties caused by a few x-axis values being repeated a relatively large number of times.
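To make the repeated x values concrete, here is a minimal sketch that tabulates Period for the sample data given further down in this question and sets the counts next to the default span. It assumes loess() is called with its defaults (span = 0.75, degree = 2), as in the code in this question. The data frame is rebuilt inline with read.table() instead of being read from the CSV, and the variable names are only illustrative.

```r
# Sample data copied from the question (Total2 is not used by the fit).
Analysis <- read.table(header = TRUE, text = "
Period Value Total1
-2950 0.104938272 32.4
-2715 0.054347826 46
-2715 0.128378378 37
-2715 0.188679245 39.75
-3500 0.245014245 39
-3500 0.163120567 105.75
-3500 0.086956522 28.75
-4350 0.171038825 31.76666667
-3650 0.143798024 30.36666667
-4350 0.235588972 26.6
-3500 0.228840125 79.75
-4933 0.154931973 70
-4350 0.021428571 35
-3500 0.0625 28
-2715 0.160714286 28
-2715 0.110047847 52.25
-3500 0.176923077 32.5
-3500 0.226277372 34.25
-2715 0.132625995 188.5
")

# Number of observations at each distinct Period value.
counts <- table(Analysis$Period)
print(counts)

# With the default span of 0.75, each local fit draws on roughly 75% of
# the rows (the exact neighbourhood size is an approximation here).
n_total  <- nrow(Analysis)
n_local  <- round(0.75 * n_total)
n_unique <- length(counts)

cat("rows:", n_total,
    " rows per local fit (approx.):", n_local,
    " distinct Period values:", n_unique, "\n")
```

In this sample, 13 of the 19 rows sit at just two Period values (-3500 and -2715), so a local quadratic may see very few distinct x positions inside its neighbourhood. That is plausibly the situation the warnings point to.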

Jittering the data avoided the previous errors and warnings when running the LOESS regression, but the resulting line was quite different at one point. No matter how small the jitter, the results still differed from the non-jittered results.

Here is an example of a data set that is having issues:

Period  Value        Total1       Total2
-2950   0.104938272  32.4         3.4
-2715   0.054347826  46           2.5
-2715   0.128378378  37           4.75
-2715   0.188679245  39.75        7.5
-3500   0.245014245  39           9.555555556
-3500   0.163120567  105.75       17.25
-3500   0.086956522  28.75        2.5
-4350   0.171038825  31.76666667  5.433333333
-3650   0.143798024  30.36666667  4.366666667
-4350   0.235588972  26.6         6.266666667
-3500   0.228840125  79.75        18.25
-4933   0.154931973  70           10.8452381
-4350   0.021428571  35           0.75
-3500   0.0625       28           1.75
-2715   0.160714286  28           4.5
-2715   0.110047847  52.25        5.75
-3500   0.176923077  32.5         5.75
-3500   0.226277372  34.25        7.75
-2715   0.132625995  188.5        25


Here is the code I am using:

Analysis <- read.csv(file.choose(), header = T)
plot(Value ~ Period, Analysis)
a <- order(Analysis$Period)
Analysis.lo <- loess(Value ~ Period, Analysis, weights = Total1)
pred <- predict(Analysis.lo, se = TRUE)
lines(Analysis$Period[a], pred$fit[a], col="red", lwd=3)
lines(Analysis$Period[a], pred$fit[a] - qt(0.975, pred$df)*pred$se[a],lty=2)
lines(Analysis$Period[a], pred$fit[a] + qt(0.975,pred$df)*pred$se[a],lty=2)

First image is without jittering

In this next graph, I plotted the fit to the original data in blue and the fit to a jittered version, using the default jitter factor, in red. This makes me wonder which regression line is more valid. The non-jittered line looks like the better fit to the eye, but the fact that changing the jittering factor has so little effect on the regression line makes me think that something is significantly different between how the jittered and non-jittered regressions are run. I am trying to figure out exactly what is going on here.

Analysis <- read.csv(file.choose(), header = T)
table(Analysis$Period)
Analysis$Period <- jitter(Analysis$Period)
plot(Value ~ Period, Analysis)
a <- order(Analysis$Period)
Analysis.lo <- loess(Value ~ Period, Analysis, weights = Total1)
pred <- predict(Analysis.lo, se = TRUE)
lines(Analysis$Period[a], pred$fit[a], col="red", lwd=2)
lines(Analysis$Period[a], pred$fit[a] - qt(0.975, pred$df)*pred$se[a],lty=2)
lines(Analysis$Period[a], pred$fit[a] + qt(0.975,pred$df)*pred$se[a],lty=2)

Jittered with defaults

Following the example answer by Hack-R exactly results in this:

Analysis <- read.csv(file.choose(), header = T)
plot(Value ~ Period, Analysis)
a               <- order(Analysis$Period)
no_jitter       <- Analysis$Period
Analysis$Period <- jitter(Analysis$Period)
Analysis.lo     <- loess(Value ~ Period, Analysis, weights = Total1)
pred            <- predict(Analysis.lo, se = TRUE)
lines(Analysis$Period[a], pred$fit[a], col="red", lwd=3)
lines(no_jitter[a], pred$fit[a], col="blue", lwd=3)
lines(Analysis$Period[a], pred$fit[a] - qt(0.975, pred$df)*pred$se[a],lty=2)
lines(Analysis$Period[a], pred$fit[a] + qt(0.975,pred$df)*pred$se[a],lty=2)

Removing the non-jittered line from the same code, though, results in this:

Analysis <- read.csv(file.choose(), header = T)
plot(Value ~ Period, Analysis)
a               <- order(Analysis$Period)
Analysis$Period <- jitter(Analysis$Period)
Analysis.lo     <- loess(Value ~ Period, Analysis, weights = Total1)
pred            <- predict(Analysis.lo, se = TRUE)
lines(Analysis$Period[a], pred$fit[a], col="red", lwd=3)
lines(Analysis$Period[a], pred$fit[a] - qt(0.975, pred$df)*pred$se[a],lty=2)
lines(Analysis$Period[a], pred$fit[a] + qt(0.975,pred$df)*pred$se[a],lty=2)

Running Hack-R's sample code, I noticed that the non-jittered version didn't produce the errors/warnings I had with my initial code. I ran the sample code by Hack-R and then added my original code to get these results.

Hack-R code with original code added at the end:

Analysis <- read.csv(file.choose(), header = T)
plot(Value ~ Period, Analysis)
a               <- order(Analysis$Period)
no_jitter       <- Analysis$Period
Analysis$Period <- jitter(Analysis$Period)
Analysis.lo     <- loess(Value ~ Period, Analysis, weights = Total1)
pred            <- predict(Analysis.lo, se = TRUE)
lines(Analysis$Period[a], pred$fit[a], col="red", lwd=3)
lines(no_jitter[a], pred$fit[a], col="blue", lwd=3)
lines(Analysis$Period[a], pred$fit[a] - qt(0.975, pred$df)*pred$se[a],lty=2)
lines(Analysis$Period[a], pred$fit[a] + qt(0.975,pred$df)*pred$se[a],lty=2)
Analysis2 <- read.csv(file.choose(), header = T)
points(Value ~ Period, Analysis2)
b <- order(Analysis2$Period)
Analysis2.lo <- loess(Value ~ Period, Analysis2, weights = Total1)
pred2 <- predict(Analysis2.lo, se = TRUE)
# use the second fit (pred2) and its own ordering (b) throughout
lines(Analysis2$Period[b], pred2$fit[b], col="orange", lwd=3)
lines(Analysis2$Period[b], pred2$fit[b] - qt(0.975, pred2$df)*pred2$se.fit[b], lty=3)
lines(Analysis2$Period[b], pred2$fit[b] + qt(0.975, pred2$df)*pred2$se.fit[b], lty=3)

I am still at a bit of a loss as to where things are going awry, but I suspect that the provided jittered-versus-non-jittered solution is not actually comparing a fit to a jittered sample with a fit to the original data.

Thanks for the help.

UPDATE

Looking over the jittered and non-jittered code, I noticed that only one LOESS model was fitted, and only one set of predicted values along the LOESS line was computed. Both appear to reference the original values. If that were entirely the case, though, I don't see why there wouldn't be the same warnings as in the original regression. To make clear what is being done line by line, I have listed the code below with my own comments on what I believe is going on. I'm sure I am missing something here, though.

#define "Analysis" as the CSV file
Analysis <- read.csv(file.choose(), header = T)

#plot initial points
plot(Value ~ Period, Analysis)

#order points
a               <- order(Analysis$Period)

#define the period values from "Analysis" without any alterations and define as "no_jitter"
no_jitter       <- Analysis$Period

#create jittered values for the period values from "Analysis" and define them as Analysis$Period
Analysis$Period <- jitter(Analysis$Period)

#define the LOESS (for the original data set) 
Analysis.lo     <- loess(Value ~ Period, Analysis, weights = Total1)

#predict values along LOESS curve (for the original data set)
pred            <- predict(Analysis.lo, se = TRUE)

#plot loess line for jittered values (but the pred function is referencing [a] which is the ordered Period values before they were jittered)
lines(Analysis$Period[a], pred$fit[a], col="red", lwd=3)

#plot loess line for non-jittered values (which are still referencing the original values ordered in [a])
lines(no_jitter[a], pred$fit[a], col="blue", lwd=3)

#confidence intervals for jittered values (same referencing issues as above)
lines(Analysis$Period[a], pred$fit[a] - qt(0.975, pred$df)*pred$se[a],lty=2)
lines(Analysis$Period[a], pred$fit[a] + qt(0.975,pred$df)*pred$se[a],lty=2)
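For a like-for-like comparison, one option is to fit two separate models, one on the original Period values and one on a jittered copy, and evaluate both on the same grid. The sketch below is only an illustration under assumptions: fit_loess() is a hypothetical helper written for this example, the seed is arbitrary, the sample data is rebuilt inline, and the grid is restricted to the overlap of both x ranges because predict() on a default loess fit returns NA outside the fitted range.

```r
# Sample data copied from the question (Total2 is not used by the fit).
Analysis <- read.table(header = TRUE, text = "
Period Value Total1
-2950 0.104938272 32.4
-2715 0.054347826 46
-2715 0.128378378 37
-2715 0.188679245 39.75
-3500 0.245014245 39
-3500 0.163120567 105.75
-3500 0.086956522 28.75
-4350 0.171038825 31.76666667
-3650 0.143798024 30.36666667
-4350 0.235588972 26.6
-3500 0.228840125 79.75
-4933 0.154931973 70
-4350 0.021428571 35
-3500 0.0625 28
-2715 0.160714286 28
-2715 0.110047847 52.25
-3500 0.176923077 32.5
-3500 0.226277372 34.25
-2715 0.132625995 188.5
")

# Hypothetical helper: fit the weighted loess and collect any warnings
# instead of printing them, so the two fits can be compared side by side.
fit_loess <- function(data) {
  msgs <- character(0)
  fit <- withCallingHandlers(
    loess(Value ~ Period, data = data, weights = Total1),
    warning = function(w) {
      msgs <<- c(msgs, conditionMessage(w))
      invokeRestart("muffleWarning")
    }
  )
  list(fit = fit, warnings = msgs)
}

# Model 1: original Period values.
orig <- fit_loess(Analysis)

# Model 2: a separately fitted model on a jittered copy of the data.
set.seed(1)  # arbitrary seed, only so the jitter is reproducible
Jittered <- Analysis
Jittered$Period <- jitter(Analysis$Period)
jit <- fit_loess(Jittered)

# Evaluate both fits on one shared grid inside both x ranges.
lo <- max(min(Analysis$Period), min(Jittered$Period))
hi <- min(max(Analysis$Period), max(Jittered$Period))
grid <- data.frame(Period = seq(lo, hi, length.out = 50))
pred_orig <- predict(orig$fit, newdata = grid)
pred_jit  <- predict(jit$fit,  newdata = grid)

cat("warnings (original):", length(orig$warnings), "\n")
cat("warnings (jittered):", length(jit$warnings), "\n")
cat("largest gap between the two curves:",
    max(abs(pred_orig - pred_jit), na.rm = TRUE), "\n")

# Draw both curves only when run interactively.
if (interactive()) {
  plot(Value ~ Period, Analysis)
  lines(grid$Period, pred_orig, col = "blue", lwd = 3)
  lines(grid$Period, pred_jit,  col = "red",  lwd = 3)
}
```

Because each curve comes from its own loess() call, any difference between the blue and red lines reflects the jitter itself and not a shared set of predictions plotted against two different x vectors.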

I have run the code a number of times. By controlling the jittering factor I can generally keep the jittered regression close to the non-jittered one. Occasionally, though, even with jittered data I get some of the same warnings as before jittering. I suspect this depends on how the points happened to be jittered in that particular run: sometimes the jitter moves the points far enough apart to avoid the warnings, and sometimes it doesn't. Because the jitter is random, this is difficult to control. I will keep increasing the jittering factor to see whether there is a value at which the warnings stop without moving the points so far that the line changes significantly. I will update later.
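Since the jitter is random, one way to look at this systematically is to fix the seed for each run and count how often loess() complains at each jitter factor. This is a rough sketch, not a definitive procedure: count_problem_runs() is a hypothetical helper, the factors and the number of runs are arbitrary choices, and a run counts as a problem if loess() signals any warning or error.

```r
# Sample data copied from the question (Total2 is not used by the fit).
Analysis <- read.table(header = TRUE, text = "
Period Value Total1
-2950 0.104938272 32.4
-2715 0.054347826 46
-2715 0.128378378 37
-2715 0.188679245 39.75
-3500 0.245014245 39
-3500 0.163120567 105.75
-3500 0.086956522 28.75
-4350 0.171038825 31.76666667
-3650 0.143798024 30.36666667
-4350 0.235588972 26.6
-3500 0.228840125 79.75
-4933 0.154931973 70
-4350 0.021428571 35
-3500 0.0625 28
-2715 0.160714286 28
-2715 0.110047847 52.25
-3500 0.176923077 32.5
-3500 0.226277372 34.25
-2715 0.132625995 188.5
")

# Hypothetical helper: jitter Period n_runs times with a given factor and
# count the runs in which loess() signals a warning (or fails outright).
count_problem_runs <- function(data, factor, n_runs = 20) {
  problem <- logical(n_runs)
  for (i in seq_len(n_runs)) {
    set.seed(i)  # one fixed seed per run, so every run can be replayed
    d <- data
    d$Period <- jitter(data$Period, factor = factor)
    problem[i] <- tryCatch(
      {
        loess(Value ~ Period, data = d, weights = Total1)
        FALSE
      },
      warning = function(w) TRUE,
      error   = function(e) TRUE
    )
  }
  sum(problem)
}

factors <- c(0.5, 1, 2, 5)  # arbitrary values chosen for illustration
problem_runs <- sapply(factors, function(f) count_problem_runs(Analysis, f))
names(problem_runs) <- factors
print(problem_runs)
```

Any run that does produce a warning can be reproduced by calling set.seed() with the same seed before jitter(), which makes it possible to inspect exactly where the jittered points landed.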

Source: https://stackoverflow.com/questions/38948553/jittering-effects-on-loess

Tags

plot

loess