Question
I am trying to remove outliers in R in a very easy way. I know you can write your own functions for this, but I would like some input on this simple code and on why it does not seem to work:
outliers <- boxplot(okt$pris)$out
okt_no_out <- okt[-c(outliers),]
boxplot(okt_no_out$pris)
In the first line I create a vector with the outliers; in the second I create a new data frame that omits the values in that vector. But when I check the new data frame, only about 400 of the 750 outliers were removed.
So the vector outliers contains roughly 750 values, but this code removes only about half of them.
So, my simple question: I might be stupid, but shouldn't these simple lines of code remove the outliers in a very convenient way?
//Peter
Answer 1:
boxplot(okt$pris)$out returns the values of the outliers, not their positions. So okt[-c(outliers),] treats those values as row numbers and removes whichever rows happen to sit at those positions; some of them are outliers and others are not.
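A minimal sketch of the difference between values and positions, using a made-up data frame `toy` (a hypothetical stand-in, not the asker's data). It assumes `boxplot.stats()`, which gives the same `out` values as `boxplot()` for a single vector but does not draw a plot:

```r
# Toy data (hypothetical): row 10 holds the only outlier, with value 2.
toy <- data.frame(pris = c(10, 11, 12, 11, 10, 12, 11, 10, 11, 2))

# The outlying VALUES, not their row numbers.
out_values <- boxplot.stats(toy$pris)$out

# Negative indexing reads the value 2 as "row 2", so a normal row is
# dropped and the real outlier in row 10 stays in the data.
wrong <- toy[-out_values, , drop = FALSE]

# which() converts the value match into row POSITIONS.
out_rows <- which(toy$pris %in% out_values)
right <- toy[-out_rows, , drop = FALSE]
```

Here `wrong` still contains the outlier, while `right` does not. Note that `toy[-out_rows, ]` only behaves well when `out_rows` is non-empty; with no outliers, `-integer(0)` selects zero rows.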
What you can do is use the output from the boxplot's stats information to retrieve the end of the upper and lower whiskers and then filter your dataset using those values. See the example below:
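# test data
testdata <- iris$Sepal.Width
# return the boxplot object (this also draws the plot)
b <- boxplot(testdata)
# the ends of the whiskers are the first and fifth entries of b$stats
lowerwhisker <- b$stats[1]
upperwhisker <- b$stats[5]
# remove the extremes; the whisker ends are themselves non-outlying
# data points, so keep them by using >= and <=
testdata <- testdata[testdata >= lowerwhisker & testdata <= upperwhisker]
# replot
b <- boxplot(testdata)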
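The same idea can be applied to whole rows of a data frame, which is what the question asks for. The sketch below is only an illustration: the `okt` data frame and its numbers are invented here, and it uses `boxplot.stats()` with `%in%` instead of the whisker comparison above.

```r
# Hypothetical stand-in for the asker's okt data frame.
okt <- data.frame(
  id   = 1:12,
  pris = c(100, 105, 98, 102, 101, 99, 103, 104, 97, 100, 500, 5)
)

# Outlying values according to the boxplot rule (1.5 times the hinge spread).
out_values <- boxplot.stats(okt$pris)$out

# Keep only the rows whose pris is NOT one of the outlying values.
# A logical filter is safe even when out_values is empty.
okt_no_out <- okt[!okt$pris %in% out_values, ]
```

Plotting `okt_no_out$pris` afterwards may still show a few points beyond the whiskers, because the quartiles are recomputed on the reduced data.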
Source: https://stackoverflow.com/questions/53201016/remove-outliers-in-r-very-easy