问题
I am a newbie to R. I have a question. For checking the outlier of a variable we generally use:
boxplot(train$rate)
Suppose, the rate is the variable of my datasets and train is my data sets name. But when I have multiple variables like 100 or 150 variables, then it will be very time consuming to check one by one variable's outlier. Is there any function to bring the 100 variables' outlier in one boxplot?
If yes, then which function is used to remove those variable's outlier at one time instead of one by one? Please help to solve this problem.
Thanks in advance
回答1:
I agree with Rui Barradas that it is bad practice to remove outliers without further thought. As long as the value is valid you should keep it in your data or at least run two separate analyses with and without the influential value. You could use a for loop to apply a function to every variable in your dataset.
train2<-train # Copy old dataset
outvalue<-list() # Create two empty lists
outindex<-list()
for(i in 1:ncol(train2){ # For every column in your dataset
outvalue[[i]]<-boxplot(train2[,i])$out # Plot and get the outlier value
outindex[[i]]<-which(train2[,i] == outvalue[[i]]) # Get the outlier index
train2[outindex[[i]],i] <- NA # Remove the outliers
}
This works and plots the data, but it is quite slow. If you don't want to plot the data but just want the outliers you could look into other outlier functions, the extremevalues
package has a function that takes a different approach to identifying outliers and doesn't require a plot.
This uses the getOutliers
function from the extremevalues
package
outRight<-list()
outLeft<-outRight
for(i in 1:ncol(train2){
outRight[[i]]<-getOutliers(train2[,i])$iRight
outLeft[[i]]<-getOutliers(train2[,i])$iLeft
train2[outRight[[i]],i] <- NA
train2[outLeft[[i]],i] <- NA
}
回答2:
The function boxplot
returns a value. If you see the Value
section of its help page you'll see that it's a list with named components, one of which is out
. That's the one you seem to be looking for.
bp <- boxplot(train$rate)
bp$out
clean <- train$rate[-which(train$rate %in% bp$out)] # to remove the outliers
I also would not do that. Outliers are data, and normal/likely to occur. By eliminating them you are not taking into account the entirety of your data, a bad practice.
来源:https://stackoverflow.com/questions/45163254/how-can-i-see-multiple-variables-outlier-in-one-boxplot-using-r