How can I see the outliers of multiple variables in one boxplot using R?

Question

I am a newbie to R and I have a question. To check a variable for outliers, we generally use:

boxplot(train$rate)

Here, rate is a variable in my dataset and train is the name of the dataset. When I have many variables, say 100 or 150, checking each variable's outliers one by one is very time-consuming. Is there a function that shows the outliers of all 100 variables in one boxplot?

If so, which function can remove the outliers of all those variables at once instead of one by one? Please help me solve this problem.

Thanks in advance

Answer 1:

I agree with Rui Barradas that it is bad practice to remove outliers without further thought. As long as a value is valid, you should keep it in your data, or at least run two separate analyses, with and without the influential value. You could use a for loop to apply a function to every variable in your dataset.

train2 <- train      # Copy the old dataset (assumes all columns are numeric)
outvalue <- list()   # Create two empty lists
outindex <- list()
for (i in seq_len(ncol(train2))) {   # For every column in your dataset
  outvalue[[i]] <- boxplot(train2[, i])$out                # Plot and get the outlier values
  outindex[[i]] <- which(train2[, i] %in% outvalue[[i]])   # Get the outlier indices; %in% handles several outliers
  train2[outindex[[i]], i] <- NA                           # Replace the outliers with NA
}

This works and plots the data, but it is quite slow. If you don't want to plot the data and only want the outliers, you could look into other outlier functions. The extremevalues package has a function, getOutliers, that takes a different approach to identifying outliers and doesn't require a plot. The following uses it:

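library(extremevalues)   # Provides getOutliers()
outRight <- list()
outLeft <- outRight
for (i in seq_len(ncol(train2))) {
  res <- getOutliers(train2[, i])   # Call once per column and reuse the result
  outRight[[i]] <- res$iRight       # Indices of outliers on the right (high) side
  outLeft[[i]] <- res$iLeft         # Indices of outliers on the left (low) side
  train2[outRight[[i]], i] <- NA
  train2[outLeft[[i]], i] <- NA
}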
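
If you want to keep the boxplot rule but skip the plotting, base R's boxplot.stats() may be enough, since boxplot() uses it internally (whiskers at 1.5 times the box length by default). The following is only a minimal sketch: the small train data frame and the names num_cols, out_values and train_na are made up for illustration and are not part of the question.

```r
# Hypothetical example data standing in for `train`
set.seed(1)
train <- data.frame(rate = c(rnorm(50), 10),
                    age  = c(rnorm(50, 40, 5), 120))

# Work only on the numeric columns
num_cols <- names(train)[sapply(train, is.numeric)]

# boxplot.stats() applies the same rule as boxplot(), without drawing anything
out_values <- lapply(train[num_cols], function(x) boxplot.stats(x)$out)

# Copy the data and replace the flagged values with NA, column by column
train_na <- train
for (col in num_cols) {
  train_na[[col]][train[[col]] %in% out_values[[col]]] <- NA
}

# Number of flagged values per column
sapply(out_values, length)
```

Keeping the original train untouched and writing the NAs into a copy makes it easy to run an analysis both with and without the flagged values.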

Answer 2:

The function boxplot returns a value. If you look at the Value section of its help page, you'll see that it's a list with named components, one of which is out. That's the one you seem to be looking for.

bp <- boxplot(train$rate)
bp$out                                        # the outlier values
clean <- train$rate[!train$rate %in% bp$out]  # to remove the outliers (also safe when there are none)

That said, I would not do that. Outliers are data, and they are normal, even likely, to occur. By eliminating them you no longer take the entirety of your data into account, which is bad practice.
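
For the "many variables in one boxplot" part of the question, boxplot() also accepts a data frame and then draws one box per column. In that case the returned list carries a group component alongside out, giving the box each outlier belongs to. A minimal sketch, assuming a made-up train data frame (the names num and outliers_by_var are hypothetical as well):

```r
# Hypothetical example data standing in for `train`
set.seed(1)
train <- data.frame(rate   = c(rnorm(50), 10),
                    age    = c(rnorm(50, 40, 5), 120),
                    income = c(rnorm(50, 100, 20), 500))

# Keep only the numeric columns
num <- train[sapply(train, is.numeric)]

# One call, one box per column; las = 2 turns the axis labels sideways
bp <- boxplot(num, las = 2)

# bp$out holds every outlier value, bp$group the index of the box it came from
outliers_by_var <- split(bp$out, factor(bp$names[bp$group], levels = bp$names))
outliers_by_var
```

With 100 or more columns on very different scales, a single axis quickly becomes hard to read. Plotting scale(num) instead, or a few groups of columns at a time, may help, but the values in bp$out are then on the standardised scale.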

Source: https://stackoverflow.com/questions/45163254/how-can-i-see-multiple-variables-outlier-in-one-boxplot-using-r

Tags

linear-regression

boxplot