Question
I am an R novice. I want to do some outlier cleaning and overall scaling from 0 to 1 before putting the sample into a random forest.
g<-c(1000,60,50,60,50,40,50,60,70,60,40,70,50,60,50,70,10)
If I do a simple scaling from 0 to 1, the result would be:
> round((g - min(g))/abs(max(g) - min(g)),1)
[1] 1.0 0.1 0.0 0.1 0.0 0.0 0.0 0.1 0.1 0.1 0.0 0.1 0.0 0.1 0.0 0.1 0.0
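So my idea is to replace the values of each column that are greater than the 0.95-quantile with the largest value that does not exceed the 0.95-quantile, and likewise to replace the values below the 0.05-quantile with the smallest value that is not below it.
So the pre-scaled result would be (the first value, 1000, becomes 70 and the last value, 10, becomes 40):
g<-c(70,60,50,60,50,40,50,60,70,60,40,70,50,60,50,70,40)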
and scaled:
> round((g - min(g))/abs(max(g) - min(g)),1)
[1] 1.0 0.7 0.3 0.7 0.3 0.0 0.3 0.7 1.0 0.7 0.0 1.0 0.3 0.7 0.3 1.0 0.0
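To see why only 1000 and 10 get replaced here, the cutoffs can be inspected directly. This is a small sketch that assumes R's default quantile definition (`type = 7`), which interpolates between neighbouring sorted values; other `type` settings would give different cutoffs.

```r
g <- c(1000, 60, 50, 60, 50, 40, 50, 60, 70, 60, 40, 70, 50, 60, 50, 70, 10)

# Default quantile() uses type = 7 (linear interpolation between order statistics)
cutoffs <- quantile(g, c(0.05, 0.95))
cutoffs
# 5% -> 34, 95% -> 256

# Values strictly outside the cutoffs; these are the ones that would be replaced
outside <- g[g < cutoffs[1] | g > cutoffs[2]]
outside
# 1000 10
```

With 17 observations, each tail contains exactly one value, so the replacement touches one value at each end regardless of whether it is a genuine outlier.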
I need this formula for a whole dataframe, so the functional implementation within R should be something like:
> apply(c, 2, function(x) x[x<quantile(x, 0.95)]<-max(x[x, ... max without the quantile(x, 0.95))
Can anyone help?
As an aside: if there is a function that does this job directly, please let me know. I already checked out cut and cut2. cut fails because of non-unique breaks; cut2 would work, but it only gives back string values or the mean value, and I need a numeric vector from 0 to 1.
For testing:
a<-c(100,6,5,6,5,4,5,6,7,6,4,7,5,6,5,7,1)
b<-c(1000,60,50,60,50,40,50,60,70,60,40,70,50,60,50,70,10)
c<-cbind(a,b)
c<-as.data.frame(c)
Regards and thanks for help,
Rainer
Answer 1:
Please don't do this. This is not a good strategy for dealing with outliers - particularly since it's unlikely that 10% of your data are outliers!
Answer 2:
I can't think of a function in R that does this, but you can define a small one yourself:
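foo <- function(x)
{
    # 5% and 95% quantiles of the column
    quant <- quantile(x,c(0.05,0.95))
    # replace values below the lower cutoff with the smallest value at or above it
    x[x < quant[1]] <- min(x[x >= quant[1]])
    # replace values above the upper cutoff with the largest value at or below it
    x[x > quant[2]] <- max(x[x <= quant[2]])
    # rescale to the range 0 to 1 and round to one decimal
    return(round((x - min(x))/abs(max(x) - min(x)),1))
}
Then sapply this to each variable in your dataframe: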
sapply(c,foo)
a b
[1,] 1.0 1.0
[2,] 0.7 0.7
[3,] 0.3 0.3
[4,] 0.7 0.7
[5,] 0.3 0.3
[6,] 0.0 0.0
[7,] 0.3 0.3
[8,] 0.7 0.7
[9,] 1.0 1.0
[10,] 0.7 0.7
[11,] 0.0 0.0
[12,] 1.0 1.0
[13,] 0.3 0.3
[14,] 0.7 0.7
[15,] 0.3 0.3
[16,] 1.0 1.0
[17,] 0.0 0.0
Edit: This answer was meant to solve the programming problem. As for actually using it, I fully agree with Hadley.
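A possible variant of the function above, offered only as a sketch: it splits the clipping and the scaling into two helpers, tolerates NA values, avoids dividing by zero for constant columns, and returns a data frame instead of a matrix. The names `clip_to_quantiles`, `rescale01` and `dat` are hypothetical and not part of the original answer; the sketch also assumes at least one non-missing value lies between the two cutoffs.

```r
# Hypothetical helper: replace values outside the quantile cutoffs with the
# nearest value that lies inside them (same idea as foo, NA-tolerant).
clip_to_quantiles <- function(x, probs = c(0.05, 0.95)) {
  q <- quantile(x, probs, na.rm = TRUE, names = FALSE)
  low  <- !is.na(x) & x < q[1]
  high <- !is.na(x) & x > q[2]
  keep <- !is.na(x) & !low & !high
  # Assumption: keep selects at least one value
  x[low]  <- min(x[keep])
  x[high] <- max(x[keep])
  x
}

# Hypothetical helper: min-max scaling to the range 0 to 1, without rounding.
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  # Constant column: return zeros (NA stays NA) instead of dividing by zero
  if (rng[1] == rng[2]) return(x * 0)
  (x - rng[1]) / (rng[2] - rng[1])
}

dat <- data.frame(
  a = c(100, 6, 5, 6, 5, 4, 5, 6, 7, 6, 4, 7, 5, 6, 5, 7, 1),
  b = c(1000, 60, 50, 60, 50, 40, 50, 60, 70, 60, 40, 70, 50, 60, 50, 70, 10)
)

# lapply keeps one result per column; as.data.frame restores the data frame
scaled <- as.data.frame(lapply(dat, function(x) rescale01(clip_to_quantiles(x))))
round(scaled, 1)
```

Rounding is left to the final display step here so that the scaled values keep their full precision for later modelling.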
Source: https://stackoverflow.com/questions/5281883/r-outlier-cleaning-for-each-column-in-a-dataframe-by-using-quantiles-0-05-and-0