Normalisation of two-column data using min and max values

Submitted by 谁说胖子不能爱 on 2019-12-02 06:12:55

Question


I am trying to find R code to normalise my values, using the min and max of each column, for two columns of a matrix.

My matrix looks like this: columns one and two (C1 and C2) hold IDs and are not to be calculated. C3 has a heading in row 1, then 407 numbers and NAs. C4 has a heading in row 1, then numbers and NAs.

I was thinking of something like:

Let x be a value in C3, with the min value and max value taken over that same column:

If(x="","NA",(x-Min value)/(Max value-Min value))

This would give a column with values from 0 to 1. The same should be done for column 4 (would that be y, or is this confusing for R?)

I am not skilled enough in programming or in R to write this code. Is there a specific function for this, or can anyone help me write one?


Answer 1:


Given some example data along the lines you describe

set.seed(1)
d <- data.frame(C1 = LETTERS[1:4], C2 = letters[1:4],
                C3 = runif(4, min = 0, max = 10),
                C4 = runif(4, min = 0, max = 10))
d

then we can write a simple function to do the normalisation you describe

normalise <- function(x, na.rm = TRUE) {
    ranx <- range(x, na.rm = na.rm)
    (x - ranx[1]) / diff(ranx)
}
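
One case the function above does not cover is a column whose non-NA values are all identical: diff(ranx) is then 0 and the division yields NaN. The following is only a sketch of one way to guard against that. The name safe_normalise and the choice to return 0 for a constant column are assumptions for illustration, not part of the original answer.

```r
# Hypothetical variant of normalise(); the name and the zero-range
# behaviour are assumptions, not part of the original answer.
safe_normalise <- function(x, na.rm = TRUE) {
    # All-NA (or empty) input: there is nothing to scale.
    if (all(is.na(x))) return(rep(NA_real_, length(x)))
    ranx <- range(x, na.rm = na.rm)
    width <- diff(ranx)
    # NA width means na.rm = FALSE and x contains NA: return all NA,
    # which mirrors what normalise() does in that situation.
    if (is.na(width)) return(rep(NA_real_, length(x)))
    # Zero width means a constant column; mapping it to 0 is an arbitrary choice.
    if (width == 0) return(ifelse(is.na(x), NA_real_, 0))
    (x - ranx[1]) / width
}

safe_normalise(c(5, 5, NA, 5))   # constant column: 0 0 NA 0
safe_normalise(c(1, 2, NA, 3))   # ordinary column: 0.0 0.5 NA 1.0
```

Whether a constant column should map to 0, 0.5 or NA depends on what the normalised values are used for, so treat the 0 here as a placeholder.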

This can be applied to the data in a number of ways, but here I use apply():

apply(d[, 3:4], 2, normalise)

which gives

R> apply(d[, 3:4], 2, normalise)
            C3        C4
[1,] 0.0000000 0.0000000
[2,] 0.1658867 0.9377039
[3,] 0.4782093 1.0000000
[4,] 1.0000000 0.6179273
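
As one of the other ways of applying the function, lapply() can be used on the selected columns so that the result stays a data frame rather than becoming a matrix. This is a sketch that repeats the example data and normalise() so it runs on its own; the name d_scaled is made up for illustration.

```r
normalise <- function(x, na.rm = TRUE) {
    ranx <- range(x, na.rm = na.rm)
    (x - ranx[1]) / diff(ranx)
}

set.seed(1)
d <- data.frame(C1 = LETTERS[1:4], C2 = letters[1:4],
                C3 = runif(4, min = 0, max = 10),
                C4 = runif(4, min = 0, max = 10))

# lapply() visits each selected column and returns a list of vectors,
# which is assigned back column by column; C1 and C2 are left untouched.
d_scaled <- d                       # hypothetical name, keeps d intact
d_scaled[c("C3", "C4")] <- lapply(d[c("C3", "C4")], normalise)
d_scaled
```

Selecting the columns by name instead of by position (3:4) is a design choice that keeps the code working if the column order changes.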

To add these to the existing data, we could do:

d2 <- data.frame(d, apply(d[, 3:4], 2, normalise))
d2

Which gives:

R> d2
  C1 C2       C3       C4      C3.1      C4.1
1  A  a 2.655087 2.016819 0.0000000 0.0000000
2  B  b 3.721239 8.983897 0.1658867 0.9377039
3  C  c 5.728534 9.446753 0.4782093 1.0000000
4  D  d 9.082078 6.607978 1.0000000 0.6179273
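
The C3.1 and C4.1 names are generated automatically because data.frame() makes duplicate column names unique. If more descriptive names are wanted, one option is to rename the matrix columns before combining. The sketch below repeats the example data so it runs on its own; the "_norm" suffix and the name scaled are assumptions for illustration.

```r
normalise <- function(x, na.rm = TRUE) {
    ranx <- range(x, na.rm = na.rm)
    (x - ranx[1]) / diff(ranx)
}

set.seed(1)
d <- data.frame(C1 = LETTERS[1:4], C2 = letters[1:4],
                C3 = runif(4, min = 0, max = 10),
                C4 = runif(4, min = 0, max = 10))

scaled <- apply(d[, 3:4], 2, normalise)
# Rename before combining so the new columns are easy to tell apart.
colnames(scaled) <- paste0(colnames(scaled), "_norm")
d2 <- data.frame(d, scaled)
names(d2)   # "C1" "C2" "C3" "C4" "C3_norm" "C4_norm"
```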

Now you mentioned that your data include NA and we must handle that. You may have noticed that I set the na.rm argument to TRUE in the normalise() function. This means it will work even in the presence of NA:

d3 <- d
d3[c(1,3), c(3,4)] <- NA ## set some NA
d3


R> d3
  C1 C2       C3       C4
1  A  a       NA       NA
2  B  b 3.721239 8.983897
3  C  c       NA       NA
4  D  d 9.082078 6.607978

With normalise() we still get some output that is of use, using only the non-NA data:

R> apply(d3[, 3:4], 2, normalise)
     C3 C4
[1,] NA NA
[2,]  0  1
[3,] NA NA
[4,]  1  0

If we had not done this in writing normalise(), then the output would look something like this (na.rm = FALSE is the default for range() and other similar functions!)

R> apply(d3[, 3:4], 2, normalise, na.rm = FALSE)
     C3 C4
[1,] NA NA
[2,] NA NA
[3,] NA NA
[4,] NA NA



Answer 2:


This is a type of non-parametric normalisation, but I would advise you to use another method: calculate the median and interquartile range, subtract the median and divide by the IQR. This will give you a distribution with median 0 and IQR 1.

## df is assumed to be your data frame; TRUE is safer than T, which can be reassigned
m <- median(df$C3, na.rm = TRUE)
iqr <- IQR(df$C3, na.rm = TRUE)
df$C3 <- (df$C3 - m) / iqr
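
To apply the median/IQR scaling to both columns, it can be wrapped in a small function. This is a sketch only: the name robust_scale and the made-up data frame are assumptions for illustration, with a few NA values added to mimic the data described in the question.

```r
# Hypothetical helper: centre on the median and scale by the IQR.
robust_scale <- function(x, na.rm = TRUE) {
    (x - median(x, na.rm = na.rm)) / IQR(x, na.rm = na.rm)
}

# Made-up example data containing NA.
set.seed(1)
df <- data.frame(C3 = c(runif(10, min = 0, max = 10), NA),
                 C4 = c(NA, runif(10, min = 0, max = 10)))

df[c("C3", "C4")] <- lapply(df[c("C3", "C4")], robust_scale)

# After scaling, each column has median 0 and IQR 1
# (up to floating-point rounding).
sapply(df, median, na.rm = TRUE)
sapply(df, IQR, na.rm = TRUE)
```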

The method that you propose is extremely sensitive to outliers. If you really want to do it, this is how:

rng <- range(df$C3, na.rm = TRUE)
df$C3 <- (df$C3 - rng[1]) / (rng[2] - rng[1])
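
The sensitivity to outliers can be seen on a small made-up vector (the data below are an assumption for illustration). With min-max scaling the single large value takes up almost the whole 0 to 1 range, while median/IQR scaling leaves the ordinary values spread out.

```r
x <- c(1, 2, 3, 4, 5, 100)   # made-up data; 100 is the outlier

rng <- range(x, na.rm = TRUE)
minmax <- (x - rng[1]) / (rng[2] - rng[1])
robust <- (x - median(x, na.rm = TRUE)) / IQR(x, na.rm = TRUE)

round(minmax, 3)   # 0.000 0.010 0.020 0.030 0.040 1.000
round(robust, 3)   # -1.0 -0.6 -0.2 0.2 0.6 38.6
```

Under min-max scaling the five ordinary values are squeezed into roughly the bottom 4% of the range. Under median/IQR scaling they run from -1 to 0.6 and the outlier stands out as 38.6.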


Source: https://stackoverflow.com/questions/12969623/normalisation-of-a-two-column-data-using-min-and-max-values
