Question
Suppose I have the following data:
a <- data.frame(var1=letters,var2=runif(26))
I want to scale every value in var2 so that the var2 column sums to 1 (basically, turn the var2 column into a probability distribution).
I have tried the following:
a$var2 <- lapply(a$var2,function(x) (x-min(a$var2))/(max(a$var2)-min(a$var2)))
This not only gives an overall sum greater than 1 but also turns the var2 column into a list, on which I can't do operations like sum.
Is there any valid way of turning this column into a probability distribution?
Answer 1:
If x is a vector of non-negative values with no NA and a positive sum, you can normalize it with
x / sum(x)
The result is a proper probability mass function: every entry is non-negative and the entries sum to 1.
The transform you used:
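Applied to the data frame from the question, the normalization might look like the sketch below. The call to set.seed is an assumption added only to make the random data reproducible; it is not part of the original question.

```r
set.seed(1)  # assumption: fixed seed so the example is reproducible
a <- data.frame(var1 = letters, var2 = runif(26))

# Divide each value by the column total so that the column sums to 1.
a$var2 <- a$var2 / sum(a$var2)

sum(a$var2)        # 1, up to floating-point rounding
all(a$var2 >= 0)   # TRUE: every entry is a valid probability
is.numeric(a$var2) # TRUE: still a plain numeric column, not a list
```

Because runif returns strictly positive values, sum(a$var2) cannot be zero here; for other data it would be worth checking that before dividing.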
(x - min(x)) / (max(x) - min(x))
only rescales x onto [0, 1]; it does not make the values sum to 1.
Regarding your code
There is no need to use lapply here:
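A quick check of that claim, as a sketch on made-up random data (the seed and the name scaled are assumptions for illustration): the minimum maps to 0 and the maximum to 1, so with any other value strictly in between, the total has to exceed 1.

```r
set.seed(1)  # assumption: fixed seed so the example is reproducible
x <- runif(26)

# Min-max rescaling: maps min(x) to 0 and max(x) to 1.
scaled <- (x - min(x)) / (max(x) - min(x))

range(scaled)    # 0 and 1
sum(scaled) > 1  # TRUE: the maximum alone already contributes 1
```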
lapply(a$var2, function(x) (x-min(a$var2)) / (max(a$var2) - min(a$var2)))
Just use a vectorized operation:
a$var2 <- with(a, (var2 - min(var2)) / (max(var2) - min(var2)))
As you said, lapply gives you a list; that is what the "l" in "lapply" refers to. You can use unlist to collapse that list into a vector, or you can use sapply, where the "s" implies "simplification (when possible)".
Source: https://stackoverflow.com/questions/39323277/normalize-data-in-r-data-frame-column
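The difference between the three approaches can be seen in a small sketch. The vector x and the helper rescale are hypothetical names introduced only for this illustration.

```r
x <- c(0.2, 0.5, 0.9)  # hypothetical example data

# Hypothetical helper: rescales a single value against the range of x.
rescale <- function(v) (v - min(x)) / (max(x) - min(x))

as_list     <- lapply(x, rescale)  # a list with one element per value
from_unlist <- unlist(as_list)     # list collapsed into a numeric vector
from_sapply <- sapply(x, rescale)  # simplified to a numeric vector directly
vectorized  <- (x - min(x)) / (max(x) - min(x))  # no apply function needed

is.list(as_list)                    # TRUE
is.numeric(from_sapply)             # TRUE
all.equal(from_unlist, vectorized)  # TRUE
all.equal(from_sapply, vectorized)  # TRUE
```

All three routes give the same numbers; the vectorized form is simply the shortest and avoids calling a function once per element.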