Question
I've been struggling with this for a few days now. This is my third question on Stack Overflow about the same topic; I hope it is better defined this time.
My data are distributed like this: (histogram)
The x-axis corresponds to the range of probabilities: from 0 to 1.
I want to assign states from state 1 to state 10 sensibly to the probability range.
This is what I have got:
Interval <- round(quantile(datag$Probability, seq(0, 1, by = 0.10)), 3)
output:
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
0.000 0.008 0.015 0.024 0.036 0.054 0.080 0.124 0.209 0.397 1.000
Assign states from 1 to 10:
States <- data.frame(datag, State = findInterval(datag$Probability, Interval))
head(States)
Output: States
Probability State
0.20585012 8
0.21202839 9
0.07087725 6
0.7109513 10
0.9641807 10
The problem is this: as you can see above, a probability of 0.2120 already gets state 9, and anything above 0.710 gets state 10. I would be happy with something like prob = 0.2120 in state 4, prob = 0.710 in state 7, and prob = 0.96 in state 10.
So how can I assign states more uniformly?
To replicate the datag:
datag <- data.frame(Probability=rgamma(10000, shape=0.6, rate=4.8, scale=1/4.8))
EDIT: @Roman:
datag <- subset(datag, Probability<=1)
EDIT: @Simon
Yes, I'm aware of "cut":
table(cut(datag$Probability, breaks = seq(0, 0.8, by = 0.1)))
Output:
(0,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6] (0.6,0.7] (0.7,0.8]
125545 26625 12795 8126 5556 4108 3227 2606
How would one define the breaks? I am after the intervals (the breaks themselves), so that I can assign each probability the state corresponding to the interval it falls in.
Answer 1:
You've basically got the answer in your OP! Don't take this the wrong way, but I think you need to spend some more time reading the documentation for ?cut! If you set labels = FALSE in cut, you get the integer code of the interval each value falls into, rather than a factor.
# Set a seed for true reproducibility!
set.seed(1)
datag <- data.frame(Probability=rgamma(10000, shape=0.6, rate=4.8, scale=1/4.8))
Int <- cut( datag$Probability , breaks = seq(0 , 1 , by = 0.1 ) , labels = FALSE )
head( cbind( Prob = datag$Probability , Int ) )
Prob Int
[1,] 0.031860645 1
[2,] 0.455054687 5
[3,] 0.134175238 2
[4,] 0.058957301 1
[5,] 0.855493999 9
[6,] 0.009144936 1
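To make the difference between the two approaches concrete, here is a minimal sketch that puts the equal-width states from cut next to the quantile-based states from the question. The variable names (prob, state_width, state_quant) are illustrative and not from the original post, and the sample is drawn with rate only, on the assumption that the question's edit (dropping values above 1) is applied.

```r
# Sketch: equal-width vs. quantile-based state assignment (names are illustrative).
set.seed(1)
prob <- rgamma(10000, shape = 0.6, rate = 4.8)  # only `rate` is given
prob <- prob[prob <= 1]                          # keep values in [0, 1], as in the question's edit

# Equal-width breaks: 11 break points define 10 intervals.
breaks <- seq(0, 1, by = 0.1)

# labels = FALSE returns the interval index (1..10) instead of a factor.
# include.lowest = TRUE closes the first interval at 0, so an exact 0 is not NA.
state_width <- cut(prob, breaks = breaks, labels = FALSE, include.lowest = TRUE)

# Quantile-based states, as in the question, for comparison.
# rightmost.closed = TRUE keeps the maximum value in state 10 instead of 11.
q_breaks <- quantile(prob, seq(0, 1, by = 0.1))
state_quant <- findInterval(prob, q_breaks, rightmost.closed = TRUE)

table(state_width)   # counts concentrate in the lowest states
table(state_quant)   # roughly equal counts per state
```

Without the subsetting step, any value above 1 would fall outside the breaks and cut would return NA for it.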
Answer 2:
I ran
datag <- data.frame(Probability=rgamma(10000, shape=0.6, rate=4.8, scale=1/4.8))
datag <- subset(datag, Probability<=1)
The first line gives a warning, because rgamma is given both rate and scale; apparently you ignored it. And if these are supposed to be probabilities, the second step shouldn't be needed. But onward.
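As a small sketch of the point about the warning: rgamma accepts either rate or scale, and with the same seed both forms should give the same draws, so one of the two arguments can simply be dropped. The names x_rate and x_scale are illustrative.

```r
# Sketch: specify rate or scale, not both (names are illustrative).
set.seed(1)
x_rate <- rgamma(10000, shape = 0.6, rate = 4.8)

set.seed(1)
x_scale <- rgamma(10000, shape = 0.6, scale = 1 / 4.8)

# Both parameterisations describe the same distribution.
same_draws <- isTRUE(all.equal(x_rate, x_scale))
same_draws

# Share of draws above 1, i.e. values that cannot be probabilities.
mean(x_rate > 1)
```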
You used quantiles; datag is not uniform at all, so each state holds the same number of cases and the breaks crowd together near 0. If you want to divide datag differently, you can use cut. E.g., for 10 classes, evenly spaced:
datagcut <- cut(datag$Probability, 10)
table(datagcut)
but then the first class has many cases and the last very few. You can define your own cuts if you like (see ?cut).
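A minimal sketch of defining your own cuts follows. The break points in my_breaks are hypothetical, picked by hand only to show the mechanism (narrower intervals near 0, wider ones in the tail); they are not a recommendation from the original answer.

```r
# Sketch: hand-picked breaks passed to cut (break values are hypothetical).
set.seed(1)
prob <- rgamma(10000, shape = 0.6, rate = 4.8)
prob <- prob[prob <= 1]   # keep values in [0, 1]

# 11 break points define 10 states.
my_breaks <- c(0, 0.05, 0.1, 0.2, 0.3, 0.45, 0.6, 0.75, 0.85, 0.95, 1)

# Integer state (1..10) for every probability in the sample.
state <- cut(prob, breaks = my_breaks, labels = FALSE, include.lowest = TRUE)
table(state)

# States for the three example probabilities mentioned in the question.
example_states <- cut(c(0.212, 0.710, 0.960), breaks = my_breaks,
                      labels = FALSE, include.lowest = TRUE)
example_states   # 4 7 10
```

With these particular breaks the three example values land in states 4, 7 and 10, which is the kind of mapping the question asks for; other break choices would be equally valid.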
Source: https://stackoverflow.com/questions/18122427/finding-a-sensible-range