Finding a sensible range

Question

I've been struggling with this for a few days now. This is my third question on Stack Overflow about the same topic; I hope it is better defined this time.

My data are distributed like this: (histogram)

The x-axis corresponds to the range of probabilities, from 0 to 1.

I want to assign states from state 1 to state 10 sensibly to the probability range.

This is what I have got:

Interval <- round(quantile(datag$Probability, seq(0, 1, by = 0.10)), 3)

output:

   0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100% 
 0.000 0.008 0.015 0.024 0.036 0.054 0.080 0.124 0.209 0.397 1.000

Assign states from 1 to 10:

States <- data.frame(datag, State = findInterval(datag$Probability, Interval))

head(States)

Output: States

Probability      State
0.20585012         8
0.21202839         9
0.07087725         6
0.7109513         10
0.9641807         10

The problem is this: as you can see above, a probability of 0.2120 already gets state 9, and anything above 0.710 gets state 10. I would be happy with something like state 4 for prob = 0.2120, state 7 for prob = 0.710, and state 10 for prob = 0.96.

So how can I assign states more uniformly?

To replicate datag:

datag <- data.frame(Probability=rgamma(10000, shape=0.6, rate=4.8, scale=1/4.8))

EDIT: @Roman:

datag <- subset(datag, Probability<=1)

EDIT: @Simon

Yes, I'm aware of "cut":

table(cut(datag$Probability, breaks = seq(0, 0.8, by = 0.1)))

Output:

(0,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6] (0.6,0.7] (0.7,0.8] 
125545     26625     12795      8126      5556      4108      3227      2606

How should one define the breaks? I am after the intervals (the breaks themselves), so that I can assign each probability the state corresponding to the interval it falls in.

Answer 1:

You've basically got the answer in your OP! Don't take this the wrong way, but I think you need to spend some more time reading the documentation for ?cut. If you set labels = FALSE in cut, you get integer codes for the interval each value falls into, rather than a factor.

#  Set a seed for true reproducibility!
set.seed(1)
datag <- data.frame(Probability=rgamma(10000, shape=0.6, rate=4.8, scale=1/4.8))
Int <- cut( datag$Probability , breaks = seq(0 , 1 , by = 0.1 ) , labels = FALSE )
head( cbind( Prob = datag$Probability , Int ) )
            Prob Int
[1,] 0.031860645   1
[2,] 0.455054687   5
[3,] 0.134175238   2
[4,] 0.058957301   1
[5,] 0.855493999   9
[6,] 0.009144936   1
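
One caveat with this approach, sketched here on made-up sample values rather than the original data: cut returns NA for any value outside the outermost breaks, and with the default right = TRUE a value of exactly 0 is also NA unless include.lowest = TRUE is set. The names p, state_default and state_closed are illustrative only.

```r
# Hypothetical sample values, chosen to hit the edge cases (not from the post)
p <- c(0, 0.05, 0.212, 0.71, 0.96, 1, 1.3)
breaks <- seq(0, 1, by = 0.1)

# Default: intervals are (a, b], so 0 and anything above 1 become NA
state_default <- cut(p, breaks = breaks, labels = FALSE)

# include.lowest = TRUE closes the first interval, making it [0, 0.1]
state_closed <- cut(p, breaks = breaks, labels = FALSE, include.lowest = TRUE)

print(cbind(p, state_default, state_closed))
```

Values above 1 stay NA in both versions, which matters here because the gamma draws are not bounded by 1 unless they are subset first.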

Answer 2:

I ran

datag <- data.frame(Probability=rgamma(10000, shape=0.6, rate=4.8, scale=1/4.8))
datag <- subset(datag, Probability<=1)

The first line gives a warning (rgamma is given both rate and scale; specify only one), which you apparently ignored. And if these are supposed to be probabilities, the second step shouldn't be needed. But onward.

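You used quantiles, which put roughly the same number of observations into each class. datag is not uniform at all, so the class boundaries bunch up near zero and you got what you got. If you want to divide datag differently you can use cut. E.g., for 10 classes, evenly spaced: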

datagcut <- cut(datag$Probability, 10)
table(datagcut)

but then the first class has many cases and the last very few. You can define your own cuts if you like (see ?cut).
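
To make the difference between the two schemes concrete, the sketch below applies both the decile breaks printed in the question and ten equal-width breaks to the five probabilities shown there. The variable names are illustrative and not from the original post, and rightmost.closed and include.lowest are added so that the end points 0 and 1 fall inside a class.

```r
# Five probabilities from the question's head(States) output
p <- c(0.20585012, 0.21202839, 0.07087725, 0.7109513, 0.9641807)

# Decile breaks exactly as printed in the question (rounded to 3 digits)
quantile_breaks <- c(0.000, 0.008, 0.015, 0.024, 0.036, 0.054,
                     0.080, 0.124, 0.209, 0.397, 1.000)

# Ten equal-width classes on [0, 1]
equal_breaks <- seq(0, 1, length.out = 11)

# State under each scheme
state_quantile <- findInterval(p, quantile_breaks, rightmost.closed = TRUE)
state_equal    <- cut(p, breaks = equal_breaks, labels = FALSE,
                      include.lowest = TRUE)

print(data.frame(Probability = p, state_quantile, state_equal))
```

With equal-width breaks, 0.212 lands in state 3 and 0.711 in state 8, which is much closer to what the question asks for, at the cost of very uneven class sizes.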

Source: https://stackoverflow.com/questions/18122427/finding-a-sensible-range

Tags

math

statistics