R: creating a categorical variable from a numerical variable and custom/open-ended/single-valued intervals

佛祖请我去吃肉 2021-01-05 03:54

I often find myself trying to create a categorical variable from a numerical variable + a user-provided set of ranges.

For instance, say that I have a data.frame wi

2 answers
  • 2021-01-05 04:35

    Just use cut():

    df$VCAT2 <- cut(df$V, c(0, 9.999, 10, 20, Inf), labels = FALSE)
    

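    Notice the trick used to define a very narrow bin at 10: the breaks 9.999 and 10 bracket it. With the default right = TRUE, every bin is closed on the right, so 20 itself falls in the third bin.

    • If you need that bin to be as narrow as possible, use 10 - 10*.Machine$double.eps as the break instead of 9.999.
    • You can supply your own labels, e.g. '(0,10)', '[10,10]', '(10,20)', '[20,Inf]', through the labels argument of cut().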
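As a rough sketch of the labels suggestion, assuming a hypothetical data frame df with a numeric column V (the sample values are invented for illustration). The labels here are written to match the right-closed bins that cut() produces by default, and a value of exactly 0 would come out as NA unless include.lowest = TRUE is passed.

```r
# Hypothetical sample data; only the column name V comes from the answer.
df <- data.frame(V = c(0.5, 9.5, 10, 15, 20, 25))

# Breaks as in the answer: 9.999 and 10 bracket a very narrow bin at 10.
breaks <- c(0, 9.999, 10, 20, Inf)

# One label per bin; with right = TRUE (the default) bins are (a, b].
bin_labels <- c("(0,10)", "[10,10]", "(10,20]", "(20,Inf)")

df$VCAT <- cut(df$V, breaks = breaks, labels = bin_labels)
print(df)
```

Passing labels = FALSE instead returns the integer bin codes 1 to 4, as in the one-liner above.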
  • 2021-01-05 04:47

    One way I bin numbers is to subtract the remainder, computed with the modulus operator %%. E.g. to bin into groups of 20:

    #mutate() comes from dplyr
    library(dplyr)

    #create raw data
    unbinned <- c(1.1, 1.53, 5, 8.3, 33.5, 49.22, 55, 57.9, 79.6, 81, 95, 201, 213)
    rawdata <- as.data.frame(unbinned)
    
    #bin the data into groups of 20
    binneddata <- mutate(rawdata, binned = unbinned - unbinned %% 20)
    
    #print the data
    binneddata
    

    This produces the output:

       unbinned binned
    1      1.10      0
    2      1.53      0
    3      5.00      0
    4      8.30      0
    5     33.50     20
    6     49.22     40
    7     55.00     40
    8     57.90     40
    9     79.60     60
    10    81.00     80
    11    95.00     80
    12   201.00    200
    13   213.00    200
    

    So 0 represents 0-<20, 20 represents 20-<40, 40 represents 40-<60, etc. (Of course, divide the binned value by 20 to get sequential groups like in the original question.)
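The step of dividing by 20 can also be done in one go with R's integer-division operator %/%. A minimal sketch, using a shortened copy of the sample vector; the 0-based numbering is an assumption, so add 1 if the groups should start at 1.

```r
# Shortened copy of the sample data from the answer
unbinned <- c(1.1, 33.5, 49.22, 95, 213)
width <- 20

# Lower edge of each bin, as in the answer
bin_start <- unbinned - unbinned %% width

# Sequential (0-based) group index via integer division
group <- unbinned %/% width

print(data.frame(unbinned, bin_start, group))
```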

    Bonus

    If you want to use the binned values as categorical variables in ggplot etc. by converting them into strings, they will sort strangely: 200 will come before 40, because strings are compared character by character and '2' sorts before '4'. To get around this, use the sprintf function to add leading zeros. (The 3 in %03d should be the number of digits in the longest number you expect.)

    #convert the data into strings with leading zeros
    binnedstring<-mutate(binneddata,bin_as_character=sprintf('%03d',binned))
    
    #print the data
    binnedstring
    

    giving the output:

       unbinned binned bin_as_character
    1      1.10      0              000
    2      1.53      0              000
    3      5.00      0              000
    4      8.30      0              000
    5     33.50     20              020
    etc.
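The ordering problem and the effect of padding can be checked directly. This is a small illustration with a made-up vector of bin values, not part of the original answer's pipeline.

```r
# Illustrative bin values (integers, so "%03d" applies directly)
bins <- c(0L, 20L, 40L, 200L)

# Plain string conversion: "200" sorts before "40"
unpadded <- as.character(bins)
sorted_unpadded <- sort(unpadded)

# Zero-padded to 3 digits: string order now matches numeric order
padded <- sprintf("%03d", bins)
sorted_padded <- sort(padded)

print(sorted_unpadded)
print(sorted_padded)
```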
    

    If you want labels of the form 000-<020, compute the upper bound with arithmetic and concatenate the pieces using the paste function:

    #make human readable bin value
    binnedstringband<-mutate(
        binnedstring,
        nextband=binned+20,
        human_readable=paste(bin_as_character,'-<',sprintf('%03d',nextband),sep='')
    )
    
    #print the data
    binnedstringband
    

    Giving:

       unbinned binned bin_as_character nextband     human_readable
    1      1.10      0              000       20           000-<020
    2      1.53      0              000       20           000-<020
    3      5.00      0              000       20           000-<020
    4      8.30      0              000       20           000-<020
    5     33.50     20              020       40           020-<040
    etc.
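As a variant of the same idea without dplyr, the band label can be built in a single sprintf call. This is only a sketch: the names width, lower and band are invented here, and floor() is swapped in for the modulus step so the lower edges are exact whole numbers, which "%d" requires.

```r
# Shortened copy of the sample data from the answer
unbinned <- c(1.1, 33.5, 213)
width <- 20

# Lower edge of each bin, computed with floor() instead of %%
lower <- floor(unbinned / width) * width

# Both edges zero-padded to 3 digits and joined with "-<"
band <- sprintf("%03d-<%03d", lower, lower + width)

print(band)
```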
    