Trying to create a grouped variable in Python

别那么骄傲 2020-12-19 23:01

I have a column of age values that I need to convert to age ranges of 18-29, 30-39, 40-49, 50-59, 60-69, and 70+:

For an example of some of the data in df 'file',

4 answers
  •  太阳男子
    2020-12-19 23:54

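    You can use itertools.groupby with integer division by 10 (x // 10) as the key function, which puts each age into its decade.

    In [9]: import collections, itertools, random

    In [10]: ages = [random.randint(18, 99) for _ in range(100)]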
    
    In [11]: [(key, list(group)) for (key, group) in itertools.groupby(sorted(ages), key=lambda x: x // 10)]
    Out[11]: 
    [(1, [18]),
     (2, [20, 21, 21, 22, 23, 24, 25, 26, 26, 26, 27, 27, 28]),
     (3, [30, 30, 32, 32, 34, 35, 36, 37, 37]),
     (4, [41, 42, 42, 43, 43, 44, 45, 47, 48]),
     (5, [50, 51, 52, 53, 54, 55, 56, 56, 56, 56, 57, 58, 58, 58, 58]),
     (6, [60, 61, 62, 62, 62, 65, 65, 66, 66, 66, 66, 67, 69, 69, 69]),
     (7, [71, 71, 72, 72, 73, 75, 75, 77, 77, 78]),
     (8, [83, 83, 83, 83, 84, 84, 85, 86, 86, 87, 87, 88, 89, 89, 89]),
     (9, [91, 91, 92, 92, 93, 94, 97, 97, 98, 98, 99, 99, 99])]
    

    Remember that groupby only groups consecutive items with the same key, so sort the data first. Alternatively, group manually with a dictionary and a loop, which does not require sorting:

    In [14]: groups = collections.defaultdict(list)
    
    In [15]: for x in ages:
       ....:     groups[x//10].append(x)
    
    In [16]: groups
    Out[16]: defaultdict(<class 'list'>, {1: [18], 
                 2: [26, 28, 21, 20, 26, 24, 21, 27, 25, 23, 27, 26, 22], 
                 3: [37, 30, 32, 32, 35, 30, 36, 37, 34], 
                 4: [45, 42, 43, 41, 47, 43, 48, 44, 42], 
                 5: [52, 56, 58, 55, 58, 51, 58, 58, 57, 56, 53, 56, 50, 54, 56], 
                 6: [69, 65, 62, 61, 65, 66, 66, 62, 69, 66, 67, 66, 60, 62, 69], 
                 7: [71, 77, 71, 72, 77, 73, 78, 72, 75, 75], 
                 8: [87, 83, 84, 86, 86, 83, 83, 87, 85, 83, 89, 88, 84, 89, 89], 
                 9: [99, 92, 99, 98, 91, 94, 97, 92, 98, 97, 91, 93, 99]})
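
To make the sorting caveat concrete, here is a minimal sketch comparing groupby on unsorted input, groupby on sorted input, and the dictionary loop. The sample ages and the helper name decade are made up for illustration and are not part of the original answer.

```python
import itertools
from collections import defaultdict

# Hypothetical sample data, deliberately left unsorted.
ages = [34, 18, 37, 21, 30]


def decade(age):
    """Key function: the tens digit of the age (34 -> 3)."""
    return age // 10


# groupby only merges *consecutive* items with equal keys, so on
# unsorted input the same key shows up several times.
unsorted_keys = [key for key, _ in itertools.groupby(ages, key=decade)]

# Sorting first yields exactly one group per key.
sorted_groups = [
    (key, list(group))
    for key, group in itertools.groupby(sorted(ages), key=decade)
]

# The dictionary approach does not care about input order.
dict_groups = defaultdict(list)
for age in ages:
    dict_groups[decade(age)].append(age)

print(unsorted_keys)      # [3, 1, 3, 2, 3]
print(sorted_groups)      # [(1, [18]), (2, [21]), (3, [30, 34, 37])]
print(dict(dict_groups))  # {3: [34, 37, 30], 1: [18], 2: [21]}
```

Note that the dictionary version keeps the ages within each group in their original order, while the groupby version returns them sorted.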
    

    For more complex grouping, you can make the key function arbitrarily complicated. For example, to put everybody aged 70 and above into one group, use lambda x: min(x // 10, 7). This works for both approaches. You can even convert the key to a string if you prefer that:

    In [23]: keyfunc = lambda x: "{0}0-{0}9".format(x//10) if x < 70 else "70+"
    In [24]: [(key, list(group)) for (key, group) in itertools.groupby(sorted(ages), key=keyfunc)]
    Out[24]: 
    [('10-19', [18]),
     ('20-29', [20, 21, 21, 22, 23, 24, 25, 26, 26, 26, 27, 27, 28]),
     ('30-39', [30, 30, 32, 32, 34, 35, 36, 37, 37]),
     ('40-49', [41, 42, 42, 43, 43, 44, 45, 47, 48]),
     ('50-59', [50, 51, 52, 53, 54, 55, 56, 56, 56, 56, 57, 58, 58, 58, 58]),
     ('60-69', [60, 61, 62, 62, 62, 65, 65, 66, 66, 66, 66, 67, 69, 69, 69]),
     ('70+',   [all the rest])]
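
The string keys above still split 10-19 and 20-29, whereas the question asks for a single 18-29 range. One possible way to get exactly those labels is sketched below. The function name age_range and the sample ages are hypothetical, and the sketch assumes every age is at least 18.

```python
from collections import defaultdict


def age_range(age):
    """Label an age with the ranges from the question.

    Assumption: ages below 18 do not occur in the data; they would
    also be labelled "18-29" here.
    """
    if age >= 70:
        return "70+"
    if age < 30:
        return "18-29"
    return "{0}0-{0}9".format(age // 10)


# Hypothetical sample data covering each range boundary.
ages = [18, 25, 29, 30, 45, 59, 69, 70, 93]

# Dictionary-and-loop grouping, so the input needs no sorting.
by_range = defaultdict(list)
for age in ages:
    by_range[age_range(age)].append(age)

# The numeric variant from the text: everything from 70 up gets key 7.
numeric_keys = [min(age // 10, 7) for age in ages]

print(dict(by_range))
print(numeric_keys)  # [1, 2, 2, 3, 4, 5, 6, 7, 7]
```

The same age_range function could also be passed as key to itertools.groupby on sorted ages, since the labels follow the sort order of the ages.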
    
