SQL-style GROUP BY aggregate functions in jq (COUNT, SUM, etc.)

Submitted by 江枫思渺然 on 2019-12-06 08:40:15

Extended jq solution:

Custom count() function:

jq -sc 'def count($k): group_by(.[$k])[] | length as $l | .[0] 
                       | .pets_count = $l 
                       | del(.pet_id, .pet, .litter); 
        count("owner_id")' source.data

The output:

{"owner_id":1,"owner":"Adams","age":25,"pets_count":2}
{"owner_id":2,"owner":"Baker","age":55,"pets_count":1}
{"owner_id":3,"owner":"Clark","age":40,"pets_count":1}
{"owner_id":4,"owner":"Davis","age":31,"pets_count":3}

Custom sum() function:

jq -sc 'def sum($k): group_by(.[$k])[] | map(.litter) as $litters | .[0] 
                     | . + {litter_total: $litters | add, litter_max: $litters | max} 
                     | del(.pet_id, .pet, .litter); 
        sum("owner_id")' source.data

The output:

{"owner_id":1,"owner":"Adams","age":25,"litter_total":6,"litter_max":4}
{"owner_id":2,"owner":"Baker","age":55,"litter_total":3,"litter_max":3}
{"owner_id":3,"owner":"Clark","age":40,"litter_total":4,"litter_max":4}
{"owner_id":4,"owner":"Davis","age":31,"litter_total":9,"litter_max":4}

Custom array_agg() function:

jq -sc 'def array_agg($k): group_by(.[$k])[] | map(.pet) as $pets | .[0] 
                           | .pets = $pets | del(.pet_id, .pet, .litter); 
        array_agg("owner_id")' source.data

The output:

{"owner_id":1,"owner":"Adams","age":25,"pets":["Bella","Lucy"]}
{"owner_id":2,"owner":"Baker","age":55,"pets":["Daisy"]}
{"owner_id":3,"owner":"Clark","age":40,"pets":["Molly"]}
{"owner_id":4,"owner":"Davis","age":31,"pets":["Lola","Sadie","Luna"]}

This is a nice exercise, but Stack Overflow is not a programming service, so I will focus here on some key concepts for generic jq solutions that remain efficient even for very large collections.

GROUPS_BY

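The key to efficiency here is avoiding the built-in group_by, as it requires sorting. Since jq is fundamentally stream-oriented, the following definition of GROUPS_BY is likewise stream-oriented. It takes advantage of the efficiency of key-based lookups: each item is filed first under the JSON type of its key and then under the key itself, so that, for example, the number 1 and the string "1" end up in different groups, and tojson is only called on non-string keys: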

# emit a stream of the groups defined by f
def GROUPS_BY(stream; f): 
  reduce stream as $x ({};
     ($x|f) as $s
     | ($s|type) as $t
     | (if $t == "string" then $s else ($s|tojson) end) as $y
     | .[$t][$y] += [$x] )
   | .[][] ;
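
As a quick check of GROUPS_BY, the following sketch runs it over three inline records and prints the pet names in each group. The records are hypothetical stand-ins, since the contents of source.data are not shown here, and the variable name groups_output is only for illustration.

```shell
# Hypothetical records; only owner_id and pet matter for this demo.
groups_output=$(printf '%s\n' \
  '{"owner_id":1,"pet":"Bella"}' \
  '{"owner_id":2,"pet":"Daisy"}' \
  '{"owner_id":1,"pet":"Lucy"}' |
  jq -nc '
    # emit a stream of the groups defined by f
    def GROUPS_BY(stream; f):
      reduce stream as $x ({};
         ($x|f) as $s
         | ($s|type) as $t
         | (if $t == "string" then $s else ($s|tojson) end) as $y
         | .[$t][$y] += [$x] )
       | .[][] ;

    # one array of pet names per owner_id
    GROUPS_BY(inputs; .owner_id) | map(.pet)
  ')
printf '%s\n' "$groups_output"
```

This should print ["Bella","Lucy"] followed by ["Daisy"]. The groups come out in the order in which each key was first seen, and no sort is involved.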

distinct and count_distinct

# Emit an array of the distinct entities in `stream`, without sorting
def distinct(stream): 
  reduce stream as $x ({};
      ($x|type) as $t
      | (if $t == "string" then $x else ($x|tojson) end) as $y
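      # `// {}` guards the first item of each type, when .[$t] is still null and has would raise an error
      | if (.[$t] // {} | has($y)) then . else .[$t][$y] += [$x] end )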
   | [.[][]] | add ;


# Emit the number of distinct items in the given stream
def count_distinct(stream):
   def sum(s): reduce s as $x (0;.+$x);
   reduce stream as $x ({};
       ($x|type) as $t
       | (if $t == "string" then $x else ($x|tojson) end) as $y
       | .[$t][$y] = 1 )
   | sum( .[][] ) ;
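
To see how these two functions treat values of different types, here is a small sketch that feeds the same literal stream to both. The definitions are repeated so that the snippet is self-contained. In distinct, the has lookup is guarded with `// {}` in this sketch, so that has is never applied to null. The variable name distinct_output is only for illustration.

```shell
distinct_output=$(jq -nc '
  def distinct(stream):
    reduce stream as $x ({};
        ($x|type) as $t
        | (if $t == "string" then $x else ($x|tojson) end) as $y
        # the `// {}` guard is specific to this sketch
        | if (.[$t] // {} | has($y)) then . else .[$t][$y] += [$x] end )
     | [.[][]] | add ;

  def count_distinct(stream):
    def sum(s): reduce s as $x (0; .+$x);
    reduce stream as $x ({};
        ($x|type) as $t
        | (if $t == "string" then $x else ($x|tojson) end) as $y
        | .[$t][$y] = 1 )
     | sum( .[][] ) ;

  # the number 1 and the string "1" are kept apart by their type
  distinct(1, "1", "a", 1, "a"),
  count_distinct(1, "1", "a", 1, "a")
')
printf '%s\n' "$distinct_output"
```

This should print [1,"1","a"] and then 3: the repeated 1 and "a" are dropped, while 1 and "1" count as different items.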

Convenience function

def owner: {owner_id,owner,age};
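
The body of owner uses jq's object-construction shorthand: {owner_id,owner,age} is the same as {owner_id: .owner_id, owner: .owner, age: .age}. A minimal sketch, with a hypothetical record whose pet_id and litter values are assumptions:

```shell
# Hypothetical record; pet_id and litter values are assumed for illustration.
owner_output=$(printf '%s\n' \
  '{"owner_id":1,"owner":"Adams","age":25,"pet_id":1,"pet":"Bella","litter":4}' |
  jq -c 'def owner: {owner_id,owner,age}; owner')
printf '%s\n' "$owner_output"
```

Only the three owner fields should survive, which is what lets the examples drop pet_id, pet and litter without an explicit del.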

Example: "COUNT the number of pets per owner"

GROUPS_BY(inputs; .owner_id)
| (.[0] | owner) + {pets_count: count_distinct(.[]|.pet_id)}

Invocation: jq -nc -f program1.jq input.json

Output:

{"owner_id":1,"owner":"Adams","age":25,"pets_count":2}
{"owner_id":2,"owner":"Baker","age":55,"pets_count":1}
{"owner_id":3,"owner":"Clark","age":40,"pets_count":1}
{"owner_id":4,"owner":"Davis","age":31,"pets_count":3}
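
For an end-to-end run, program1.jq has to contain the definitions as well as the query. The sketch below assumes that layout and builds both files in a scratch directory. The input is a hypothetical three-record excerpt (the pet_id values are assumptions), so only the first two owners appear.

```shell
# Assemble program1.jq and a small input file in a scratch directory.
workdir=$(mktemp -d)

cat > "$workdir/program1.jq" <<'EOF'
def GROUPS_BY(stream; f):
  reduce stream as $x ({};
     ($x|f) as $s
     | ($s|type) as $t
     | (if $t == "string" then $s else ($s|tojson) end) as $y
     | .[$t][$y] += [$x] )
   | .[][] ;

def count_distinct(stream):
   def sum(s): reduce s as $x (0; .+$x);
   reduce stream as $x ({};
       ($x|type) as $t
       | (if $t == "string" then $x else ($x|tojson) end) as $y
       | .[$t][$y] = 1 )
   | sum( .[][] ) ;

def owner: {owner_id,owner,age};

GROUPS_BY(inputs; .owner_id)
| (.[0] | owner) + {pets_count: count_distinct(.[]|.pet_id)}
EOF

# Hypothetical records: the pet_id values are assumptions.
cat > "$workdir/input.json" <<'EOF'
{"owner_id":1,"owner":"Adams","age":25,"pet_id":1,"pet":"Bella"}
{"owner_id":1,"owner":"Adams","age":25,"pet_id":2,"pet":"Lucy"}
{"owner_id":2,"owner":"Baker","age":55,"pet_id":3,"pet":"Daisy"}
EOF

# -n stops jq from consuming the first record itself, so inputs sees them all.
pets_count_output=$(jq -nc -f "$workdir/program1.jq" "$workdir/input.json")
printf '%s\n' "$pets_count_output"
rm -r "$workdir"
```

With this excerpt, the two lines printed should match the Adams and Baker lines of the output above.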

Example: "SUM up the number of whelps per owner and get their MAX"

GROUPS_BY(inputs; .owner_id)
| (.[0] | owner)
  + {litter_total: (map(.litter) | add)}
  + {litter_max:  (map(.litter) | max)}

Invocation: jq -nc -f program2.jq input.json

Output: the same as for the custom sum() function above.

Example: "ARRAY_AGG pets per owner"

GROUPS_BY(inputs; .owner_id)
| (.[0] | owner) + {pets: distinct(.[]|.pet)}

Invocation: jq -nc -f program3.jq input.json

Output:

{"owner_id":1,"owner":"Adams","age":25,"pets":["Bella","Lucy"]}
{"owner_id":2,"owner":"Baker","age":55,"pets":["Daisy"]}
{"owner_id":3,"owner":"Clark","age":40,"pets":["Molly"]}
{"owner_id":4,"owner":"Davis","age":31,"pets":["Lola","Sadie","Luna"]}