SQL-style GROUP BY aggregate functions in jq (COUNT, SUM and etc)

这一生的挚爱 提交于 2019-12-10 11:16:41

问题


Similar questions asked here before:

Count items for a single key: jq count the number of items in json by a specific key

Calculate the sum of object values: How do I sum the values in an array of maps in jq?

Question

How to emulate the COUNT aggregate function which should behave similarly to its SQL original? Let's extend this question even more to include other regular SQL functions:

  • COUNT
  • SUM / MAX/ MIN / AVG
  • ARRAY_AGG

The last one is not a standard SQL function - it's from PostgreSQL but is quite useful.

At input comes a stream of valid JSON objects. For demonstration let's pick a simple story of owners and their pets.

Model and data

Base relation: Owner

id name  age
 1 Adams  25
 2 Baker  55
 3 Clark  40
 4 Davis  31

Base relation: Pet

id name  litter owner_id
10 Bella      4        1
20 Lucy       2        1
30 Daisy      3        2
40 Molly      4        3
50 Lola       2        4
60 Sadie      4        4
70 Luna       3        4

Source

From above we get a derivative relation Owner_Pet (a result of SQL JOIN of the above relations) presented in JSON format for our jq queries (the source data):

{ "owner_id": 1, "owner": "Adams", "age": 25, "pet_id": 10, "pet": "Bella", "litter": 4 }
{ "owner_id": 1, "owner": "Adams", "age": 25, "pet_id": 20, "pet": "Lucy",  "litter": 2 }
{ "owner_id": 2, "owner": "Baker", "age": 55, "pet_id": 30, "pet": "Daisy", "litter": 3 }
{ "owner_id": 3, "owner": "Clark", "age": 40, "pet_id": 40, "pet": "Molly", "litter": 4 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pet_id": 50, "pet": "Lola",  "litter": 2 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pet_id": 60, "pet": "Sadie", "litter": 4 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pet_id": 70, "pet": "Luna",  "litter": 3 }

Requests

Here are sample requests and their expected output:

  • COUNT the number of pets per owner:
{ "owner_id": 1, "owner": "Adams", "age": 25, "pets_count": 2 }
{ "owner_id": 2, "owner": "Baker", "age": 55, "pets_count": 1 }
{ "owner_id": 3, "owner": "Clark", "age": 40, "pets_count": 1 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pets_count": 3 }
  • SUM up the number of whelps per owner and get their MAX (MIN/AVG):
{ "owner_id": 1, "owner": "Adams", "age": 25, "litter_total": 6, "litter_max": 4 }
{ "owner_id": 2, "owner": "Baker", "age": 55, "litter_total": 3, "litter_max": 3 }
{ "owner_id": 3, "owner": "Clark", "age": 40, "litter_total": 4, "litter_max": 4 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "litter_total": 9, "litter_max": 4 }
  • ARRAY_AGG pets per owner:
{ "owner_id": 1, "owner": "Adams", "age": 25, "pets": [ "Bella", "Lucy" ] }
{ "owner_id": 2, "owner": "Baker", "age": 55, "pets": [ "Daisy" ] }
{ "owner_id": 3, "owner": "Clark", "age": 40, "pets": [ "Molly" ] }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pets": [ "Lola", "Sadie", "Luna" ] }

回答1:


Extended jq solution:

Custom count() function:

jq -sc 'def count($k): group_by(.[$k])[] | length as $l | .[0] 
                       | .pets_count = $l 
                       | del(.pet_id, .pet, .litter); 
        count("owner_id")' source.data

The output:

{"owner_id":1,"owner":"Adams","age":25,"pets_count":2}
{"owner_id":2,"owner":"Baker","age":55,"pets_count":1}
{"owner_id":3,"owner":"Clark","age":40,"pets_count":1}
{"owner_id":4,"owner":"Davis","age":31,"pets_count":3}

Custom sum() function:

jq -sc 'def sum($k): group_by(.[$k])[] | map(.litter) as $litters | .[0] 
                     | . + {litter_total: $litters | add, litter_max: $litters | max} 
                     | del(.pet_id, .pet, .litter); 
        sum("owner_id")' source.data

The output:

{"owner_id":1,"owner":"Adams","age":25,"litter_total":6,"litter_max":4}
{"owner_id":2,"owner":"Baker","age":55,"litter_total":3,"litter_max":3}
{"owner_id":3,"owner":"Clark","age":40,"litter_total":4,"litter_max":4}
{"owner_id":4,"owner":"Davis","age":31,"litter_total":9,"litter_max":4}

Custom array_agg() function:

jq -sc 'def array_agg($k): group_by(.[$k])[] | map(.pet) as $pets | .[0] 
                           | .pets = $pets | del(.pet_id, .pet, .litter); 
        array_agg("owner_id")' source.data

The output:

{"owner_id":1,"owner":"Adams","age":25,"pets":["Bella","Lucy"]}
{"owner_id":2,"owner":"Baker","age":55,"pets":["Daisy"]}
{"owner_id":3,"owner":"Clark","age":40,"pets":["Molly"]}
{"owner_id":4,"owner":"Davis","age":31,"pets":["Lola","Sadie","Luna"]}



回答2:


This is a nice exercise, but SO is not a programming service, so I will focus here on some key concepts for generic solutions in jq that are efficient, even for very large collections.

GROUPS_BY

The key to efficiency here is avoiding the built-in group_by, as it requires sorting. Since jq is fundamentally stream-oriented, the following definition of GROUPS_BY is likewise stream-oriented. It takes advantage of the efficiency of key-based lookups, while avoiding calling tojson on strings:

# emit a stream of the groups defined by f
def GROUPS_BY(stream; f): 
  reduce stream as $x ({};
     ($x|f) as $s
     | ($s|type) as $t
     | (if $t == "string" then $s else ($s|tojson) end) as $y
     | .[$t][$y] += [$x] )
   | .[][] ;

distinct and count_distinct

# Emit an array of the distinct entities in `stream`, without sorting
def distinct(stream): 
  reduce stream as $x ({};
      ($x|type) as $t
      | (if $t == "string" then $x else ($x|tojson) end) as $y
      | if (.[$t] | has($y)) then . else .[$t][$y] += [$x] end )
   | [.[][]] | add ;


# Emit the number of distinct items in the given stream
def count_distinct(stream):
   def sum(s): reduce s as $x (0;.+$x);
   reduce stream as $x ({};
       ($x|type) as $t
       | (if $t == "string" then $x else ($x|tojson) end) as $y
       | .[$t][$y] = 1 )
   | sum( .[][] ) ;

Convenience function

def owner: {owner_id,owner,age};

Example: "COUNT the number of pets per owner"

GROUPS_BY(inputs; .owner_id)
| (.[0] | owner) + {pets_count: count_distinct(.[]|.pet_id)}

Invocation: jq -nc -f program1.jq input.json

Output:

{"owner_id":1,"owner":"Adams","age":25,"pets_count":2}
{"owner_id":2,"owner":"Baker","age":55,"pets_count":1}
{"owner_id":3,"owner":"Clark","age":40,"pets_count":1}
{"owner_id":4,"owner":"Davis","age":31,"pets_count":3}

Example: "SUM up the number of whelps per owner and get their MAX"

GROUPS_BY(inputs; .owner_id)
| (.[0] | owner)
  + {litter_total: (map(.litter) | add)}
  + {litter_max:  (map(.litter) | max)}

Invocation: jq -nc -f program2.jq input.json

Output: as given.

Example: "ARRAY_AGG pets per owner"

GROUPS_BY(inputs; .owner_id)
| (.[0] | owner) + {pets: distinct(.[]|.pet)}

Invocation: jq -nc -f program3.jq input.json

Output:

{"owner_id":1,"owner":"Adams","age":25,"pets":["Bella","Lucy"]}
{"owner_id":2,"owner":"Baker","age":55,"pets":["Daisy"]}
{"owner_id":3,"owner":"Clark","age":40,"pets":["Molly"]}
{"owner_id":4,"owner":"Davis","age":31,"pets":["Lola","Sadie","Luna"]}


来源:https://stackoverflow.com/questions/48321235/sql-style-group-by-aggregate-functions-in-jq-count-sum-and-etc

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!