Finding unique users from linked values

Submitted by 大憨熊 on 2020-01-06 06:40:50

Question


I have values in my table in this form:

id | val1  | val2 
--------------------
1  |   e1  |   m1
2  |   e1  |   m2
3  |   e2  |   m2
4  |   e3  |   m1
5  |   e4  |   m3
6  |   e5  |   m3
7  |   e5  |   m4
8  |   e4  |   m5

From this, I have to recover the unique users, as shown below, and give each of them a unique id.

User1 -> (val1 : e1, e2, e3 | val2: m1, m2)

e1 <-> m1, e1 <-> m2, m1 <-> e3, e2 <-> m2 ( <-> means linked).

e1 is connected to m1.

e1 is connected to m2.

m2 is connected to e2.

So e1,m1 are connected to e2.

Similarly, we find e1, e2, e3, m1, m2 all are linked. We need to identify these chains.


User2 -> (val1 : e4, e5 | val2: m3, m4, m5)

I have written two queries, one grouping by val1 and one grouping by val2, and I join their results in code (Java).

I want to do this directly in a MySQL/BigQuery query, as we are building some reports on this.

Is this possible in a single query? Please help.

Thank you.

Update :

Desired output -

[
 { 
   id : user1,
   val1 : [e1, e2, e3],
   val2 : [m1, m2]
 },
 { 
   id : user2,
   val1 : [e4, e5],
   val2 : [m3, m4, m5]
 }
]

or

id | val1  | val2 | UUID
------------------------
1  |   e1  |   m1 | u1
2  |   e1  |   m2 | u1
3  |   e2  |   m2 | u1
4  |   e3  |   m1 | u1
5  |   e4  |   m3 | u2
6  |   e5  |   m3 | u2
7  |   e5  |   m4 | u2
8  |   e4  |   m5 | u2

To put it simply, assume the values of val1 and val2 are nodes, and two nodes are connected if they appear in the same row.

The rows of the table form graphs (user1, user2) and we need to identify these graphs.
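For illustration only, the graph framing above can be sketched outside SQL as a connected-components problem. The following is a minimal Python sketch using union-find on the sample table; it is not the query being asked for, and the names `ROWS`, `find` and `label_rows` are hypothetical.

```python
# Sketch only: label each row with a connected-component id using union-find.
# ROWS is the sample table from the question (id, val1, val2).
ROWS = [
    (1, "e1", "m1"), (2, "e1", "m2"), (3, "e2", "m2"), (4, "e3", "m1"),
    (5, "e4", "m3"), (6, "e5", "m3"), (7, "e5", "m4"), (8, "e4", "m5"),
]


def find(parent, node):
    """Return the root of node's set, shortening the path along the way."""
    while parent[node] != node:
        parent[node] = parent[parent[node]]  # path halving
        node = parent[node]
    return node


def label_rows(rows):
    """Map each row id to a user label such as 'u1', 'u2', ..."""
    parent = {}
    for _, val1, val2 in rows:
        # Tag the column so a val1 value can never collide with a val2 value.
        a, b = ("val1", val1), ("val2", val2)
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        parent[find(parent, a)] = find(parent, b)  # union the two nodes

    labels, result = {}, {}
    for row_id, val1, _ in rows:
        root = find(parent, ("val1", val1))
        # Number the components in order of first appearance.
        labels.setdefault(root, "u%d" % (len(labels) + 1))
        result[row_id] = labels[root]
    return result


if __name__ == "__main__":
    print(label_rows(ROWS))
```

On the sample data this assigns rows 1 to 4 to one user and rows 5 to 8 to another, which matches the second desired output form.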


Answer 1:


I wanted to jump in with an option for solving your task in pure BigQuery (Standard SQL).

Pre-requisites / assumptions: the source data is in sandbox.temp.id1_id2_pairs.
You should replace this with your own table. If you want to test with dummy data from your question, you can create this table with the pre-requisites query listed at the bottom of this answer (and of course replace sandbox.temp with your own project.dataset).


Make sure you set the respective destination table.

Note: you can find all the respective queries (as text) at the bottom of this answer; the steps themselves are illustrated with screenshots showing the query, the result and the options used.

So, there will be three steps:

Step 1 - Initialization

Here, we just do the initial grouping of id1 values based on their connections through id2.

As you can see, we created a list of all id1 values with their respective connections, based on a simple one-level connection through id2.

Output table is sandbox.temp.groups
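To make the one-level connection concrete, here is a minimal in-memory Python sketch that mirrors what the Step 1 query computes on the dummy data. It is an illustration rather than a replacement for the query; `PAIRS` and `initial_groups` are hypothetical names, and a sorted tuple stands in for the comma-separated string that the SQL builds.

```python
from collections import defaultdict

# Dummy (id1, id2) pairs from the pre-requisites query; the id column is unused.
PAIRS = [
    ("e1", "m1"), ("e1", "m2"), ("e2", "m2"), ("e3", "m1"),
    ("e4", "m3"), ("e5", "m3"), ("e5", "m4"), ("e4", "m5"),
    ("e6", "m6"), ("e7", "m7"), ("e2", "m6"), ("e4", "m55"),
]


def initial_groups(pairs):
    """For each id1, collect every id1 that shares at least one id2 with it."""
    id1s_by_id2 = defaultdict(set)  # plays the role of x2 in the SQL
    for id1, id2 in pairs:
        id1s_by_id2[id2].add(id1)

    groups = defaultdict(set)  # plays the role of x3 in the SQL
    for members in id1s_by_id2.values():
        for id1 in members:
            groups[id1] |= members
    # A sorted tuple stands in for STRING_AGG(i ORDER BY i).
    return {id1: tuple(sorted(grp)) for id1, grp in groups.items()}


if __name__ == "__main__":
    for id1, grp in sorted(initial_groups(PAIRS).items()):
        print(id1, grp)
```

Note that after this step e1 is grouped with e2 and e3 but not yet with e6, because e6 is only reachable through e2. That is what the grouping iterations are for.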

Step 2 - Grouping Iterations

In each iteration we enrich the grouping based on the already established groups.
The source of the query is the output table of the previous step (sandbox.temp.groups), and the destination is the same table (sandbox.temp.groups) with Overwrite.

We continue iterating until the count of found groups is the same as in the previous iteration.

Note: you can just keep two BigQuery Web UI tabs open (as shown above) and, without changing any code, run Grouping and then Check again and again until the iterations converge.

(For the specific data that I used in the pre-requisites section, I had three iterations: the first iteration produced 5 users, the second produced 3 users, and the third again produced 3 users, which indicated that we were done with the iterations.)

Of course, in a real-life case the number of iterations could be more than just three, so we need some sort of automation (see the respective section at the bottom of the answer).
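The iteration logic can be traced in memory with a minimal Python sketch. It assumes the Step 1 result for the dummy data, written out by hand as `STEP1`; the function names are hypothetical, and Python sets stand in for the comma-separated grp strings.

```python
# Step 1 result for the dummy data, written out by hand (id -> grp).
STEP1 = {
    "e1": {"e1", "e2", "e3"}, "e2": {"e1", "e2", "e6"}, "e3": {"e1", "e3"},
    "e4": {"e4", "e5"}, "e5": {"e4", "e5"}, "e6": {"e2", "e6"}, "e7": {"e7"},
}


def enrich(groups):
    """One grouping pass: give each id the union of all groups containing it."""
    merged = {}
    for ident in groups:
        merged[ident] = set().union(
            *(grp for grp in groups.values() if ident in grp)
        )
    return merged


def count_groups(groups):
    """Equivalent of the check query: COUNT(DISTINCT grp)."""
    return len({frozenset(grp) for grp in groups.values()})


def iterate_until_stable(groups):
    """Repeat passes until the distinct-group count stops changing."""
    counts = []
    while True:
        groups = enrich(groups)
        counts.append(count_groups(groups))
        if len(counts) > 1 and counts[-1] == counts[-2]:
            return groups, counts


if __name__ == "__main__":
    final_groups, history = iterate_until_stable(STEP1)
    print(history)
```

The stop condition here mirrors the count check described above. A stricter variant would compare the groups themselves between passes, since on some inputs (a long chain of links, for example) the count can stay the same while the groups are still growing.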

Step 3 – Final Grouping
Once the id1 grouping is complete, we can add the final grouping for id2.

The final result is now in the sandbox.temp.users table.
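As a rough in-memory illustration of this final step, the Python sketch below attaches the id2 values to each converged id1 group. `FINAL_GROUPS` is the converged result for the dummy data written out by hand, and `build_users` is a hypothetical name. One assumption differs from the query: the groups are numbered in sorted order here, whereas ROW_NUMBER() OVER() without an ORDER BY does not guarantee any particular numbering.

```python
# Converged id1 groups for the dummy data, written out by hand.
FINAL_GROUPS = [("e1", "e2", "e3", "e6"), ("e4", "e5"), ("e7",)]

# Dummy (id1, id2) pairs from the pre-requisites query.
PAIRS = [
    ("e1", "m1"), ("e1", "m2"), ("e2", "m2"), ("e3", "m1"),
    ("e4", "m3"), ("e5", "m3"), ("e5", "m4"), ("e4", "m5"),
    ("e6", "m6"), ("e7", "m7"), ("e2", "m6"), ("e4", "m55"),
]


def build_users(final_groups, pairs):
    """Return one record per group with its id1 values and linked id2 values."""
    users = []
    # Assumption: number the groups in sorted order to keep the ids stable.
    for number, grp in enumerate(sorted(final_groups), start=1):
        id2s = sorted({id2 for id1, id2 in pairs if id1 in grp})
        users.append({"id": number, "id1": list(grp), "id2": id2s})
    return users


if __name__ == "__main__":
    for user in build_users(FINAL_GROUPS, PAIRS):
        print(user)
```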

Used queries (do not forget to set the respective destination tables and overwrites where needed, as per the logic and screenshots described above):

Pre-requisites:

#standardSQL
SELECT 1 id, 'e1' id1, 'm1' id2 UNION ALL
SELECT 2,    'e1',     'm2' UNION ALL
SELECT 3,    'e2',     'm2' UNION ALL
SELECT 4,    'e3',     'm1' UNION ALL
SELECT 5,    'e4',     'm3' UNION ALL
SELECT 6,    'e5',     'm3' UNION ALL
SELECT 7,    'e5',     'm4' UNION ALL
SELECT 8,    'e4',     'm5' UNION ALL
SELECT 9,    'e6',     'm6' UNION ALL
SELECT 9,    'e7',     'm7' UNION ALL
SELECT 9,    'e2',     'm6' UNION ALL
SELECT 888,  'e4',     'm55'   

Step 1

#standardSQL
WITH `yourTable` AS (select * from `sandbox.temp.id1_id2_pairs`
), x1 AS (SELECT id1, STRING_AGG(id2) id2s FROM `yourTable` GROUP BY id1
), x2 AS (SELECT id2, STRING_AGG(id1) id1s FROM `yourTable` GROUP BY id2 
), x3 AS (
  SELECT id, (SELECT STRING_AGG(i ORDER BY i) FROM (
    SELECT DISTINCT i FROM UNNEST(SPLIT(id1s)) i)) grp
  FROM (
    SELECT x1.id1 id, STRING_AGG((id1s)) id1s FROM x1 CROSS JOIN x2
    WHERE EXISTS (SELECT y FROM UNNEST(SPLIT(id1s)) y WHERE x1.id1 = y)
    GROUP BY id1) 
)
SELECT * FROM x3 

Step 2 - Grouping

#standardSQL
WITH x3 AS (select * from `sandbox.temp.groups`)
SELECT id, (SELECT STRING_AGG(i ORDER BY i) FROM (
  SELECT DISTINCT i FROM UNNEST(SPLIT(grp)) i)) grp
FROM (
  SELECT a.id, STRING_AGG(b.grp) grp FROM x3 a CROSS JOIN x3 b 
  WHERE EXISTS (SELECT y FROM UNNEST(SPLIT(b.grp)) y WHERE a.id = y)
  GROUP BY a.id )   

Step 2 - Check

#standardSQL
SELECT COUNT(DISTINCT grp) users FROM `sandbox.temp.groups` 

Step 3

#standardSQL
WITH `yourTable` AS (select * from `sandbox.temp.id1_id2_pairs`
), x1 AS (SELECT id1, STRING_AGG(id2) id2s FROM `yourTable` GROUP BY id1 
), x3 as (select * from `sandbox.temp.groups`
), f  AS (SELECT DISTINCT grp FROM x3 ORDER BY grp
)
SELECT ROW_NUMBER() OVER() id, grp id1, 
  (SELECT STRING_AGG(i ORDER BY i) FROM (SELECT DISTINCT i FROM UNNEST(SPLIT(id2)) i)) id2
FROM (
  SELECT grp, STRING_AGG(id2s) id2 FROM f 
  CROSS JOIN x1 WHERE EXISTS (SELECT y FROM UNNEST(SPLIT(f.grp)) y WHERE id1 = y)
  GROUP BY grp)

Automation:
Of course, the above "process" can be executed manually if the iterations converge fast, so that you end up with 10-20 runs. But in more realistic cases you can easily automate this with any client of your choice.
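A minimal sketch of such a driver loop in Python is shown below. The two callables are hypothetical placeholders: `run_grouping` is assumed to run the Step 2 grouping query with sandbox.temp.groups as the destination and Overwrite set, and `count_groups` is assumed to run the Step 2 check query and return its single value. How they talk to BigQuery is left to the client you choose; `max_iterations` is an added safety limit, not part of the original process.

```python
from typing import Callable, Optional


def run_until_converged(
    run_grouping: Callable[[], None],
    count_groups: Callable[[], int],
    max_iterations: int = 100,
) -> int:
    """Run grouping passes until two consecutive checks return the same count.

    Returns the number of grouping passes that were executed.
    """
    previous: Optional[int] = None
    for iteration in range(1, max_iterations + 1):
        run_grouping()            # Step 2 - Grouping (hypothetical callable)
        current = count_groups()  # Step 2 - Check (hypothetical callable)
        if current == previous:
            return iteration
        previous = current
    raise RuntimeError("no convergence after %d iterations" % max_iterations)


if __name__ == "__main__":
    # Stand-ins that replay the counts reported for the dummy data: 5, 3, 3.
    replay = iter([5, 3, 3])
    print(run_until_converged(lambda: None, lambda: next(replay)))
```

Once the loop returns, the Step 3 query can be run against the converged groups table.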



Source: https://stackoverflow.com/questions/47357176/finding-unique-users-from-linked-values
