SQL - find all instances where two columns are the same

Question

So I have a simple table that holds comments from a user that pertain to a specific blog post.

id  |  user           |  post_id  |  comment
----------------------------------------------------------
0   | john@test.com   |  1001     |  great article
1   | bob@test.com    |  1001     |  nice post
2   | john@test.com   |  1002     |  I agree
3   | john@test.com   |  1001     |  thats cool
4   | bob@test.com    |  1002     |  thanks for sharing
5   | bob@test.com    |  1002     |  really helpful
6   | steve@test.com  |  1001     |  spam post about pills

I want to get every row where a user commented on the same post more than once (that is, rows that share both user and post_id). For the data above I would expect:

id  |  user           |  post_id  |  comment
----------------------------------------------------------
0   | john@test.com   |  1001     |  great article
3   | john@test.com   |  1001     |  thats cool
4   | bob@test.com    |  1002     |  thanks for sharing
5   | bob@test.com    |  1002     |  really helpful

I thought DISTINCT was what I needed, but that just gives me unique rows.

Answer 1:

You can use GROUP BY and HAVING to find pairs of user and post_id that have multiple entries:

  SELECT a.*
  FROM table_name a
  JOIN (SELECT user, post_id
        FROM table_name
        GROUP BY user, post_id
        HAVING COUNT(id) > 1
        ) b
  ON a.user = b.user
  AND a.post_id = b.post_id
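
If your engine supports window functions (Hive has them as of 0.11), the same result can probably be had without the join. This is only a sketch: table_name is the placeholder from the query above, the aliases counted and n_comments are made up for illustration, and user is backtick-quoted on the assumption that your Hive version treats it as a reserved word (drop the quoting if your dialect does not need it).

```sql
-- Count the rows in each (user, post_id) group without collapsing them,
-- then keep only the rows whose group holds more than one comment.
SELECT id, `user`, post_id, comment
FROM (
    SELECT t.*,
           COUNT(*) OVER (PARTITION BY `user`, post_id) AS n_comments
    FROM table_name t
) counted
WHERE n_comments > 1
ORDER BY id;
```

For the sample data in the question this keeps the rows with id 0, 3, 4 and 5.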

Answer 2:

DISTINCT removes all duplicate rows, which is why you're getting unique rows.
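
To make the difference concrete, here is a small sketch against the question's sample data. The table name MYTABLE and the alias n_comments are assumptions for illustration, and user is backtick-quoted in case your Hive version reserves it.

```sql
-- DISTINCT lists every (user, post_id) pair exactly once, repeated or not:
-- five rows for the sample data, including (steve@test.com, 1001).
SELECT DISTINCT `user`, post_id
FROM MYTABLE;

-- Grouping with HAVING keeps only the pairs that occur more than once:
-- (john@test.com, 1001) and (bob@test.com, 1002) for the sample data.
SELECT `user`, post_id, COUNT(*) AS n_comments
FROM MYTABLE
GROUP BY `user`, post_id
HAVING COUNT(*) > 1;
```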

You can try using a CROSS JOIN (available as of Hive 0.10 according to https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins):

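-- DISTINCT keeps a row from appearing once per partner when a user
-- has three or more comments on the same post.
SELECT DISTINCT mt.id, mt.user, mt.post_id, mt.comment
FROM MYTABLE mt
CROSS JOIN MYTABLE mt2
WHERE mt.user = mt2.user
AND mt.post_id = mt2.post_id
AND mt.id <> mt2.id  -- without this, every row matches itself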

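Performance might not be the best, though, since the table is joined against itself. If you want the result sorted, use ORDER BY for a total ordering, or SORT BY to order rows within each reducer.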

Answer 3:

-- T-SQL (SQL Server): load the question's sample rows into a table variable.
DECLARE @MyTable TABLE (id int, usr varchar(50), post_id int, comment varchar(50))
INSERT @MyTable (id, usr, post_id, comment) VALUES (0,'john@test.com',1001,'great article')
INSERT @MyTable (id, usr, post_id, comment) VALUES (1,'bob@test.com',1001,'nice post')
INSERT @MyTable (id, usr, post_id, comment) VALUES (2,'john@test.com',1002,'I agree')
INSERT @MyTable (id, usr, post_id, comment) VALUES (3,'john@test.com',1001,'thats cool')
INSERT @MyTable (id, usr, post_id, comment) VALUES (4,'bob@test.com',1002,'thanks for sharing')
INSERT @MyTable (id, usr, post_id, comment) VALUES (5,'bob@test.com',1002,'really helpful')
INSERT @MyTable (id, usr, post_id, comment) VALUES (6,'steve@test.com',1001,'spam post about pills')

SELECT
    T1.id,
    T1.usr,
    T1.post_id,
    T1.comment
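FROM
    @MyTable T1
    -- Self-join on user and post. Each row also matches itself,
    -- so a count above 1 means at least one other matching comment.
    INNER JOIN @MyTable T2
    ON T1.usr = T2.usr AND T1.post_id = T2.post_id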
GROUP BY
    T1.id,
    T1.usr,
    T1.post_id,
    T1.comment
HAVING
    Count(T2.id) > 1

Source: https://stackoverflow.com/questions/27726186/sql-find-all-instances-where-two-columns-are-the-same

Tags

sql

Hive

hiveql