Get last distinct set of records

后端 未结 5 633
既然无缘
既然无缘 2020-12-08 00:06

I have a database table containing the following columns:

id   code   value   datetime   timestamp

In this table the only unique values res

5条回答
  •  轻奢々
    轻奢々 (楼主)
    2020-12-08 01:07

    Looks like all existing answers suggest to do GROUP BY code on the whole table. When it's logically correct, in reality this query will go through the whole(!) table (use EXPLAIN to make sure). In my case, I have less than 500k of rows in the table and executing ...GROUP BY codetakes 0.3 seconds which is absolutely not acceptable.

    However I can use knowledge of my data here (read as "show last comments for posts"):

    • I need to select just top-20 records
    • Amount of records with same code across last X records is relatively small (~uniform distribution of comments across posts, there are no "viral" post which got all the recent comments)
    • Total amount of records >> amount of available code's >> amount of "top" records you want to get

    By experimenting with numbers I found out that I can always find 20 different code if I select just last 50 records. And in this case following query works (keeping in mind @smdrager comment about high probability to use id instead of datetime)

    SELECT id, code
    FROM tablename
    ORDER BY id DESC 
    LIMIT 50
    

    Selecting just last 50 entries is super quick, because it doesn't need to check the whole table. And the rest is to select top-20 with distinct code out of those 50 entries.

    Obviously, queries on the set of 50 (100, 500) elements are significantly faster than on the whole table with hundreds of thousands entries.

    Raw SQL "Postprocessing"

    SELECT MAX(id) as id, code FROM 
        (SELECT id, code
         FROM tablename
         ORDER BY id DESC 
         LIMIT 50) AS nested 
    GROUP BY code
    ORDER BY id DESC 
    LIMIT 20
    

    This will give you list of id's really quick and if you want to perform additional JOINs, put this query as yet another nested query and perform all joins on it.

    Backend-side "Postprocessing"

    And after that you need to process the data in your programming language to include to the final set only the records with distinct code.

    Some kind of Python pseudocode:

    records = select_simple_top_records(50)
    added_codes = set()
    top_records = []
    for record in records:
        # If record for this code was already found before
        # Note: this is not optimal, better to use structure allowing O(1) search and insert
        if record['code'] in added_codes:
            continue
        # Save record
        top_records.append(record)
        added_codes.add(record['code'])
        # If we found all top-20 required, finish
        if len(top_records) >= 20:
            break
    

提交回复
热议问题