Proper way to access latest row for each individual identifier?

Asked by 轻奢々 on 2021-01-03 11:14

I have a table core_message in Postgres, with millions of rows, that looks like this (simplified):

(table layout truncated in the original; the columns relevant here are mmsi, the vessel identifier, and time)
5 Answers

    陌清茗 (original poster) · 2021-01-03 12:11

    This answer goes in the same direction as the DISTINCT ON answer here, but it also mentions this:

    For many rows per customer (low cardinality in column customer), a loose index scan (a.k.a. "skip scan") would be (much) more efficient, but that's not implemented up to Postgres 12. (An implementation for index-only scans is in development for Postgres 13. See here and here.)
    For now, there are faster query techniques to substitute for this. In particular if you have a separate table holding unique customers, which is the typical use case. But also if you don't:

    • Optimize GROUP BY query to retrieve latest row per user (a sketch of that technique follows below)
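
    The bullet above is only a link, so here is a minimal sketch of that emulated loose index scan, adapted to this table. It assumes an index on core_message (mmsi, time DESC) and is not the exact query from the linked answer:

    WITH RECURSIVE cte AS (
       (  -- anchor: smallest mmsi with its latest time
       SELECT mmsi, time
       FROM   core_message
       ORDER  BY mmsi, time DESC
       LIMIT  1
       )
       UNION ALL
       SELECT m.*
       FROM   cte c
       CROSS  JOIN LATERAL (
          SELECT mmsi, time        -- skip ahead to the next identifier
          FROM   core_message
          WHERE  mmsi > c.mmsi
          ORDER  BY mmsi, time DESC
          LIMIT  1
          ) m
       )
    SELECT * FROM cte;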

    Using this other great answer, I found a way to get that kind of speed by combining a separate table of unique identifiers with LATERAL. With a new table test_boats, I can do something like this:

     CREATE TABLE test_boats AS (select distinct on (mmsi) mmsi from core_message);
    

    Creating this table takes 40+ seconds, which is pretty similar to the time taken by the other answer here.
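
    A small optional step, not in the original: since test_boats holds exactly one row per identifier, it can carry a primary key, which enforces that uniqueness and gives the planner accurate information:

    -- optional: mmsi is unique by construction, so it can be the primary key
    ALTER TABLE test_boats ADD PRIMARY KEY (mmsi);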

    Then, with the help of LATERAL:

    SELECT a.mmsi, b.time
    FROM test_boats a
    CROSS JOIN LATERAL (
        SELECT b.time
        FROM core_message b
        WHERE a.mmsi = b.mmsi      -- one probe per identifier
        ORDER BY b.time DESC       -- latest row first
        LIMIT 1
    ) b
    LIMIT 10;                      -- sample of 10 identifiers for the timing test
    

    This is blazingly fast: around 1 millisecond.
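
    For the lateral probes to be that fast, core_message needs a matching B-tree index so each probe is a single index descent. The index is not shown in the question, so this one is an assumption:

    -- assumed index (not shown in the original); its order matches the
    -- ORDER BY in the lateral subquery
    CREATE INDEX core_message_mmsi_time_idx ON core_message (mmsi, time DESC);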

    This requires modifying my program's logic and using a slightly more complex query, but I think I can live with that.

    For a fast solution without the need to create a new table, check out @ErwinBrandstetter's answer below.


    UPDATE: I feel this question is not quite answered yet, as it is not clear why the other proposed solutions perform poorly here.

    I tried the benchmark mentioned here. At first, the DISTINCT ON way seems fast enough if you run a request like the one proposed in the benchmark: +/- 30 ms on my computer. But that is only because the request uses an index-only scan. If you include a field that is not in the index (some_column in the case of the benchmark), the performance drops to +/- 100 ms.
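
    A quick way to check which plan a variant gets is EXPLAIN. This is a sketch, assuming the benchmark table is named purchases with the same columns as purchases_more below:

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT DISTINCT ON (customer_id) id, customer_id, total
    FROM purchases
    ORDER BY customer_id, total DESC, id;
    -- "Index Only Scan" in the plan means every selected column is covered
    -- by the index; adding some_column to the SELECT list forces heap access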

    That is not a dramatic drop in performance yet, which is why we need a benchmark with a bigger data set, something similar to my case: 40K customers and 8M rows.
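
    As an illustration only (not the actual setup script), such a table could be generated along these lines; the column names and the index are inferred from the queries below:

    CREATE TABLE purchases_more AS
    SELECT g                                    AS id
         , (random() * 40000)::int              AS customer_id  -- ~40K customers
         , round((random() * 1000)::numeric, 2) AS total
         , md5(g::text)                         AS some_column  -- not in the index
    FROM generate_series(1, 8000000) g;

    -- the index implied by the index-only scan on (id, customer_id, total)
    CREATE INDEX ON purchases_more (customer_id, total DESC, id);
    VACUUM ANALYZE purchases_more;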

    Let's try DISTINCT ON again with this new table:

    SELECT DISTINCT ON (customer_id) id, customer_id, total 
    FROM purchases_more 
    ORDER BY customer_id, total DESC, id;
    

    This takes about 1.5 seconds to complete.

    SELECT DISTINCT ON (customer_id) *
    FROM purchases_more 
    ORDER BY customer_id, total DESC, id;
    

    This takes about 35 seconds to complete: selecting all columns defeats the index-only scan, so Postgres has to read the full rows from the heap before sorting them.

    Now, to come back to my first solution above: it uses an index-only scan and a LIMIT, which is part of why it is extremely fast. If I recraft that query to avoid the index-only scan and drop the LIMIT:

    SELECT b.*
    FROM test_boats a
    CROSS JOIN LATERAL (
        SELECT b.*                 -- all columns: rules out an index-only scan
        FROM core_message b
        WHERE a.mmsi = b.mmsi
        ORDER BY b.time DESC
        LIMIT 1
    ) b;
    

    This takes about 500 ms, which is still pretty fast.

    For a more in-depth benchmark of sorts, see my other answer below.
