How to fetch rows with max update datetime using GROUP BY and HAVING with SQLAlchemy and Postgresql

Question

I'm migrating from SQLite to PostgreSQL, and one of my queries has stopped working. It's not clear to me why the query is allowed in SQLite but not in PostgreSQL. The query in question is in the find_recent_by_section_id_list() function below.

I've tried rewriting the query in several ways, but what confuses me is that the original worked under SQLite.

The setup is Flask, SQLAlchemy, Flask-SQLAlchemy and PostgreSQL.

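from typing import List

from sqlalchemy import UniqueConstraint, func

# db is assumed to be the application's Flask-SQLAlchemy instance.


class SectionStatusModel(db.Model):
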
    __tablename__ = "sectionstatus"
    _id = db.Column(db.Integer, primary_key=True)
    update_datetime = db.Column(db.DateTime, nullable=False)
    status = db.Column(db.Integer, nullable=False, default=0)
    section_id = db.Column(db.Integer, db.ForeignKey("sections._id"), nullable=False)

    __table_args__ = (
        UniqueConstraint("section_id", "update_datetime", name="section_time"),
    )


    @classmethod
    def find_recent_by_section_id_list(
        cls, section_id_list: List
    ) -> List["SectionStatusModel"]:

        return (
            cls.query.filter(cls.section_id.in_(section_id_list))
            .group_by(cls.section_id)
            .having(func.max(cls.update_datetime) == cls.update_datetime)
        )

I expected this query to return the latest status for each section, but instead I get the following error:

E       sqlalchemy.exc.ProgrammingError: (psycopg2.ProgrammingError) column "sectionstatus._id" must appear in the GROUP BY clause or be used in an aggregate function
E       LINE 1: SELECT sectionstatus._id AS sectionstatus__id, sectionstatus...
E                      ^
E       
E       [SQL: SELECT sectionstatus._id AS sectionstatus__id, sectionstatus.update_datetime AS sectionstatus_update_datetime, sectionstatus.status AS sectionstatus_status, sectionstatus.section_id AS sectionstatus_section_id 
E       FROM sectionstatus 
E       WHERE sectionstatus.section_id IN (%(section_id_1)s, %(section_id_2)s) GROUP BY sectionstatus.section_id 
E       HAVING max(sectionstatus.update_datetime) = sectionstatus.update_datetime]
E       [parameters: {'section_id_1': 1, 'section_id_2': 2}]
E       (Background on this error at: http://sqlalche.me/e/f405)

This is the output from a test suite.

Answer 1:

SQLite accepts the query because it lets SELECT list items refer to ungrouped columns outside of aggregate functions, even when those columns are not functionally dependent on the grouping expressions. The non-aggregate values are then picked from an arbitrary row in the group.

In addition, a side note in the documentation describes special processing of "bare" columns in an aggregate query when the aggregate is min() or max() ¹:

When the min() or max() aggregate functions are used in an aggregate query, all bare columns in the result set take values from the input row which also contains the minimum or maximum.

This only applies to simple queries, and ambiguity remains if more than one row shares the same min/max value, or if the query contains more than one call to min() / max().
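As a rough illustration of the bare-column behaviour quoted above, here is a minimal sketch using Python's standard sqlite3 module. The table mirrors the sectionstatus model from the question, but the sample rows are made up, the timestamps are stored as text, and max() is placed in the SELECT list rather than in HAVING to keep the example simple.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sectionstatus (
        _id INTEGER PRIMARY KEY,
        update_datetime TEXT NOT NULL,
        status INTEGER NOT NULL DEFAULT 0,
        section_id INTEGER NOT NULL,
        UNIQUE (section_id, update_datetime)
    );
    -- Hypothetical sample data: two statuses per section.
    INSERT INTO sectionstatus (_id, update_datetime, status, section_id) VALUES
        (1, '2019-03-01 10:00:00', 10, 1),
        (2, '2019-03-02 10:00:00', 20, 1),
        (3, '2019-03-03 10:00:00', 30, 2),
        (4, '2019-03-01 10:00:00', 40, 2);
""")

# _id and status are "bare" columns: neither grouped nor aggregated.
# SQLite fills them from the row that holds max(update_datetime).
rows = conn.execute("""
    SELECT _id, status, section_id, max(update_datetime)
    FROM sectionstatus
    GROUP BY section_id
    ORDER BY section_id
""").fetchall()

for row in rows:
    print(row)
```

With this data each group has a single newest row, so the bare columns are unambiguous. PostgreSQL would reject the same SELECT with the error shown in the question.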

This makes SQLite non-conforming in this respect, at least with regard to the SQL:2003 standard (I'm fairly certain that this has not changed much in newer versions):

7.12 <query specification>

Function

Specify a table derived from the result of a <table expression>.

Format
<query specification> ::=
    SELECT [ <set quantifier> ] <select list> <table expression>
...

Conformance Rules

...

3) Without Feature T301, “Functional dependencies”, in conforming SQL language, if T is a grouped table, then in each <value expression> contained in the <select list>, each <column reference> that references a column of T shall reference a grouping column or be specified in an aggregated argument of a <set function specification>.

Most other SQL DBMSs, such as PostgreSQL, follow the standard more closely in this respect and require that the SELECT list of an aggregate query consist only of grouping expressions and aggregate expressions, or that any ungrouped columns be functionally dependent on the grouped columns.

In PostgreSQL a different approach is therefore required to fetch this kind of greatest-n-per-group result. There are many great posts that cover this topic, but here's a summary of one PostgreSQL-specific approach. Using the DISTINCT ON extension combined with ORDER BY you can achieve the same results:

@classmethod
def find_recent_by_section_id_list(
        cls, section_id_list: List) -> List["SectionStatusModel"]:
    return (
        cls.query
        .filter(cls.section_id.in_(section_id_list))
        .distinct(cls.section_id)
        # Use _id as a tie breaker, in order to avoid non-determinism
        .order_by(cls.section_id, cls.update_datetime.desc(), cls._id)
    )

Naturally this will then break in SQLite, as it does not support DISTINCT ON. If you need a solution that works in both, use the row_number() window function approach.
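The row_number() approach mentioned above could look roughly like the following sketch. It is written as raw SQL run through Python's standard sqlite3 module so that it is self-contained; it assumes an SQLite build with window function support (3.25 or newer), made-up sample rows, and a plain function taking a connection instead of the model classmethod from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sectionstatus (
        _id INTEGER PRIMARY KEY,
        update_datetime TEXT NOT NULL,
        status INTEGER NOT NULL DEFAULT 0,
        section_id INTEGER NOT NULL,
        UNIQUE (section_id, update_datetime)
    );
    -- Hypothetical sample data: two statuses per section.
    INSERT INTO sectionstatus (_id, update_datetime, status, section_id) VALUES
        (1, '2019-03-01 10:00:00', 10, 1),
        (2, '2019-03-02 10:00:00', 20, 1),
        (3, '2019-03-03 10:00:00', 30, 2),
        (4, '2019-03-01 10:00:00', 40, 2);
""")


def find_recent_by_section_id_list(conn, section_id_list):
    """Return the newest sectionstatus row for each requested section."""
    if not section_id_list:
        # An empty IN () list is not valid SQL in PostgreSQL, so bail out early.
        return []
    # "?" is sqlite3's placeholder style; psycopg2 would use "%s" instead.
    placeholders = ", ".join("?" for _ in section_id_list)
    sql = f"""
        SELECT _id, update_datetime, status, section_id
        FROM (
            SELECT s.*,
                   row_number() OVER (
                       PARTITION BY section_id
                       -- _id breaks ties, as in the DISTINCT ON version
                       ORDER BY update_datetime DESC, _id
                   ) AS rn
            FROM sectionstatus AS s
            WHERE section_id IN ({placeholders})
        ) AS ranked
        WHERE rn = 1
        ORDER BY section_id
    """
    return conn.execute(sql, list(section_id_list)).fetchall()


latest = find_recent_by_section_id_list(conn, [1, 2])
```

The inner query numbers the rows of each section from newest to oldest, and the outer query keeps only row number 1. In SQLAlchemy the same window expression can be built with func.row_number().over(partition_by=..., order_by=...) inside a subquery that is then filtered on the row number.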

¹ Note that this means that your HAVING clause in fact does not filter much at all, since the ungrouped value will always be picked from the row containing the maximum value. It is the mere presence of max(update_datetime) that does the trick.

Source: https://stackoverflow.com/questions/55419442/how-to-fetch-rows-with-max-update-datetime-using-group-by-and-having-with-sqlalc

Tags

python

postgresql

sqlite

group-by

sqlalchemy