Terribly degraded performance with other join conditions in $lookup (using pipeline)

Submitted by 梦想的初衷 on 2020-05-13 22:55:29

Question


During a code review I decided to improve the performance of an existing query by rewriting this aggregation:

    .aggregate([
        // difference starts here
        {
            $lookup: {
                from: 'sessions',
                localField: '_id',
                foreignField: '_client',
                as: 'sessions'
            }
        },
        {
            $unwind: '$sessions'
        },
        {
            $match: {
                'sessions.deleted_at': null
            }
        },
        // difference ends here
        {
            $project: {
                name: client_name_concater,
                email: '$email',
                phone: '$phone',
                address: addressConcater,
                updated_at: '$updated_at'
            }
        }
    ]);

to this:

    .aggregate([
        // difference starts here
        {
            $lookup: {
                from: 'sessions',
                let: {
                    id: '$_id'
                },
                pipeline: [
                    {
                        $match: {
                            $expr: {
                                $and: [
                                    { $eq: ['$_client', '$$id'] },
                                    { $eq: ['$deleted_at', null] }
                                ]
                            }
                        }
                    }
                ],
                as: 'sessions'
            }
        },
        {
            $match: {
                sessions: { $ne: [] }
            }
        },
        // difference ends here
        {
            $project: {
                name: client_name_concater,
                email: '$email',
                phone: '$phone',
                address: addressConcater,
                updated_at: '$updated_at'
            }
        }
    ]);

I thought the second option should be better, since it has one less stage, but the performance difference is massive in the opposite direction: the first query runs in ~40 ms on average, while the other takes 3.5 - 5 seconds, roughly 100 times longer. The joined collection (sessions) has around 120 documents, and this one about 152. Even if some slowdown were acceptable due to data size, why such a difference between these two? Isn't it basically the same thing? We are just adding the join condition to the pipeline alongside the main join condition. Am I missing something?

The functions and variables used there (client_name_concater, addressConcater) are mostly static values or concatenations that shouldn't affect the $lookup part.

Thanks

EDIT:

Added the query plans. For version 1:

{
        "stages": [
            {
                "$cursor": {
                    "query": {
                        "$and": [
                            {
                                "deleted_at": null
                            },
                            {}
                        ]
                    },
                    "fields": {
                        "email": 1,
                        "phone": 1,
                        "updated_at": 1,
                        "_id": 1
                    },
                    "queryPlanner": {
                        "plannerVersion": 1,
                        "namespace": "test.clients",
                        "indexFilterSet": false,
                        "parsedQuery": {
                            "deleted_at": {
                                "$eq": null
                            }
                        },
                        "winningPlan": {
                            "stage": "COLLSCAN",
                            "filter": {
                                "deleted_at": {
                                    "$eq": null
                                }
                            },
                            "direction": "forward"
                        },
                        "rejectedPlans": []
                    }
                }
            },
            {
                "$lookup": {
                    "from": "sessions",
                    "as": "sessions",
                    "localField": "_id",
                    "foreignField": "_client",
                    "unwinding": {
                        "preserveNullAndEmptyArrays": false
                    }
                }
            },
            {
                "$project": {
                    "_id": true,
                    "email": "$email",
                    "phone": "$phone",
                    "updated_at": "$updated_at"
                }
            }
        ],
        "ok": 1
    }

For version 2:

{
        "stages": [
            {
                "$cursor": {
                    "query": {
                        "deleted_at": null
                    },
                    "fields": {
                        "email": 1,
                        "phone": 1,
                        "sessions": 1,
                        "updated_at": 1,
                        "_id": 1
                    },
                    "queryPlanner": {
                        "plannerVersion": 1,
                        "namespace": "test.clients",
                        "indexFilterSet": false,
                        "parsedQuery": {
                            "deleted_at": {
                                "$eq": null
                            }
                        },
                        "winningPlan": {
                            "stage": "COLLSCAN",
                            "filter": {
                                "deleted_at": {
                                    "$eq": null
                                }
                            },
                            "direction": "forward"
                        },
                        "rejectedPlans": []
                    }
                }
            },
            {
                "$lookup": {
                    "from": "sessions",
                    "as": "sessions",
                    "let": {
                        "id": "$_id"
                    },
                    "pipeline": [
                        {
                            "$match": {
                                "$expr": {
                                    "$and": [
                                        {
                                            "$eq": [
                                                "$_client",
                                                "$$id"
                                            ]
                                        },
                                        {
                                            "$eq": [
                                                "$deleted_at",
                                                null
                                            ]
                                        }
                                    ]
                                }
                            }
                        }
                    ]
                }
            },
            {
                "$match": {
                    "sessions": {
                        "$not": {
                            "$eq": []
                        }
                    }
                }
            },
            {
                "$project": {
                    "_id": true,
                    "email": "$email",
                    "phone": "$phone",
                    "updated_at": "$updated_at"
                }
            }
        ],
        "ok": 1
    }

One thing of note: the joined sessions collection has certain properties containing very large data (some imported data), so I am wondering whether this affects the query in some way due to document size. But why the difference between the two $lookup versions?
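For reference, the plans above can be reproduced with explain (the database and collection names here are taken from the "namespace" field in the plans; the actual calls need a live connection, so they are shown as comments):

```javascript
// In mongosh (requires a live connection to the "test" database):
//   db.clients.explain("queryPlanner").aggregate(pipeline);
// or, for timing details (support varies by server version):
//   db.clients.explain("executionStats").aggregate(pipeline);

// The extra stage that version 2 adds after the $lookup, as a plain object:
const postLookupMatch = { $match: { sessions: { $ne: [] } } };
console.log(JSON.stringify(postLookupMatch));
```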


Answer 1:


The second version adds an aggregation pipeline execution for each document in the joined collection.

The documentation says:

Specifies the pipeline to run on the joined collection. The pipeline determines the resulting documents from the joined collection. To return all documents, specify an empty pipeline [].

The pipeline is executed for each document in the collection, not for each matched document.

Depending on how large the collection is (both the number of documents and document size), this can add up to a significant amount of time.
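To illustrate why this matters, here is a toy cost model (an illustration only, not MongoDB's actual implementation): a localField/foreignField join can process the foreign collection once and probe it per local document, while the pipeline form effectively runs its $match over every foreign document for every local document, using the 152/120 document counts from the question:

```javascript
// Toy cost model of the two $lookup forms (illustration, not MongoDB internals).
const clients = Array.from({ length: 152 }, (_, i) => ({ _id: i }));
const sessions = Array.from({ length: 120 }, (_, i) => ({
  _client: i % 152,
  deleted_at: null,
}));

// Version 1 style: one pass over sessions to build a lookup map,
// then one probe per client.
let hashJoinWork = 0;
const byClient = new Map();
for (const s of sessions) {
  hashJoinWork++; // visit each session once
  const list = byClient.get(s._client) || [];
  list.push(s);
  byClient.set(s._client, list);
}
for (const c of clients) {
  hashJoinWork++; // one map probe per client
  void (byClient.get(c._id) || []);
}

// Version 2 style: evaluate the sub-pipeline's $match against every
// session document for every client document.
let pipelineWork = 0;
for (const c of clients) {
  for (const s of sessions) {
    pipelineWork++; // visit each session once per client
    void (s._client === c._id && s.deleted_at === null);
  }
}

console.log(hashJoinWork); // 120 + 152 = 272 document visits
console.log(pipelineWork); // 152 * 120 = 18240 document visits
```

Even on small collections the per-document execution multiplies the work; larger documents (like the imported data mentioned in the question) make each of those visits more expensive as well.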

after removing the limit, the pipeline version jumped to over 10 seconds

Makes sense - the aggregation pipeline must also be executed for all of the additional documents introduced by removing the limit.

It is possible that per-document execution of aggregation pipeline isn't as optimized as it could be. For example, if the pipeline is set up and torn down for each document, there could easily be more overhead in that than in the $match conditions.

Is there any case when using one or the other?

Executing an aggregation pipeline per joined document provides additional flexibility. If you need this flexibility, it may make sense to execute the pipeline, though performance needs to be considered regardless. If you don't, it is sensible to use a more performant approach.



Source: https://stackoverflow.com/questions/61614202/terribly-degraded-performance-with-other-join-conditions-in-lookup-using-pipel
