MongoDB - Use aggregation framework or mapreduce for matching array of strings within documents (profile matching)

Question

I'm building an application that could be likened to a dating application.

I've got some documents with a structure like this:

$ db.profiles.find().pretty()

[
  {
    "_id": 1,
    "firstName": "John",
    "lastName": "Smith",
    "fieldValues": [
      "favouriteColour|red",
      "food|pizza",
      "food|chinese"
    ]
  },
  {
    "_id": 2,
    "firstName": "Sarah",
    "lastName": "Jane",
    "fieldValues": [
      "favouriteColour|blue",
      "food|pizza",
      "food|mexican",
      "pets|yes"
    ]
  },
  {
    "_id": 3,
    "firstName": "Rachel",
    "lastName": "Jones",
    "fieldValues": [
      "food|pizza"
    ]
  }
]

What I'm trying to do is identify profiles that match each other on one or more fieldValues.

So, in the example above, my ideal result would look something like:

<some query>

result:
[
  {
    "_id": "507f1f77bcf86cd799439011",
    "dateCreated": "2013-12-01",
    "profiles": [
      {
        "_id": 1,
        "firstName": "John",
        "lastName": "Smith",
        "fieldValues": [
          "favouriteColour|red",
          "food|pizza",
          "food|chinese"
        ]
      },
      {
        "_id": 2,
        "firstName": "Sarah",
        "lastName": "Jane",
        "fieldValues": [
          "favouriteColour|blue",
          "food|pizza",
          "food|mexican",
          "pets|yes"
        ]
      }
    ]
  },
  {
    "_id": "356g1dgk5cf86cd737858595",
    "dateCreated": "2013-12-02",
    "profiles": [
      {
        "_id": 1,
        "firstName": "John",
        "lastName": "Smith",
        "fieldValues": [
          "favouriteColour|red",
          "food|pizza",
          "food|chinese"
        ]
      },
      {
        "_id": 3,
        "firstName": "Rachel",
        "lastName": "Jones",
        "fieldValues": [
          "food|pizza"
        ]
      }
    ]
  }
]

I've thought about doing this either as a map reduce, or with the aggregation framework.

Either way, the result would be persisted to a collection (as per the 'result' above).

My question is which of the two would be more suited? And where would I start to implement this?

Edit

In a nutshell, the model can't easily be changed.
This isn't like a 'profile' in the traditional sense.

What I'm basically looking to do (in pseudocode) is along the lines of:

foreach profile in db.profiles.find()
  foreach otherProfile in db.profiles.find({"_id": {$ne: profile._id}})
    if profile.fieldValues matches any otherProfile.fieldValues
      // it's a match!

Obviously that kind of operation is very very slow!

It may also be worth mentioning that this data is never displayed; it's literally just a string value that's used for 'matching'.

Answer 1:

MapReduce would run JavaScript in a separate thread, using the code you provide to emit and reduce parts of your documents so that they can be aggregated on certain fields. You can certainly look at the exercise as aggregating over each "fieldValue". The aggregation framework can do this as well, and would be much faster because the aggregation runs on the server in C++ rather than in a separate JavaScript thread. However, the aggregation result may come to more than the 16MB that can be returned, in which case you would need to do more complex partitioning of the data set.
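To make the comparison more concrete, here is a rough sketch of what map and reduce functions for grouping profile ids by fieldValue might look like. The two functions follow the shape MongoDB's mapReduce expects (`this` is the current document and `emit` is supplied by the server), but `emit`, `runMapReduce` and the in-memory `profiles` array are hypothetical stand-ins so that the snippet runs as plain JavaScript without a server.

```javascript
// Sample documents from the question (names omitted for brevity).
const profiles = [
  { _id: 1, fieldValues: ["favouriteColour|red", "food|pizza", "food|chinese"] },
  { _id: 2, fieldValues: ["favouriteColour|blue", "food|pizza", "food|mexican", "pets|yes"] },
  { _id: 3, fieldValues: ["food|pizza"] },
];

// Hypothetical stand-in for the server-side emit(): collects values per key.
const emitted = new Map();
function emit(key, value) {
  if (!emitted.has(key)) emitted.set(key, []);
  emitted.get(key).push(value);
}

// map: called once per document, with `this` bound to that document.
function mapFn() {
  const id = this._id;
  this.fieldValues.forEach(function (fv) {
    emit(fv, { ids: [id] });
  });
}

// reduce: must return the same shape that map emits.
function reduceFn(key, values) {
  const ids = [];
  values.forEach(function (v) {
    v.ids.forEach(function (id) { ids.push(id); });
  });
  return { ids: ids };
}

// Hypothetical harness. Like MongoDB, it does not call reduce for a key
// that has only a single emitted value.
function runMapReduce(docs) {
  emitted.clear();
  docs.forEach(function (doc) { mapFn.call(doc); });
  const out = {};
  emitted.forEach(function (values, key) {
    out[key] = values.length > 1 ? reduceFn(key, values) : values[0];
  });
  return out;
}

const grouped = runMapReduce(profiles);
console.log(grouped["food|pizza"].ids); // [ 1, 2, 3 ]
```

Only `mapFn` and `reduceFn` would be handed to the server in a real run; the rest exists purely to show what they compute.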

But it seems like the problem is a lot simpler than this. You just want to find, for each profile, which other profiles share particular attributes with it. Without knowing the size of your dataset or your performance requirements, I'm going to assume that you have an index on fieldValues, so that querying on it is efficient. You can then get the results you want with this simple loop:

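> db.profiles.find().forEach(function (p) {
      print("Matching profiles for " + tojson(p));
      printjson(
          db.profiles.find({
              // profiles sharing at least one fieldValue with p
              "fieldValues": {"$in": p.fieldValues},
              // only higher _ids, so each pair is reported once
              "_id": {"$gt": p._id}
          }).toArray()
      );
  });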

Output:

Matching profiles for {
    "_id" : 1,
    "firstName" : "John",
    "lastName" : "Smith",
    "fieldValues" : [
        "favouriteColour|red",
        "food|pizza",
        "food|chinese"
    ]
}
[
    {
        "_id" : 2,
        "firstName" : "Sarah",
        "lastName" : "Jane",
        "fieldValues" : [
            "favouriteColour|blue",
            "food|pizza",
            "food|mexican",
            "pets|yes"
        ]
    },
    {
        "_id" : 3,
        "firstName" : "Rachel",
        "lastName" : "Jones",
        "fieldValues" : [
            "food|pizza"
        ]
    }
]
Matching profiles for {
    "_id" : 2,
    "firstName" : "Sarah",
    "lastName" : "Jane",
    "fieldValues" : [
        "favouriteColour|blue",
        "food|pizza",
        "food|mexican",
        "pets|yes"
    ]
}
[
    {
        "_id" : 3,
        "firstName" : "Rachel",
        "lastName" : "Jones",
        "fieldValues" : [
            "food|pizza"
        ]
    }
]
Matching profiles for {
    "_id" : 3,
    "firstName" : "Rachel",
    "lastName" : "Jones",
    "fieldValues" : [
        "food|pizza"
    ]
}
[ ]

Obviously you can tweak the query so that it does not exclude already matched-up profiles (by changing {$gt: p._id} to {$ne: p._id}), and make other tweaks. But I'm not sure what additional value you would get from using the aggregation framework or mapreduce, as this is not really aggregating a single collection on one of its fields (judging by the format of the output that you show). If your output format requirements are flexible, it's certainly possible that you could use one of the built-in aggregation options as well.
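The effect of that `_id` condition can be illustrated with a plain JavaScript imitation of the loop. This is only a sketch of the filter semantics, not of how the server evaluates the query: `sharesValue` and `matchPairs` are hypothetical helpers, and the nested in-memory scan stands in for the indexed `$in` lookup.

```javascript
// Sample documents from the question (names omitted for brevity).
const profiles = [
  { _id: 1, fieldValues: ["favouriteColour|red", "food|pizza", "food|chinese"] },
  { _id: 2, fieldValues: ["favouriteColour|blue", "food|pizza", "food|mexican", "pets|yes"] },
  { _id: 3, fieldValues: ["food|pizza"] },
];

// True when the two arrays share at least one value, which is what
// {"fieldValues": {"$in": [...]}} tests against an array field.
function sharesValue(a, b) {
  return a.some(function (v) { return b.indexOf(v) !== -1; });
}

// Hypothetical helper: idFilter imitates the "_id" condition of the query.
function matchPairs(docs, idFilter) {
  const pairs = [];
  docs.forEach(function (p) {
    docs.forEach(function (other) {
      if (idFilter(other._id, p._id) && sharesValue(other.fieldValues, p.fieldValues)) {
        pairs.push([p._id, other._id]);
      }
    });
  });
  return pairs;
}

// Imitates {"_id": {$gt: p._id}}: each matching pair is reported once.
const withGt = matchPairs(profiles, function (otherId, id) { return otherId > id; });
// Imitates {"_id": {$ne: p._id}}: each matching pair is reported from both sides.
const withNe = matchPairs(profiles, function (otherId, id) { return otherId !== id; });

console.log(JSON.stringify(withGt)); // [[1,2],[1,3],[2,3]]
console.log(withNe.length);          // 6
```

If the results are persisted as one document per pair, as in the question, the $gt form avoids storing every match twice.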

I did check what this would look like when aggregating around individual fieldValues, and it's not bad. It might help you if your output can match this:

> db.profiles.aggregate(
      {$unwind: "$fieldValues"},
      {$group: {
          _id: "$fieldValues",
          matchedProfiles: {$push: {
              id: "$_id",
              name: {$concat: ["$firstName", " ", "$lastName"]}
          }},
          num: {$sum: 1}
      }},
      {$match: {num: {$gt: 1}}}
  );
{
    "result" : [
        {
            "_id" : "food|pizza",
            "matchedProfiles" : [
                {
                    "id" : 1,
                    "name" : "John Smith"
                },
                {
                    "id" : 2,
                    "name" : "Sarah Jane"
                },
                {
                    "id" : 3,
                    "name" : "Rachel Jones"
                }
            ],
            "num" : 3
        }
    ],
    "ok" : 1
}

This basically says: "For each fieldValue ($unwind), group by fieldValue into an array of matching profile _ids and names, counting how many matches each fieldValue accumulates ($group), and then exclude the ones that have only one profile matching them ($match)."
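If the pair-per-document shape from the question is still wanted, the grouped output can be expanded on the client. The following is a minimal sketch under a few assumptions: `toPairs` and the `sharedValues` field are hypothetical names, the ids are assumed to be numeric (as in the sample data), and only the ids are carried over rather than the full profile documents.

```javascript
// The "result" array from the aggregation output above.
const result = [
  {
    _id: "food|pizza",
    matchedProfiles: [
      { id: 1, name: "John Smith" },
      { id: 2, name: "Sarah Jane" },
      { id: 3, name: "Rachel Jones" },
    ],
    num: 3,
  },
];

// Hypothetical helper: expands each group into unique pairs of profile ids
// and records which fieldValues each pair has in common.
function toPairs(groups) {
  const pairs = new Map();
  groups.forEach(function (group) {
    const ids = group.matchedProfiles
      .map(function (m) { return m.id; })
      .sort(function (a, b) { return a - b; }); // assumes numeric ids
    for (let i = 0; i < ids.length; i++) {
      for (let j = i + 1; j < ids.length; j++) {
        const key = ids[i] + ":" + ids[j];
        if (!pairs.has(key)) {
          pairs.set(key, { profiles: [ids[i], ids[j]], sharedValues: [] });
        }
        // A pair that shares several fieldValues is still stored only once.
        pairs.get(key).sharedValues.push(group._id);
      }
    }
  });
  return Array.from(pairs.values());
}

const pairDocs = toPairs(result);
console.log(pairDocs.length); // 3
```

Keying the map on the sorted id pair is what prevents duplicates when two profiles appear together in more than one group.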

Answer 2:

First, in distinguishing between the two: MongoDB's aggregation framework can be thought of as a more limited form of mapreduce that trades generality for a more straightforward interface. To my knowledge, the aggregation framework cannot do anything more than general mapreduce.

With that in mind, the question then becomes: is your transformation something that can be modeled in the aggregation framework, or do you need to fall back to the more powerful mapreduce?

If I understand what you're trying to do, I think it is feasible with the aggregation framework if you change your schema a bit. Schema design is one of the trickiest things with Mongo, and you need to take a lot of things into consideration when deciding how to structure your data. Despite knowing very little about your application, I'm going to go out on a limb and make a suggestion anyway.

Specifically, I'd suggest changing the way you structure your fieldValues array into something like this:

{
    "_id": 2,
    "firstName": "Sarah",
    "lastName": "Jane",
    "likes": {
        "colors": ["blue"],
        "foods": ["pizza", "mexican"],
        "pets": true
    }
}

That is, store the multi-valued attributes in an array. This would allow you to take advantage of the aggregation framework's $unwind operator. (See the example in the Mongo documentation.) But, depending on what you're trying to accomplish, this may or may not be appropriate.
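As a rough illustration of such a migration, the existing "key|value" strings could be grouped by key. `groupFieldValues` is a hypothetical helper, and it deliberately differs from the hand-written document above: keys are kept as they are (no renaming to "colors" or "foods") and every attribute becomes an array of strings, so "pets|yes" becomes ["yes"] rather than true.

```javascript
// Hypothetical converter: groups "key|value" strings by their key.
function groupFieldValues(fieldValues) {
  const likes = {};
  fieldValues.forEach(function (fv) {
    const sep = fv.indexOf("|");
    if (sep === -1) return; // skip strings without a separator
    const key = fv.slice(0, sep);
    const value = fv.slice(sep + 1);
    if (!Object.prototype.hasOwnProperty.call(likes, key)) likes[key] = [];
    likes[key].push(value);
  });
  return likes;
}

// Sarah's profile from the question.
const sarah = {
  _id: 2,
  firstName: "Sarah",
  lastName: "Jane",
  fieldValues: ["favouriteColour|blue", "food|pizza", "food|mexican", "pets|yes"],
};

console.log(JSON.stringify(groupFieldValues(sarah.fieldValues)));
// {"favouriteColour":["blue"],"food":["pizza","mexican"],"pets":["yes"]}
```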

Taking a step back, though, you may not find it appropriate to use the aggregation framework or Mongo's mapreduce function. Their use has performance implications, and it may not be a good idea to employ them for your application's core business logic. Generally, their intended use seems to be for infrequent or ad-hoc queries simply to gain insight into one's data. So, you may be better off starting with a "real" mapreduce framework. That said, I have heard cases where the aggregation framework is used in a cron job to create core business data on a regular basis.

Source: https://stackoverflow.com/questions/16310730/mongodb-use-aggregation-framework-or-mapreduce-for-matching-array-of-strings-w

Tags

mongodb

MapReduce

aggregation-framework