Is it advisable to use MapReduce to 'flatten' irregular entities in CouchDB?

Question

In a question on CouchDB I asked previously (Can you implement document joins using CouchDB 2.0 'Mango'?), the answer mentioned creating domain objects instead of storing relational data in Couch.

My use case, however, is not necessarily to store relational data in Couch but to flatten relational data. For example, I have the entity of Invoice that I collect from several suppliers. So I have two different schemas for that entity.

So I might end up with 2 docs in Couch that look like this:

{
    "type": "Invoice",
    "subType": "supplier B",
    "total": 22.5,
    "date": "10 Jan 2017",
    "customerName": "me"
}

{
    "type": "Invoice",
    "subType": "supplier A",
    "InvoiceTotal": 10.2,
    "OrderDate": <some other date format>,
    "customerName": "me"
}

I also have a doc like this:

{
    "type": "Customer",
    "name": "me",
    "details": "etc..."
}

My intention then is to 'flatten' the Invoice entities, and then do the join in the reduce function. So, the map function looks like this:

function (doc) {
    switch (doc.type) {
        case 'Customer':
            // Customer docs store the name under "name", not "customerName"
            emit(doc.name, { /* doc information ... */ type: "Customer" });
            break;
        case 'Invoice':
            // Map each supplier's field names onto one common shape
            switch (doc.subType) {
                case 'supplier B':
                    emit(doc.customerName, { total: doc.total, date: doc.date, type: "Invoice" });
                    break;
                case 'supplier A':
                    emit(doc.customerName, { total: doc.InvoiceTotal, date: doc.OrderDate, type: "Invoice" });
                    break;
            }
            break;
    }
}

Then I would use the reduce function to compare docs with the same customerName (i.e. a join).

Is this advisable using CouchDB? If not, why?

Answer 1:

First of all, apologies for getting back to you late. I thought I'd look at it right away, but I haven't been on SO since we exchanged the other day.

Reduce functions should only be used to reduce rows to scalar values, not to collect or combine documents. So you wouldn't use them for things like joins or removing duplicates, but you would use them, for example, to compute the number of invoices per customer. The reason is that you can only make weak assumptions about how your reduce function is called (the order in which rows are passed, whether rereduce is set, etc.), so you can easily end up with serious performance problems.

But this is by design, since the intended use of reduce functions is to reduce scalar values. An easy way to think about it: no filtering should ever happen in a reduce function. Filtering, and things such as checking keys, should be done in map.
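To make the "scalar values only" point concrete, here is a minimal sketch of a custom reduce that counts invoices per customer and handles the rereduce case. The function name countInvoices and the sample rows are made up for illustration, and in a real design document the built-in _count would do the same job:

```javascript
// Custom reduce that counts rows. CouchDB calls it with rereduce=false on
// rows emitted by map, and with rereduce=true on earlier results of itself.
function countInvoices(keys, values, rereduce) {
  if (rereduce) {
    // values are partial counts from earlier reduce calls: add them up
    return values.reduce(function (a, b) { return a + b; }, 0);
  }
  // values are the raw values emitted by map: one per invoice
  return values.length;
}

// Simulate CouchDB reducing two batches separately and then combining them.
// In a first-pass reduce, each entry of keys is a [key, docId] pair (made-up ids here).
var batch1 = countInvoices([["me", "id1"], ["me", "id2"]], [22.5, 10.2], false);
var batch2 = countInvoices([["me", "id3"]], [5.0], false);
var total = countInvoices(null, [batch1, batch2], true);
```

The result is a single number no matter how CouchDB splits the rows into batches, which is what keeps this kind of reduce cheap.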

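If you just want to compare docs with the same customer name, you do not need a reduce function at all. You can query your view with the following parameters (with the actual customer name in place of "customerName", and assuming the view emits array keys that start with that name):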

startkey=["customerName"]
endkey=["customerName", {}]

Otherwise, you may want to create a separate view that filters on customers first and returns their names, and then use these names to query your view in bulk with the keys view parameter. startkey/endkey is good if you only want to filter on one customer at a time and/or need to match complex keys partially.
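As a rough sketch of the two query styles, the helpers below build the query string for a startkey/endkey range and the JSON body for a bulk keys lookup (sent as a POST to the view). The helper names are hypothetical. Note that keys matches whole keys exactly, so the bulk helper assumes a view that emits the plain customer name as its key, while the range helper assumes array keys that start with the name:

```javascript
// Query string selecting every row whose array key starts with customerName.
// {} sorts after strings and numbers in CouchDB collation, so it closes the range.
function rangeQuery(customerName) {
  var startkey = JSON.stringify([customerName]);
  var endkey = JSON.stringify([customerName, {}]);
  return "startkey=" + encodeURIComponent(startkey) +
         "&endkey=" + encodeURIComponent(endkey);
}

// JSON body for a bulk lookup: POST it to the view to fetch several keys at once.
// Assumes the view emits the plain customer name as its key.
function bulkKeysBody(customerNames) {
  return JSON.stringify({ keys: customerNames });
}

var query = rangeQuery("me");
var body = bulkKeysBody(["me", "someone else"]);
```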

If what you are after are the numbers, you may want to do:

function (doc) {
    if (doc.type === "Invoice") {
        emit([doc.customerName, doc.supplierName, doc.date], doc.amount);
    }
}

And then use the built-in _stats reduce function to get statistics on the amount (sum, count, min, max, and sum of squares).

To get the amount spent with a supplier, you'd then just make a reduce query to your view with the parameter group_level=2, which aggregates by the first 2 elements of the key. You can combine this with startkey and endkey to filter on specific values of this key:

startkey=["name1", "supplierA"]
endkey=["name1", "supplierA", {}]
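If it helps to see what group_level=2 does with keys of the form [customerName, supplierName, date], here is a small simulation in plain JavaScript. It is only a sketch: the sample rows and the groupStats helper are made up, and CouchDB computes this inside the view index rather than like this:

```javascript
// Rows as the map function might emit them: key = [customer, supplier, date].
// The values are invented sample amounts.
var sampleRows = [
  { key: ["name1", "supplierA", "2017-01-10"], value: 10 },
  { key: ["name1", "supplierA", "2017-01-12"], value: 5 },
  { key: ["name1", "supplierB", "2017-01-11"], value: 20 }
];

// Group rows by the first `level` key elements and compute the same
// fields that the built-in _stats reduce returns.
function groupStats(rows, level) {
  var groups = {};
  rows.forEach(function (row) {
    var groupKey = JSON.stringify(row.key.slice(0, level));
    var g = groups[groupKey];
    if (!g) {
      g = groups[groupKey] = { sum: 0, count: 0, min: Infinity, max: -Infinity, sumsqr: 0 };
    }
    g.sum += row.value;
    g.count += 1;
    g.min = Math.min(g.min, row.value);
    g.max = Math.max(g.max, row.value);
    g.sumsqr += row.value * row.value;
  });
  return groups;
}

var byCustomerAndSupplier = groupStats(sampleRows, 2);
```

With level 2 the two supplierA rows collapse into one group and supplierB gets its own; with level 1 all three rows would end up in a single group for name1.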

You can then build on this example to do things such as:

function (doc) {
    if (doc.type === "Invoice") {
        emit(["BY_DATE", doc.customerName, doc.date], doc.amount);
        emit(["BY_SUPPLIER", doc.customerName, doc.supplierName], doc.amount);
        emit(["BY_SUPPLIER_AND_DATE", doc.customerName, doc.supplierName, doc.date], doc.amount);
    }
}

Hope this helps

Answer 2:

It is totally OK to "normalize" your different schemas (or subTypes) via a view. You cannot build further views on top of those normalized rows, though, and in the long run it might be hard to manage different schemas.

The better solution might be to normalize the documents before writing them to CouchDB. If you still need the documents in their original schema, you can add a sub-property original where you store your documents in their original form. This would make working on data much easier:

{
  "type": "Invoice",
  "total": 22.5,
  "date": "2017-01-10T00:00:00.000Z",
  "customerName": "me",
  "original": {
    "supplier": "supplier B",
    "total": 22.5,
    "date": "10 Jan 2017",
    "customerName": "me"
  }
},

{
  "type": "Invoice",
  "total": 10.2,
  "date": "2017-01-12T00:00:00.000Z",
  "customerName": "me",
  "original": {
    "subType": "supplier A",
    "InvoiceTotal": 10.2,
    "OrderDate": <some other date format>,
    "customerName": "me"
  }
}
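A minimal sketch of such a normalization step, run in your application before the document is written to CouchDB, could look like the following. The function names are hypothetical. Supplier A's date format is not given, so its parser is left as a caller-supplied hook rather than guessed:

```javascript
// Month abbreviations used by the "10 Jan 2017" format in supplier B's documents.
var MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"];

// Parse "10 Jan 2017" into an ISO 8601 string at UTC midnight.
function parseDayMonthYear(text) {
  var parts = text.trim().split(/\s+/);
  var month = MONTHS.indexOf(parts[1]);
  if (parts.length !== 3 || month < 0) throw new Error("Unrecognized date: " + text);
  return new Date(Date.UTC(Number(parts[2]), month, Number(parts[0]))).toISOString();
}

// Map a supplier-specific invoice onto the common schema, keeping the
// untouched source document under "original".
// parseSupplierADate is a hook the caller provides for supplier A's format.
function normalizeInvoice(raw, parseSupplierADate) {
  switch (raw.subType) {
    case "supplier B":
      return { type: "Invoice", total: raw.total, date: parseDayMonthYear(raw.date),
               customerName: raw.customerName, original: raw };
    case "supplier A":
      return { type: "Invoice", total: raw.InvoiceTotal, date: parseSupplierADate(raw.OrderDate),
               customerName: raw.customerName, original: raw };
    default:
      throw new Error("Unknown invoice subType: " + raw.subType);
  }
}

var normalizedB = normalizeInvoice({
  type: "Invoice", subType: "supplier B", total: 22.5,
  date: "10 Jan 2017", customerName: "me"
});
```

Once every invoice has the same shape, the views no longer need a switch per supplier.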

I'd also convert the date to ISO format, because it parses well with new Date(), sorts correctly, and is human-readable. With that, you can easily emit invoices grouped by year, month, day, and so on.
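For example, a map function along these lines would emit [year, month, day] keys from the ISO date. This is a sketch that assumes the normalized documents shown above. The emit stub and the emitted array exist only so the function can be run outside CouchDB, where emit is provided for you:

```javascript
// Stand-in for CouchDB's emit(), collecting rows so the map can run standalone.
var emitted = [];
function emit(key, value) {
  emitted.push({ key: key, value: value });
}

// Map function as it could appear in a design document.
function mapByDate(doc) {
  if (doc.type === "Invoice" && doc.date) {
    var d = new Date(doc.date);
    // UTC getters keep the key independent of the server's time zone
    emit([d.getUTCFullYear(), d.getUTCMonth() + 1, d.getUTCDate()], doc.total);
  }
}

mapByDate({ type: "Invoice", total: 22.5, date: "2017-01-10T00:00:00.000Z", customerName: "me" });
mapByDate({ type: "Customer", name: "me" }); // ignored: not an invoice
```

Combined with a built-in reduce such as _sum or _stats, group_level=1 would then aggregate per year, group_level=2 per month, and group_level=3 per day.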

Preferably use reduce only with the built-in functions, because reduces have to be re-executed on queries, and executing JavaScript over many documents is a complex and time-intensive operation, even if the database has not changed at all. You can find more information about the reduce process in the CouchDB documentation. It makes more sense to preprocess the documents as much as you can before storing them in CouchDB.

Source: https://stackoverflow.com/questions/43597244/is-it-advisable-to-use-mapreduce-to-flatten-irregular-entities-in-couchdb

Tags

join

MapReduce

couchdb