Identifying Duplicates in CouchDB

徘徊边缘 提交于 2019-12-07 04:37:31

问题


I'm new to CouchDB and document-oriented databases in general.

I've been playing around with CouchDB, and was able to get familiar with creating documents (with perl) and using the Map/Reduce functions in Futon to query the data and create views.

One of the things I'm still trying to figure out is how to identify duplicate values across documents using Futon's Map/Reduce.

For example, if I have the following documents:

{
  "_id": "123",
  "name": "carl",
  "timestamp": "2012-01-27T17:06:03Z"
}

{
  "_id": "124",
  "name": "carl",
  "timestamp": "2012-01-27T17:07:03Z"
}

And I wanted to get a list of document id's that had duplicate "name" values, is this something I could do with the Futon Map/Reduce?

The result was hoping to achieve is as follows:

{
  "name": "carl",
  "dupes": [ "123", "124" ]
}

..or..

{
  "carl": [ "123", "124" ]
}

.. which would be the value, and associated document ids which contain those duplicate values.

I've tried a few different things with Map/Reduce, but so far as I understand, the Map function works with data on a per-document basis, and the Reduce functions only allow you to work with the keys/values from a given document.

I know i could just pull the data I need with perl, work magic there, and get the result I want, but I'm trying to work only with CouchDB for now in order to better understand it's benefits / limitations.

Another way I'm thinking about doing this is to use a single document like an RDBMS table:

{
  "_id": "names",
  "rec1": {
    "_id": "123",
    "name": "carl",
    "timestamp": "2012-01-27T17:06:03Z"
  },
  "rec2": {
    "_id": "124",
    "name": "carl",
    "timestamp": "2012-01-27T17:07:03Z"
  }
}

.. which should allow me to use the Map/Reduce functions in the way I originally thought. However I'm not sure if this is ideal.

I understand that my mind is still stuck in RDBMS land, so much of what I'm trying to do above may not be necessary. Any insight on this would be much appreciated.

Thanks!

Edit: Fixed JSON syntax in some of the examples.


回答1:


If you merely want a list of unique values, that's pretty easy. If you wish to identify the duplicates, then it gets less easy.

In both cases, a map function like this should suffice:

function (doc) {
   emit(doc.name);
}

For your reduce function, just enter _count.

Your view output will look like: (based on your 2 documents)

{
    "rows": [
        { "key": "carl", "value": 2 }
    ]
}

From there, you will have a list of names as well as their frequency. You can take that list and filter it yourself, or you can take the "all couch" route and use a _list function to perform that final filtering.

function (head, req) {
    var row, duplicates = [];
    while (row = getRow()) {
        if (row.value > 1) {
            duplicates.push(row);
        }
    }
    send(JSON.stringify(duplicates));
}

Read up about _list functions, they're pretty handy and versatile.



来源:https://stackoverflow.com/questions/9038401/identifying-duplicates-in-couchdb

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!