How long should this Riak Map Reduce query take?

Submitted by 妖精的绣舞 on 2019-12-24 15:13:11

Question


I have a bucket with approximately 900,000 records. The majority of those records have a status of PERSISTED in a secondary index. I want to retrieve all base_urls and a count of how many documents belong to each base_url for all docs that are marked PERSISTED.

Here is the query:

curl -X POST -H "content-type: application/json" \
    http://localhost:8098/mapred?chunked=true --data @-<<\EOF
{
    "timeout":600000,
    "inputs":{
       "bucket":"test-bucket",
       "index":"status_bin",
       "key":"PERSISTED"
    },
    "query":[{
        "map":{
            "language":"javascript",
            "source":"
                function(value, keyData, arg) {
                    var data = Riak.mapValuesJson(value)[0];
                    var obj = {};
                    obj[data.base_url] = 1;
                    return [obj];
                }
            "
        }
    },
    {
        "reduce":{
            "language":"javascript",
            "source":"
                function(values, arg){ 
                    return [values.reduce(
                        function(acc, item){
                            for(var base_url in item){
                                if(acc[base_url]) {
                                    // add the incoming count, not a literal 1:
                                    // Riak may re-reduce this phase's own output,
                                    // so item[base_url] can already be > 1
                                    acc[base_url] = acc[base_url] + item[base_url];
                                } else {
                                    acc[base_url] = item[base_url];
                                }
                            }
                            return acc;
                        }, {})
                    ];
                }
            "
        }
    }]
}

EOF

This is timing out after 10 minutes.

I am on a 16-core, 3 GHz AWS node with 20 GB of memory.

Is there something that I am possibly doing wrong, either with my configuration or with the above query?

Should it really take this long?

To give perspective, the equivalent query in MySQL would look something like this:

SELECT COUNT(*), catalog FROM urls GROUP BY catalog;

I have not tried it, but I suspect that in MySQL the above query over 900,000 records would return in a few seconds. I do not mean to compare Riak to MySQL, since I realize they are very different, but I am wondering how I can execute the above query in under 10 minutes at the very least.

Thanks!


Answer 1


JavaScript MapReduce jobs in Riak use a pool of SpiderMonkey JavaScript VMs, and it is important to tune the size of this pool depending on your usage pattern in order to avoid, or at least reduce, contention. The size of the pool is specified through the 'map_js_vm_count' and 'reduce_js_vm_count' parameters in the app.config file.

As you are running on a single node and have only a single map phase, I would recommend setting the 'map_js_vm_count' parameter to the size of your ring, which by default is 64. A more in-depth description can be found here.
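For example, the VM pool sizes live in the riak_kv section of app.config; a fragment along these lines would size the map pool to the default ring of 64 (the values here are illustrative, not tuned recommendations):

```erlang
%% app.config — JavaScript VM pools used by MapReduce phases
{riak_kv, [
    %% one VM per partition so all 64 map workers can run without contention
    {map_js_vm_count, 64},

    %% headroom for reduce phases; raise further if pre-reduce is enabled
    {reduce_js_vm_count, 32}
]}
```

Riak must be restarted for these settings to take effect.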

While map phase processing scales easily and runs in parallel, a central reduce phase can easily become the bottleneck, as it runs recursively on a single node. This can be addressed by passing a parameter to the map phase to enable pre-reduce and by increasing the reduce phase batch size, as described here. Enabling pre-reduce allows the first iteration of the reduce phase to run in parallel, which will most likely increase the efficiency of your job. You will, however, need to increase the number of VMs available to reduce phase functions by raising the 'reduce_js_vm_count' parameter quite a bit.
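Assuming the 'do_prereduce' map-phase argument and 'reduce_phase_batch_size' reduce-phase argument described in the linked post (verify the exact names against your Riak version), the job above might be adjusted along these lines:

```json
{
    "timeout":600000,
    "inputs":{ "bucket":"test-bucket", "index":"status_bin", "key":"PERSISTED" },
    "query":[{
        "map":{
            "language":"javascript",
            "source":"...same map function as above...",
            "arg":{ "do_prereduce":true }
        }
    },
    {
        "reduce":{
            "language":"javascript",
            "source":"...same reduce function as above...",
            "arg":{ "reduce_phase_batch_size":1000 }
        }
    }]
}
```

Note that with pre-reduce enabled, the reduce function will see partially aggregated counts as input, so it must add the incoming count rather than a constant 1.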

If you run large MapReduce jobs concurrently, the number of JavaScript VMs required to support them can become quite large. Converting map and reduce phase functions into Erlang is generally encouraged, as it eliminates JS VM contention and also performs better due to less VM-related overhead. This is always recommended for MapReduce jobs that you intend to run on a regular basis.
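As a rough, untested sketch of what the Erlang equivalents of the two phases might look like (this assumes the stored values are JSON and that mochijson2 is available on the Riak nodes, as it ships with Riak; function and module names are hypothetical):

```erlang
%% Map phase: decode the JSON value and emit a one-entry dict {BaseUrl => 1}
map_base_url(Value, _KeyData, _Arg) ->
    {struct, Fields} = mochijson2:decode(riak_object:get_value(Value)),
    BaseUrl = proplists:get_value(<<"base_url">>, Fields),
    [dict:from_list([{BaseUrl, 1}])].

%% Reduce phase: merge the per-object dicts, summing counts per base_url.
%% Safe to re-reduce, since merged counts are added rather than reset.
reduce_count_by_base_url(Dicts, _Arg) ->
    [lists:foldl(fun(D, Acc) ->
                     dict:merge(fun(_Key, A, B) -> A + B end, D, Acc)
                 end, dict:new(), Dicts)].
```

Compiled into a module on each node, these would be referenced in the query with "language":"erlang" plus the module and function names instead of a "source" string.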



Source: https://stackoverflow.com/questions/15938179/how-long-should-this-riak-map-reduce-query-take
