MapReduce

Creating more like this in RavenDB

有些话、适合烂在心里 submitted on 2019-12-10 18:58:06

Question: I have these documents in my domain:

    public class Article {
        public string Id { get; set; }
        // some other properties
        public IList<string> KeywordIds { get; set; }
    }

    public class Keyword {
        public string Id { get; set; }
        public string UrlName { get; set; }
        public string Title { get; set; }
        public string Tooltip { get; set; }
        public string Description { get; set; }
    }

I have this scenario: Article A1 has keyword K1; Article A2 has keyword K1; one user reads article A1. I want to suggest the user to read…
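RavenDB ships a MoreLikeThis feature for exactly this, but the core idea — two articles are related when their KeywordIds overlap — can be sketched in plain Python. The dict layout and function name below are illustrative, not RavenDB API:

```python
# Hypothetical in-memory sketch of "related articles": two articles are
# related when they share at least one keyword id.
def related_articles(articles, article_id):
    """articles: dict mapping article id -> set of keyword ids."""
    keywords = articles[article_id]
    return sorted(
        other for other, kws in articles.items()
        if other != article_id and kws & keywords  # non-empty intersection
    )

articles = {"A1": {"K1"}, "A2": {"K1"}, "A3": {"K2"}}
print(related_articles(articles, "A1"))  # A2 shares K1 with A1 -> ['A2']
```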

Spark RDD find by key

浪子不回头ぞ submitted on 2019-12-10 18:23:36

Question: I have an RDD transformed from HBase:

    val hbaseRDD: RDD[(String, Array[String])]

where tuple._1 is the row key and the array holds the values from HBase:

    4929101-ACTIVE, ["4929101","2015-05-20 10:02:44","dummy1","dummy2"]
    4929102-ACTIVE, ["4929102","2015-05-20 10:02:44","dummy1","dummy2"]
    4929103-ACTIVE, ["4929103","2015-05-20 10:02:44","dummy1","dummy2"]

I also have a SchemaRDD (id, date1, col1, col2, col3) transformed to

    val refDataRDD: RDD[(String, Array[String])]

which I will iterate over…
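On a real keyed RDD you could call `hbaseRDD.lookup(rowkey)` (a PairRDDFunctions method) or join the two RDDs by key rather than iterating. A plain-Python sketch of what lookup does, with no Spark dependency and the sample rows from the question:

```python
# Sketch of RDD.lookup semantics on a keyed collection: return every value
# stored under the given key.
hbase_rows = [
    ("4929101-ACTIVE", ["4929101", "2015-05-20 10:02:44", "dummy1", "dummy2"]),
    ("4929102-ACTIVE", ["4929102", "2015-05-20 10:02:44", "dummy1", "dummy2"]),
]

def lookup(pairs, key):
    return [value for k, value in pairs if k == key]

print(lookup(hbase_rows, "4929101-ACTIVE")[0][0])  # "4929101"
```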

Need a CouchDB trick to sort by date and filter by group

老子叫甜甜 submitted on 2019-12-10 18:14:03

Question: I have documents with the fields 'date' and 'group', and this is my view:

    byDateGroup: {
        map: function(doc) {
            if (doc.date && doc.group) {
                emit([doc.date, doc.group], null);
            }
        }
    }

What would be the equivalent of this query?

    select * from docs where group in ("group1", "group2") order by date desc

This simple solution is not coming to me. :(

Answer 1: Pankaj, switch the order of the key you're emitting to this:

    emit([doc.group, doc.date], doc);

Then you can pass in a start key and an end key…
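The answer's re-ordered key can be sanity-checked without CouchDB: with [group, date] keys, each group's rows are contiguous and date-sorted, so one descending range query per group covers the SQL's IN clause (CouchDB cannot combine an IN filter with a global date sort in a single view request). A sketch with made-up data:

```python
# Simulate a view keyed on [group, date]: rows sort by group, then date.
rows = [
    (["group1", "2019-01-02"], "doc-a"),
    (["group2", "2019-01-01"], "doc-b"),
    (["group1", "2019-01-03"], "doc-c"),
]

def by_group(rows, group):
    # Range-scan one group's contiguous slice of the sorted key space,
    # then reverse it to emulate descending=true.
    matched = [(key, doc) for key, doc in sorted(rows) if key[0] == group]
    return [doc for _, doc in reversed(matched)]

print(by_group(rows, "group1"))  # newest first: ['doc-c', 'doc-a']
```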

Raven DB: How to create “UniqueVisitorCount by date” index

血红的双手。 submitted on 2019-12-10 17:48:49

Question: I have an application that tracks the page visits for a website. Here's my model:

    public class VisitSession {
        public string SessionId { get; set; }
        public DateTime StartTime { get; set; }
        public string UniqueVisitorId { get; set; }
        public IList<PageVisit> PageVisits { get; set; }
    }

When a visitor goes to the website, a visit session starts; one visit session has many page visits. The tracker writes a UniqueVisitorId (GUID) cookie the first time a visitor goes to the website, so we are able…
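A hedged sketch of the computation the title asks for, outside RavenDB: map each session to its date and visitor id, then count distinct visitor ids per date. The data and field access are illustrative (plain dicts instead of the C# model):

```python
from collections import defaultdict

# "Unique visitors per date": collect visitor ids into a set per date,
# then the per-date count is the set's size.
sessions = [
    {"StartTime": "2019-12-01", "UniqueVisitorId": "v1"},
    {"StartTime": "2019-12-01", "UniqueVisitorId": "v1"},  # repeat visit, same day
    {"StartTime": "2019-12-01", "UniqueVisitorId": "v2"},
    {"StartTime": "2019-12-02", "UniqueVisitorId": "v1"},
]

def unique_visitor_count(sessions):
    visitors = defaultdict(set)
    for s in sessions:
        visitors[s["StartTime"]].add(s["UniqueVisitorId"])
    return {date: len(ids) for date, ids in visitors.items()}

print(unique_visitor_count(sessions))
# {'2019-12-01': 2, '2019-12-02': 1}
```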

How do I get the values from the counter after I processed all the records with Google AppEngine MapReduce?

杀马特。学长 韩版系。学妹 submitted on 2019-12-10 17:08:32

Question: How do I get the values from the counter after I have processed all the records with Google AppEngine MapReduce? Or am I missing the use case for counters here? Sample code is from http://code.google.com/p/appengine-mapreduce/wiki/UserGuidePython. How would I retrieve the value of counter counter1 when the mapreduce is done?

app.yaml:

    handlers:
    - url: /mapreduce(/.*)?
      script: mapreduce/main.py
      login: admin

mapreduce/main.py:

    from mapreduce import operation as op

    def process(entity):
        yield op.counters…
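What a counter does can be sketched framework-free: each mapper call yields increments and the framework sums them; in appengine-mapreduce the summed totals are read after completion from the finished job's stored state (the exact accessor varies by library version, so this sketch only shows the aggregation, not the real API):

```python
from collections import Counter

# Illustrative stand-in for the mapper: each entity increments "counter1".
def process(entity):
    yield ("counter1", 1)

# Illustrative stand-in for the framework's counter aggregation.
entities = ["e1", "e2", "e3"]
counters = Counter()
for entity in entities:
    for name, delta in process(entity):
        counters[name] += delta

print(counters["counter1"])  # 3
```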

Writing a simple group by with map-reduce (Couchbase)

徘徊边缘 submitted on 2019-12-10 16:39:51

Question: I'm new to the whole map-reduce concept, and I'm trying to write a simple map-reduce function. I'm currently working with Couchbase Server as my NoSQL DB. I want to get a list of all my types:

    key: 1, value: null
    key: 2, value: null
    key: 3, value: null

Here are my documents:

    { "type": "1", "value": "1" }
    { "type": "2", "value": "2" }
    { "type": "3", "value": "3" }
    { "type": "1", "value": "4" }

What I've been trying to do is write a map function:

    function (doc, meta) {
        emit(doc.type, 0);
    }
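That map function plus a counting reduce (Couchbase's built-in `_count`) queried with group=true yields one row per distinct type. A plain-Python sketch of the pipeline, using the sample documents from the question:

```python
from itertools import groupby

docs = [
    {"type": "1", "value": "1"},
    {"type": "2", "value": "2"},
    {"type": "3", "value": "3"},
    {"type": "1", "value": "4"},
]

# Map phase: emit (doc.type, 0); the view engine sorts rows by key.
mapped = sorted((doc["type"], 0) for doc in docs)

# Reduce phase with group=true: one output row per distinct key,
# here counting the rows under each key (like _count).
grouped = {key: len(list(rows)) for key, rows in groupby(mapped, key=lambda kv: kv[0])}
print(grouped)  # {'1': 2, '2': 1, '3': 1}
```

The distinct types the question wants are then just the keys of the grouped result.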

Mongo- Map reduce query working fine for single document, but not for all records

自闭症网瘾萝莉.ら submitted on 2019-12-10 15:44:33

Question: I am using the map-reduce framework to compute per-user stats, trying to find lastOrderDate and totalOrders for each user:

    db.order.mapReduce(
        function() {
            emit(this.customer, {count: 1, orderDate: this.orderDate.interval_start});
        },
        function(key, values) {
            var sum = 0;
            var lastOrderDate;
            values.forEach(function(value) {
                lastOrderDate = value['orderDate'];
                sum += value['count'];
            });
            return {totalOrder: sum, lastOrderDate: lastOrderDate};
        },
        {
            query: { status: "DELIVERED", "customer"…
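One likely culprit, hedged: MongoDB may call reduce repeatedly on partial results, so the reduce must return the same shape the map emits (here {count, orderDate}, not {totalOrder, lastOrderDate}), and "last assignment wins" does not reliably pick the latest date. Taking a max makes the reduce order-independent. A plain-Python sketch of such a reduce:

```python
# Order-independent reduce: same field names in and out, and the latest
# date survives no matter how the values are batched or ordered.
def reduce_orders(values):
    total = sum(v["count"] for v in values)
    last = max(v["orderDate"] for v in values)  # not "last element seen"
    return {"count": total, "orderDate": last}

values = [
    {"count": 1, "orderDate": "2019-01-05"},
    {"count": 1, "orderDate": "2019-03-02"},
    {"count": 1, "orderDate": "2019-02-11"},
]
print(reduce_orders(values))  # {'count': 3, 'orderDate': '2019-03-02'}
```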

Apache Pig: unable to run my own pig.jar and pig-withouthadoop.jar

試著忘記壹切 submitted on 2019-12-10 15:26:08

Question: I have a cluster running Hadoop 0.20.2 and Pig 0.10. I want to add some logging to Pig's source code and run my own Pig build on the cluster. What I did:

- built the project with the 'ant' command
- got pig.jar and pig-withouthadoop.jar
- copied the jars to the Pig home directory on the cluster's namenode
- ran a job

Then I got the following stdout:

    2013-03-25 06:35:05,226 [main] WARN org.apache.pig.backend.hadoop20.PigJobControl - falling back to default JobControl (not using hadoop 0.20 ?)

hadoop, python, subprocess failed with code 127

谁说胖子不能爱 submitted on 2019-12-10 15:18:04

Question: I'm trying to run a very simple task with mapreduce.

mapper.py:

    #!/usr/bin/env python
    import sys
    for line in sys.stdin:
        print line

my txt file:

    qwerty
    asdfgh
    zxc

Command line to run the job:

    hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.6.0-mr1-cdh5.8.0.jar \
        -input /user/cloudera/In/test.txt \
        -output /user/cloudera/test \
        -mapper /home/cloudera/Documents/map.py \
        -file /home/cloudera/Documents/map.py

Error:

    INFO mapreduce.Job: Task Id : attempt_1490617885665…
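Exit code 127 from a streaming mapper usually means the shell could not execute the script — commonly a missing execute bit (chmod +x on map.py) or Windows CRLF line endings that corrupt the "#!/usr/bin/env python" shebang. A quick local check for the CRLF case (an illustrative helper, not part of Hadoop):

```python
# A shebang line ending in \r\n makes the kernel look for an interpreter
# literally named "python\r", which does not exist -> exit code 127.
def has_crlf_shebang(first_line: bytes) -> bool:
    return first_line.startswith(b"#!") and first_line.endswith(b"\r\n")

print(has_crlf_shebang(b"#!/usr/bin/env python\r\n"))  # True: file needs dos2unix
print(has_crlf_shebang(b"#!/usr/bin/env python\n"))    # False: shebang is fine
```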

Job and Task Scheduling In Hadoop

折月煮酒 submitted on 2019-12-10 14:47:31

Question: I am a little confused about the terms "job scheduling" and "task scheduling" in Hadoop, which I ran into while reading about delayed fair scheduling in this slide. Please correct me if I am wrong in the following assumptions: The default scheduler, the Capacity scheduler, and the Fair scheduler only come into play at the job level, when multiple jobs are scheduled by a user; they play no role if there is only a single job in the system. These scheduling algorithms form the basis of "job scheduling". Each job can have multiple…