MapReduce

Creating more like this in RavenDB

有些话、适合烂在心里 submitted on 2019-12-10 18:58:06

Question: I have these documents in my domain:

    public class Article {
        public string Id { get; set; }
        // some other properties
        public IList<string> KeywordIds { get; set; }
    }

    public class Keyword {
        public string Id { get; set; }
        public string UrlName { get; set; }
        public string Title { get; set; }
        public string Tooltip { get; set; }
        public string Description { get; set; }
    }

I have this scenario: Article A1 has keyword K1; Article A2 has keyword K1; one user reads article A1. I want to suggest the user to read…
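RavenDB ships a MoreLikeThis feature for exactly this, but the core idea — two articles are related when their KeywordIds overlap — can be sketched in plain Python. The dict layout and function name below are illustrative, not RavenDB API:

```python
# Hypothetical in-memory sketch of "related articles": two articles are
# related when they share at least one keyword id.
def related_articles(articles, article_id):
    """articles: dict mapping article id -> set of keyword ids."""
    keywords = articles[article_id]
    return sorted(
        other for other, kws in articles.items()
        if other != article_id and kws & keywords  # non-empty intersection
    )

articles = {"A1": {"K1"}, "A2": {"K1"}, "A3": {"K2"}}
print(related_articles(articles, "A1"))  # A2 shares K1 with A1 -> ['A2']
```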

Spark RDD find by key

浪子不回头ぞ submitted on 2019-12-10 18:23:36

Question: I have an RDD transformed from HBase:

    val hbaseRDD: RDD[(String, Array[String])]

where tuple._1 is the row key and the array holds the values from HBase:

    4929101-ACTIVE, ["4929101","2015-05-20 10:02:44","dummy1","dummy2"]
    4929102-ACTIVE, ["4929102","2015-05-20 10:02:44","dummy1","dummy2"]
    4929103-ACTIVE, ["4929103","2015-05-20 10:02:44","dummy1","dummy2"]

I also have a SchemaRDD (id, date1, col1, col2, col3) transformed to

    val refDataRDD: RDD[(String, Array[String])]

which I will iterate over…
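On a real keyed RDD you could call `hbaseRDD.lookup(rowkey)` (a PairRDDFunctions method) or join the two RDDs by key rather than iterating. A plain-Python sketch of what lookup does, with no Spark dependency and the sample rows from the question:

```python
# Sketch of RDD.lookup semantics on a keyed collection: return every value
# stored under the given key.
hbase_rows = [
    ("4929101-ACTIVE", ["4929101", "2015-05-20 10:02:44", "dummy1", "dummy2"]),
    ("4929102-ACTIVE", ["4929102", "2015-05-20 10:02:44", "dummy1", "dummy2"]),
]

def lookup(pairs, key):
    return [value for k, value in pairs if k == key]

print(lookup(hbase_rows, "4929101-ACTIVE")[0][0])  # "4929101"
```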

Need a CouchDB trick to sort by date and filter by group

老子叫甜甜 submitted on 2019-12-10 18:14:03

Question: I have documents with the fields 'date' and 'group', and this is my view:

    byDateGroup: {
        map: function(doc) {
            if (doc.date && doc.group) {
                emit([doc.date, doc.group], null);
            }
        }
    }

What would be the equivalent of this query?

    select * from docs where group in ("group1", "group2") order by date desc

This simple solution is not coming to me. :(

Answer 1: Pankaj, switch the order of the key you're emitting to this:

    emit([doc.group, doc.date], doc);

Then you can pass in a start key and an end key…
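The answer's re-ordered key can be sanity-checked without CouchDB: with [group, date] keys, each group's rows are contiguous and date-sorted, so one descending range query per group covers the SQL's IN clause (CouchDB cannot combine an IN filter with a global date sort in a single view request). A sketch with made-up data:

```python
# Simulate a view keyed on [group, date]: rows sort by group, then date.
rows = [
    (["group1", "2019-01-02"], "doc-a"),
    (["group2", "2019-01-01"], "doc-b"),
    (["group1", "2019-01-03"], "doc-c"),
]

def by_group(rows, group):
    # Range-scan one group's contiguous slice of the sorted key space,
    # then reverse it to emulate descending=true.
    matched = [(key, doc) for key, doc in sorted(rows) if key[0] == group]
    return [doc for _, doc in reversed(matched)]

print(by_group(rows, "group1"))  # newest first: ['doc-c', 'doc-a']
```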

Raven DB: How to create “UniqueVisitorCount by date” index

血红的双手。 submitted on 2019-12-10 17:48:49

Question: I have an application that tracks the page visits for a website. Here's my model:

    public class VisitSession {
        public string SessionId { get; set; }
        public DateTime StartTime { get; set; }
        public string UniqueVisitorId { get; set; }
        public IList<PageVisit> PageVisits { get; set; }
    }

When a visitor goes to the website, a visit session starts; one visit session has many page visits. The tracker writes a UniqueVisitorId (GUID) cookie the first time a visitor goes to the website, so we are able…
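A hedged sketch of the computation the title asks for, outside RavenDB: map each session to its date and visitor id, then count distinct visitor ids per date. The data and field access are illustrative (plain dicts instead of the C# model):

```python
from collections import defaultdict

# "Unique visitors per date": collect visitor ids into a set per date,
# then the per-date count is the set's size.
sessions = [
    {"StartTime": "2019-12-01", "UniqueVisitorId": "v1"},
    {"StartTime": "2019-12-01", "UniqueVisitorId": "v1"},  # repeat visit, same day
    {"StartTime": "2019-12-01", "UniqueVisitorId": "v2"},
    {"StartTime": "2019-12-02", "UniqueVisitorId": "v1"},
]

def unique_visitor_count(sessions):
    visitors = defaultdict(set)
    for s in sessions:
        visitors[s["StartTime"]].add(s["UniqueVisitorId"])
    return {date: len(ids) for date, ids in visitors.items()}

print(unique_visitor_count(sessions))
# {'2019-12-01': 2, '2019-12-02': 1}
```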

How do I get the values from the counter after I processed all the records with Google AppEngine MapReduce?

杀马特。学长 韩版系。学妹 submitted on 2019-12-10 17:08:32

Question: How do I get the values from the counter after I have processed all the records with Google AppEngine MapReduce? Or am I missing the use case for counters here? Sample code is from http://code.google.com/p/appengine-mapreduce/wiki/UserGuidePython. How would I retrieve the value of counter counter1 when the mapreduce is done?

app.yaml:

    handlers:
    - url: /mapreduce(/.*)?
      script: mapreduce/main.py
      login: admin

mapreduce/main.py:

    from mapreduce import operation as op

    def process(entity):
        yield op.counters…
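What a counter does can be sketched framework-free: each mapper call yields increments and the framework sums them; in appengine-mapreduce the summed totals are read after completion from the finished job's stored state (the exact accessor varies by library version, so this sketch only shows the aggregation, not the real API):

```python
from collections import Counter

# Illustrative stand-in for the mapper: each entity increments "counter1".
def process(entity):
    yield ("counter1", 1)

# Illustrative stand-in for the framework's counter aggregation.
entities = ["e1", "e2", "e3"]
counters = Counter()
for entity in entities:
    for name, delta in process(entity):
        counters[name] += delta

print(counters["counter1"])  # 3
```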

Writing a simple group by with map-reduce (Couchbase)

徘徊边缘 submitted on 2019-12-10 16:39:51

Question: I'm new to the whole map-reduce concept, and I'm trying to write a simple map-reduce function. I'm currently working with Couchbase Server as my NoSQL DB. I want to get a list of all my types:

    key: 1, value: null
    key: 2, value: null
    key: 3, value: null

Here are my documents:

    { "type": "1", "value": "1" }
    { "type": "2", "value": "2" }
    { "type": "3", "value": "3" }
    { "type": "1", "value": "4" }

What I've been trying to do is write a map function:

    function (doc, meta) {
        emit(doc.type, 0);
    }
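That map function plus a counting reduce (Couchbase's built-in `_count`) queried with group=true yields one row per distinct type. A plain-Python sketch of the pipeline, using the sample documents from the question:

```python
from itertools import groupby

docs = [
    {"type": "1", "value": "1"},
    {"type": "2", "value": "2"},
    {"type": "3", "value": "3"},
    {"type": "1", "value": "4"},
]

# Map phase: emit (doc.type, 0); the view engine sorts rows by key.
mapped = sorted((doc["type"], 0) for doc in docs)

# Reduce phase with group=true: one output row per distinct key,
# here counting the rows under each key (like _count).
grouped = {key: len(list(rows)) for key, rows in groupby(mapped, key=lambda kv: kv[0])}
print(grouped)  # {'1': 2, '2': 1, '3': 1}
```

The distinct types the question wants are then just the keys of the grouped result.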

Mongo- Map reduce query working fine for single document, but not for all records

自闭症网瘾萝莉.ら submitted on 2019-12-10 15:44:33

Question: I am using the map-reduce framework to compute per-user stats, trying to find lastOrderDate and totalOrders for each user:

    db.order.mapReduce(
        function() {
            emit(this.customer, {count: 1, orderDate: this.orderDate.interval_start});
        },
        function(key, values) {
            var sum = 0;
            var lastOrderDate;
            values.forEach(function(value) {
                lastOrderDate = value['orderDate'];
                sum += value['count'];
            });
            return {totalOrder: sum, lastOrderDate: lastOrderDate};
        },
        {
            query: { status: "DELIVERED", "customer"…
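One likely culprit, hedged: MongoDB may call reduce repeatedly on partial results, so the reduce must return the same shape the map emits (here {count, orderDate}, not {totalOrder, lastOrderDate}), and "last assignment wins" does not reliably pick the latest date. Taking a max makes the reduce order-independent. A plain-Python sketch of such a reduce:

```python
# Order-independent reduce: same field names in and out, and the latest
# date survives no matter how the values are batched or ordered.
def reduce_orders(values):
    total = sum(v["count"] for v in values)
    last = max(v["orderDate"] for v in values)  # not "last element seen"
    return {"count": total, "orderDate": last}

values = [
    {"count": 1, "orderDate": "2019-01-05"},
    {"count": 1, "orderDate": "2019-03-02"},
    {"count": 1, "orderDate": "2019-02-11"},
]
print(reduce_orders(values))  # {'count': 3, 'orderDate': '2019-03-02'}
```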

Apache Pig: unable to run my own pig.jar and pig-withouthadoop.jar

試著忘記壹切 submitted on 2019-12-10 15:26:08

Question: I have a cluster running Hadoop 0.20.2 and Pig 0.10. I want to add some logging to Pig's source code and run my own Pig build on the cluster. What I did:

- built the project with the 'ant' command
- got pig.jar and pig-withouthadoop.jar
- copied the jars to the Pig home directory on the cluster's namenode
- ran a job

Then I got the following stdout:

    2013-03-25 06:35:05,226 [main] WARN org.apache.pig.backend.hadoop20.PigJobControl - falling back to default JobControl (not using hadoop 0.20 ?)

hadoop, python, subprocess failed with code 127

谁说胖子不能爱 submitted on 2019-12-10 15:18:04

Question: I'm trying to run a very simple task with mapreduce.

mapper.py:

    #!/usr/bin/env python
    import sys
    for line in sys.stdin:
        print line

my txt file:

    qwerty
    asdfgh
    zxc

Command line to run the job:

    hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.6.0-mr1-cdh5.8.0.jar \
        -input /user/cloudera/In/test.txt \
        -output /user/cloudera/test \
        -mapper /home/cloudera/Documents/map.py \
        -file /home/cloudera/Documents/map.py

Error:

    INFO mapreduce.Job: Task Id : attempt_1490617885665…
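Exit code 127 from a streaming mapper usually means the shell could not execute the script — commonly a missing execute bit (chmod +x on map.py) or Windows CRLF line endings that corrupt the "#!/usr/bin/env python" shebang. A quick local check for the CRLF case (an illustrative helper, not part of Hadoop):

```python
# A shebang line ending in \r\n makes the kernel look for an interpreter
# literally named "python\r", which does not exist -> exit code 127.
def has_crlf_shebang(first_line: bytes) -> bool:
    return first_line.startswith(b"#!") and first_line.endswith(b"\r\n")

print(has_crlf_shebang(b"#!/usr/bin/env python\r\n"))  # True: file needs dos2unix
print(has_crlf_shebang(b"#!/usr/bin/env python\n"))    # False: shebang is fine
```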

Job and Task Scheduling In Hadoop

折月煮酒 submitted on 2019-12-10 14:47:31

Question: I am a little confused about the terms "job scheduling" and "task scheduling" in Hadoop, which I ran into while reading about delayed fair scheduling in this slide. Please correct me if I am wrong in the following assumptions: The default scheduler, the Capacity scheduler, and the Fair scheduler only come into play at the job level, when multiple jobs are scheduled by a user; they play no role if there is only a single job in the system. These scheduling algorithms form the basis of "job scheduling". Each job can have multiple…