MapReduce

Convert a sequence file and get key/value pairs via map and reduce tasks in Hadoop

Submitted by 拈花ヽ惹草 on 2019-12-12 10:22:10
Question: I want to get all key/value pairs from a sequence file via a Hadoop MapReduce application. I followed this post, http://lintool.github.com/Cloud9/docs/content/staging-records.html, for reading the sequence file in the main class, but that didn't work. I want to print all key/value pairs to a normal text file in HDFS; how can I achieve that? I wrote my code as below.

    import java.io.File;
    import java.io.IOException;
    import java.util.*;
    import java.util.logging.Level;
    import java.util
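A minimal sketch of one way to do this, assuming the file's key and value classes implement Writable and that the two program arguments are the input sequence file and the text output path (the class name SequenceFileDump and the argument layout are illustrative):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.util.ReflectionUtils;

    public class SequenceFileDump {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            SequenceFile.Reader reader =
                new SequenceFile.Reader(fs, new Path(args[0]), conf);
            FSDataOutputStream txt = fs.create(new Path(args[1]));
            try {
                // Instantiate key/value objects of whatever types the file declares.
                Writable key = (Writable)
                    ReflectionUtils.newInstance(reader.getKeyClass(), conf);
                Writable value = (Writable)
                    ReflectionUtils.newInstance(reader.getValueClass(), conf);
                // Walk the file and write one "key <TAB> value" line per record.
                while (reader.next(key, value)) {
                    txt.writeBytes(key + "\t" + value + "\n");
                }
            } finally {
                reader.close();
                txt.close();
            }
        }
    }

Note this reads the file in the driver rather than in map/reduce tasks, which is usually all that is needed to dump a sequence file to text.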

Can HBase, MapReduce and HDFS work on a single machine with Hadoop installed and running on it?

Submitted by 邮差的信 on 2019-12-12 10:18:01
Question: I am working on a search engine design, which is to be run on the cloud. We have just started and do not have much experience with Hadoop yet. Can anyone tell me whether HBase, MapReduce and HDFS can work on a single machine with Hadoop installed and running on it?

Answer 1: Yes you can. You can even create a virtual machine and run it on there on a single "computer" (which is what I have :) ). The key is to simply install Hadoop in "Pseudo Distributed Mode", which is even described in the Hadoop Quickstart. If you use
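For reference, the pseudo-distributed setup the answer refers to boils down to three small config files; this sketch follows the Hadoop 1.x quickstart (the localhost ports are the conventional defaults, adjust as needed):

    <!-- conf/core-site.xml -->
    <configuration>
        <property>
            <name>fs.default.name</name>
            <value>hdfs://localhost:9000</value>
        </property>
    </configuration>

    <!-- conf/hdfs-site.xml: one machine, so keep a single replica -->
    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
    </configuration>

    <!-- conf/mapred-site.xml -->
    <configuration>
        <property>
            <name>mapred.job.tracker</name>
            <value>localhost:9001</value>
        </property>
    </configuration>

HBase then runs on top of that single-node HDFS in its own standalone or pseudo-distributed mode.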

Hadoop wordcount unable to run - need help decoding the Hadoop error message

Submitted by 拈花ヽ惹草 on 2019-12-12 10:09:28
Question: I need some help figuring out why my job failed. I built a single-node cluster just to try it out. I followed the example here. Everything seems to be working correctly: I formatted the namenode, am able to connect to the jobtracker, datanode, and namenode via the web interface, and am able to start and stop all the Hadoop services. However, when I try to run the wordcount example, I get this:

    Error initializing attempt_201105161023_0002_m_000011_0:
    java.io.IOException: Exception reading

Mongo MapReduce select latest date

Submitted by 荒凉一梦 on 2019-12-12 09:41:24
Question: I can't seem to get my MapReduce reduce function to work properly. Here is my map function:

    function Map() {
        var day = Date.UTC(this.TimeStamp.getFullYear(),
                           this.TimeStamp.getMonth(),
                           this.TimeStamp.getDate());
        emit(
            { search_dt: new Date(day), user_id: this.UserId },
            { timestamp: this.TimeStamp }
        );
    }

And here is my reduce function (two fixes applied to the snippet: an object literal needs braces rather than brackets, and a forEach callback must use return rather than continue; the truncated tail is completed with the obvious assignment):

    function Reduce(key, values) {
        var result = { timestamp: 0 };
        values.forEach(function (value) {
            if (!value.timestamp) return;           // skip empty entries
            if (result.timestamp < value.timestamp)
                result.timestamp = value.timestamp; // keep the latest date
        });
        return result;
    }
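A hedged invocation for the pair above, assuming the source collection is named searches (that name and the output name latest_per_user are illustrative):

    db.searches.mapReduce(Map, Reduce, { out: "latest_per_user" });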

MapReduce count example

Submitted by 烂漫一生 on 2019-12-12 09:37:54
Question: My question is about MapReduce programming in Java. Suppose I have the WordCount.java example, a standard MapReduce program. I want the map function to collect some information and return it to the reduce function as maps of the form <slaveNode_id, some_info_collected>, so that I can know which slave node collected which data. Any idea how?

    public class WordCount {
        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private final static
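One way to get the slave node into the output, sketched against the old mapred API used in the snippet: tag each emission with the host name the map task runs on. InetAddress.getLocalHost() is a standard way to obtain it; the class name and the choice to emit the raw input line as the "info" are illustrative:

    import java.io.IOException;
    import java.net.InetAddress;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class HostTaggingMap extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            // The host this map task runs on stands in for the slaveNode_id.
            String host = InetAddress.getLocalHost().getHostName();
            // Emit <slaveNode_id, some_info_collected>; here the "info"
            // is simply the input line itself.
            output.collect(new Text(host), value);
        }
    }

The reducer then receives all values grouped per slave node. Note this records where the map task ran, which with data locality is usually also where the data block lives.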

Hadoop performance

Submitted by 狂风中的少年 on 2019-12-12 08:10:14
Question: I installed Hadoop 1.0.0 and tried out the word-count example (single-node cluster). It took 2m 48s to complete. Then I tried the standard Linux word-count program, which ran in 10 milliseconds on the same data set (180 kB). Am I doing something wrong, or is Hadoop very, very slow?

    time hadoop jar /usr/share/hadoop/hadoop*examples*.jar wordcount someinput someoutput
    12/01/29 23:04:41 INFO input.FileInputFormat: Total input paths to process : 30
    12/01/29 23:04:41 INFO mapred.JobClient: Running

Select distinct on more than one field using MongoDB's map-reduce

Submitted by 爱⌒轻易说出口 on 2019-12-12 08:09:47
Question: I want to execute this SQL statement on MongoDB:

    SELECT DISTINCT book, author FROM library

So far MongoDB's DISTINCT only supports one field at a time. For more than one field, we have to use the GROUP command or map-reduce. I have googled a way to use the GROUP command:

    db.library.group({
        key: { book: 1, author: 1 },
        reduce: function (obj, prev) {
            if (!obj.hasOwnProperty("key")) {
                prev.book = obj.book;
                prev.author = obj.author;
            }
        },
        initial: {}
    });

However, this approach only supports up to 10,000 keys.
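With map-reduce the 10,000-key limit of group goes away, because the distinct pair itself becomes the emitted key; a minimal shell sketch (the output collection name distinct_book_author is illustrative):

    db.library.mapReduce(
        function () {
            // The compound key is the "distinct" value we are after.
            emit({ book: this.book, author: this.author }, null);
        },
        function (key, values) {
            // Duplicates collapse on the key; nothing to aggregate.
            return null;
        },
        { out: "distinct_book_author" }
    );

    // Each distinct (book, author) pair is now an _id in the output collection:
    db.distinct_book_author.find();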

Hadoop MapReduce for the Google web graph

Submitted by 隐身守侯 on 2019-12-12 08:06:46
Question: We have been given as an assignment the task of creating MapReduce functions that will output, for each node n in the Google web graph list, the nodes that you can reach from node n in 3 hops. (The actual data can be found here: http://snap.stanford.edu/data/web-Google.html) Here's an example of how the items in the list look:

    1 2
    1 3
    2 4
    3 4
    3 5
    4 1
    4 5
    4 6
    5 6

From the above, an example graph can be drawn. In this simplified example, the paths of node 1 include, for example, [1 -> 2 -> 4 -> 1
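One way to structure this as iterated MapReduce, a sketch rather than a full solution (assumes Hadoop 2.x and tab-separated "src dst" lines, as in the SNAP file): seed a "frontier" of 1-hop pairs with the edge list itself, then run the job below twice more. Each pass joins "n reaches w in k hops" against the edges leaving w, producing the (k+1)-hop pairs:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class HopExpand {
        // Edge u -> v: key by u so the reducer at u sees its out-edges.
        public static class EdgeMapper extends Mapper<Text, Text, Text, Text> {
            protected void map(Text src, Text dst, Context ctx)
                    throws IOException, InterruptedException {
                ctx.write(src, new Text("E\t" + dst));
            }
        }

        // Frontier pair (n, w) means "n reaches w in k hops": key by w.
        public static class FrontierMapper extends Mapper<Text, Text, Text, Text> {
            protected void map(Text n, Text w, Context ctx)
                    throws IOException, InterruptedException {
                ctx.write(w, new Text("F\t" + n));
            }
        }

        // At key w, cross every origin n with every edge w -> v.
        public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
            protected void reduce(Text w, Iterable<Text> vals, Context ctx)
                    throws IOException, InterruptedException {
                List<String> origins = new ArrayList<String>();
                List<String> targets = new ArrayList<String>();
                for (Text t : vals) {
                    String[] p = t.toString().split("\t", 2);
                    (p[0].equals("E") ? targets : origins).add(p[1]);
                }
                for (String n : origins)
                    for (String v : targets)
                        ctx.write(new Text(n), new Text(v)); // a (k+1)-hop pair
            }
        }

        public static void main(String[] args) throws Exception {
            // args: <edge dir> <frontier dir> <output dir>
            Job job = Job.getInstance();
            job.setJarByClass(HopExpand.class);
            MultipleInputs.addInputPath(job, new Path(args[0]),
                    KeyValueTextInputFormat.class, EdgeMapper.class);
            MultipleInputs.addInputPath(job, new Path(args[1]),
                    KeyValueTextInputFormat.class, FrontierMapper.class);
            FileOutputFormat.setOutputPath(job, new Path(args[2]));
            job.setReducerClass(JoinReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Duplicate pairs can appear when several paths reach the same node; a final distinct pass, or a set in the reducer, cleans that up.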

Hadoop on Windows Server

Submitted by 送分小仙女□ on 2019-12-12 07:09:06
Question: I'm thinking about using Hadoop to process large text files on my existing Windows 2003 servers (about 10 quad-core machines with 16 GB of RAM). The questions are:

- Is there any good tutorial on how to configure a Hadoop cluster on Windows?
- What are the requirements? Java + Cygwin + sshd? Anything else?
- HDFS, does it play nice on Windows?
- I'd like to use Hadoop in streaming mode. Any advice, tool or trick to develop my own mappers / reducers in C#? (A launch sketch follows below.)
- What do you use for submitting and monitoring
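For the streaming part, mappers and reducers are just executables that read stdin and write "key <TAB> value" lines to stdout, so a C# console program works; a hedged launch example (the jar path matches the Hadoop 1.x contrib layout; the .exe names and HDFS paths are illustrative):

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
        -input /data/in \
        -output /data/out \
        -mapper WordMapper.exe \
        -reducer WordReducer.exe \
        -file WordMapper.exe \
        -file WordReducer.exe

The -file flags ship the executables to every task node along with the job.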

Map-Reduce to combine data (MongoDB)

Submitted by 一笑奈何 on 2019-12-12 07:05:48
Question: I have two collections.

LogData:

    [
        { "SId": 10, "NoOfDaya": 9, "Status": 4 },
        { "SId": 11, "NoOfDaya": 8, "Status": 2 }
    ]

OptData:

    [
        { "SId": 10, "CId": 12, "CreatedDate": ISO(24-10-2014) },
        { "SId": 10, "CId": 13, "CreatedDate": ISO(24-10-2014) }
    ]

Now, using MongoDB, I need to find the data in the form:

    SELECT a.SPID, a.CreatedDate, CID = MAX(a.CID)
    FROM OptData a
    JOIN LogData c ON a.SID = c.SID
    WHERE Status > 2
    GROUP BY a.SPID, a.CreatedDate

LogData has 600 records, whereas OptData has 90 million records
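Since LogData is tiny relative to OptData, one workable pattern is to pull the qualifying SIds into memory first and push them into the map-reduce as a query filter; a shell sketch (assumes SPID in the SQL means SId, and the output collection name opt_latest is illustrative):

    // Step 1: the ~600-row side of the join fits in memory.
    var ids = db.LogData.distinct("SId", { Status: { $gt: 2 } });

    // Step 2: group OptData by (SId, CreatedDate), keeping the max CId.
    db.OptData.mapReduce(
        function () {
            emit({ SId: this.SId, CreatedDate: this.CreatedDate }, this.CId);
        },
        function (key, values) {
            return Math.max.apply(null, values); // MAX(CId) per group
        },
        {
            query: { SId: { $in: ids } },  // the "join" happens here
            out: "opt_latest"
        }
    );

The query filter restricts the map phase to documents whose SId actually matches, and writing to a collection avoids the 16 MB inline-result limit.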