MapReduce

How to fix “Task attempt_201104251139_0295_r_000006_0 failed to report status for 600 seconds.”

久未见 submitted on 2019-12-20 08:40:56
Question: I wrote a MapReduce job to extract some info from a dataset. The dataset is users' ratings of movies: the number of users is about 250K and the number of movies is about 300K. The output of the map is <user, <movie, rating>*> and <movie, <user, rating>*>. In the reducer, I process these pairs. But when I run the job, the mapper completes as expected, while the reducer always complains that Task attempt_* failed to report status for 600 seconds. I know this is due to a failure to update status, so I
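A common fix is to report progress from inside the reducer so the framework knows the task is still alive (or, alternatively, to raise mapred.task.timeout above its 600 000 ms default). Below is a minimal sketch assuming the new-style org.apache.hadoop.mapreduce API; the key/value types are placeholders, not the asker's actual types:

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch only: Text/Text key and value types are placeholders.
public class RatingReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        long seen = 0;
        for (Text value : values) {
            // ... long-running per-record processing goes here ...
            if (++seen % 10000 == 0) {
                // Tell the framework this attempt is still alive so it is not
                // killed after mapred.task.timeout (600 seconds by default).
                context.setStatus("processed " + seen + " values for key " + key);
                context.progress();
            }
        }
    }
}
```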

Can OLAP be done in BigTable?

喜欢而已 submitted on 2019-12-20 08:24:08
Question: In the past I used to build WebAnalytics using OLAP cubes running on MySQL. Now, an OLAP cube the way I used it is simply a large table (OK, it was stored a bit smarter than that) where each row is basically a measurement or an aggregated set of measurements. Each measurement has a bunch of dimensions (i.e. which pagename, useragent, ip, etc.) and a bunch of values (i.e. how many pageviews, how many visitors, etc.). The queries that you run on a table like this are usually of the form (meta

Parsing of Stackoverflow's posts.xml on Hadoop

我的未来我决定 submitted on 2019-12-20 07:51:17
Question: I am following this article by Anoop Madhusudanan on CodeProject to build a recommendation engine, not on a cluster but on my own system. The problem is when I try to parse posts.xml, whose structure is as follows: <row Id="99" PostTypeId="2" ParentId="88" CreationDate="2008-08-01T14:55:08.477" Score="2" Body="<blockquote> <p>The actual resolution of gettimeofday() depends on the hardware architecture. Intel processors as well as SPARC machines offer high resolution timers that measure microseconds.
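Each <row .../> line in posts.xml is a self-contained element whose fields are XML attributes (with the HTML inside Body entity-escaped), so one common approach is to parse each input line as a tiny XML document inside the mapper. A minimal sketch, assuming TextInputFormat feeds one row per line and using placeholder output types:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class PostsMapper extends Mapper<LongWritable, Text, Text, Text> {
    private DocumentBuilder builder;

    @Override
    protected void setup(Context context) throws IOException {
        try {
            builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            throw new IOException(e);
        }
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String xml = line.toString().trim();
        if (!xml.startsWith("<row")) {
            return; // skip the <?xml ...?> declaration and the <posts> wrapper lines
        }
        try {
            Element row = builder.parse(new InputSource(new StringReader(xml)))
                                 .getDocumentElement();
            String id = row.getAttribute("Id");
            String parentId = row.getAttribute("ParentId");
            // The XML parser un-escapes the Body attribute (&lt;blockquote&gt; ...)
            // back into plain HTML automatically.
            String body = row.getAttribute("Body");
            context.write(new Text(id), new Text(parentId + "\t" + body));
        } catch (Exception e) {
            // Skip malformed rows instead of failing the whole task.
            context.getCounter("posts_xml", "malformed_rows").increment(1);
        }
    }
}
```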

JOIN in Hive triggers which type of JOIN in MapReduce?

筅森魡賤 submitted on 2019-12-20 05:53:07
Question: If I have a query in Hive which employs a JOIN, let's say a LEFT OUTER JOIN or an INNER JOIN on two tables ON some column, how do I know which type of JOIN it is converted into in the back-end MapReduce (i.e. a map-side JOIN or a reduce-side JOIN)? Thanks. Answer 1: Use EXPLAIN SELECT ... and check the plan; it shows exactly what the map and reduce stages will do. Also, during execution you can check the logs on the JobTracker and see what the mapper or reducer processes are doing. For example the following

Problems with starting Oozie workflow

我是研究僧i submitted on 2019-12-20 05:41:49
Question: I have a problem starting an Oozie workflow: Config: <workflow-app name="Hive" xmlns="uri:oozie:workflow:0.4"> <start to="Hive"/> <action name="Hive"> <hive xmlns="uri:oozie:hive-action:0.2"> <job-tracker>${jobTracker}</job-tracker> <name-node>${nameNode}</name-node> <configuration> <property> <name>oozie.hive.defaults</name> <value>hive-default.xml</value> </property> </configuration> <script>/user/hue/oozie/workspaces/hive/hive.sql</script> <param>INPUT_TABLE=movieapp_log_json</param> <param

How to run an external program within a mapper or reducer, taking HDFS files as input and storing output files in HDFS?

杀马特。学长 韩版系。学妹 submitted on 2019-12-20 03:52:35
Question: I have an external program which takes a file as input and produces an output file //for example input file: IN_FILE output file: OUT_FILE //Run the external program ./vx < ${IN_FILE} > ${OUT_FILE} I want both the input and output files to be in HDFS. I have a cluster with 8 nodes, and I have 8 input files, each with 1 line //1 input file : 1.txt 1:0,0,0 //2 input file : 2.txt 2:0,0,128 //3 input file : 3.txt 3:0,128,0 //4 input file : 4.txt 4:0,128,128 //5 input file : 5.txt 5:128,0,0 //6 input file : 6.txt 6:128,0,128
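One way to approach this is to have the mapper copy its HDFS input to local disk, run the external binary with ProcessBuilder, and copy the result back into HDFS. The sketch below is an illustration only: the input line is assumed to name the HDFS file to process, the output directory is hypothetical, and the vx binary is assumed to have been shipped to each node (e.g. via the distributed cache):

```java
import java.io.File;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ExternalProgramMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical layout: each input line contains the HDFS path to process.
        Path hdfsIn = new Path(value.toString());
        File localIn = new File("in_" + hdfsIn.getName());
        File localOut = new File("out_" + hdfsIn.getName());
        fs.copyToLocalFile(hdfsIn, new Path(localIn.getAbsolutePath()));

        // Equivalent of ./vx < ${IN_FILE} > ${OUT_FILE}, run on the local copies.
        ProcessBuilder pb = new ProcessBuilder("./vx");
        pb.redirectInput(localIn);
        pb.redirectOutput(localOut);
        int exitCode = pb.start().waitFor();
        if (exitCode != 0) {
            throw new IOException("vx failed with exit code " + exitCode);
        }

        // Copy the result back into HDFS (the target directory is hypothetical).
        Path hdfsOut = new Path("/user/hadoop/vx-output/" + hdfsIn.getName());
        fs.copyFromLocalFile(new Path(localOut.getAbsolutePath()), hdfsOut);
        context.write(new Text(hdfsOut.toString()), NullWritable.get());
    }
}
```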

Hadoop and Python: Disable Sorting

走远了吗. submitted on 2019-12-20 03:41:19
Question: I've realized that when running Hadoop with Python code, either the mapper or the reducer (I'm not sure which) is sorting my output before it's printed out by reducer.py. Currently it seems to be sorted alphanumerically. I am wondering if there is a way to disable this completely. I would like the output of the program to follow the order in which it's printed from mapper.py. I've found answers for Java but none for Python. Would I need to modify mapper.py, or perhaps the command-line arguments?

Big Data Interview Questions

房东的猫 submitted on 2019-12-20 03:38:01
Part 1: Multiple-choice questions
1. Which of the following processes is responsible for HDFS data storage? Answer: C, DataNode
a) NameNode b) JobTracker c) DataNode d) SecondaryNameNode e) TaskTracker
NameNode: responsible for coordination. For example, if you store a 640 MB file split into 64 MB blocks, the NameNode assigns those 10 blocks (ignoring replicas here) to DataNodes in the cluster and records the mapping, so when you later download the file it knows which nodes to fetch the data from. It mainly maintains two maps: one from files to blocks, and one from blocks to nodes (i.e. which blocks a file is split into, and which nodes hold each block).
2. How many copies of an HDFS block are kept by default? Answer: A, 3 copies by default
a) 3 b) 2 c) 1 d) not fixed
3. Which of the following processes is usually started on the same node as the NameNode? Answer: D
a) SecondaryNameNode b) DataNode c) TaskTracker d) JobTracker
Analysis: a Hadoop cluster is based on the master/slave model. The NameNode and JobTracker belong to the master, while the DataNode and TaskTracker belong to the slaves; there is only one master but many slaves. The SecondaryNameNode's memory requirement is on the same order of magnitude as the NameNode's
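A simplified conceptual model of the two mappings described in question 1 (illustration only, not the actual NameNode implementation, which keeps this metadata in far more elaborate in-memory structures plus the edit log and fsimage):

```java
import java.util.List;
import java.util.Map;

// Illustration only: files map to the blocks they are split into,
// and blocks map to the DataNodes currently holding a replica.
class NameNodeMetadataSketch {
    Map<String, List<String>> fileToBlocks;     // file path -> block IDs
    Map<String, List<String>> blockToDataNodes; // block ID  -> DataNode hosts
}
```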

MongoDB MapReduce: Not working as expected for more than 1000 records

让人想犯罪 __ submitted on 2019-12-20 02:37:44
Question: I wrote a mapreduce function where the records are emitted in the following format {userid:<xyz>, {event:adduser, count:1}} {userid:<xyz>, {event:login, count:1}} {userid:<xyz>, {event:login, count:1}} {userid:<abc>, {event:adduser, count:1}} where userid is the key and the rest is the value for that key. After the MapReduce function, I want to get the result in the following format {userid:<xyz>, {events: [{adduser:1},{login:2}], allEventCount:3}} To achieve this I wrote the following

Jobtracker API error - Call to localhost/127.0.0.1:50030 failed on local exception: java.io.EOFException

人盡茶涼 submitted on 2019-12-20 01:34:18
Question: I am trying to connect to my JobTracker using Java. Shown below is the program I am trying to execute: public static void main(String args[]) throws IOException { Configuration conf = new Configuration(); conf.addResource(new Path( "/home/user/hadoop-1.0.3/conf/core-site.xml")); conf.addResource(new Path( "/home/user/hadoop-1.0.3/conf/hdfs-site.xml")); conf.addResource(new Path( "/home/user/hadoop-1.0.3/conf/mapred-site.xml")); InetSocketAddress jobtracker = new InetSocketAddress("localhost",
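Port 50030 in the error message is the JobTracker's HTTP web UI port, not its RPC port, which is a likely reason the RPC client gets an EOFException. A usual fix is to let the client read the JobTracker address from mapred.job.tracker (the IPC address, e.g. localhost:9001 in typical Hadoop 1.x setups) instead of hard-coding a host and port. A minimal sketch against the Hadoop 1.x API:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobStatus;

public class JobTrackerClient {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.addResource(new Path("/home/user/hadoop-1.0.3/conf/core-site.xml"));
        conf.addResource(new Path("/home/user/hadoop-1.0.3/conf/mapred-site.xml"));

        // JobClient connects to the address configured in mapred.job.tracker
        // (the IPC port), not the 50030 web UI port.
        JobClient client = new JobClient(new JobConf(conf));
        for (JobStatus status : client.getAllJobs()) {
            System.out.println(status.getJobID() + " state=" + status.getRunState());
        }
        client.close();
    }
}
```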