MapReduce

Hadoop: strange ClassNotFoundException

南楼画角 submitted on 2020-01-07 04:07:29
Question: I am getting a ClassNotFoundException. The class that is claimed to be missing does not exist; rather, the "class name" is the path to the list of input files for my map reduce jobs.

INFO server Running: /usr/lib/hadoop/bin/hadoop --config /var/run/cloudera-scm-agent/process/155-hue/JOBSUBD/hadoop-conf jar tmp.jar /user/hduser/datasets/ /user/hduser/tmp/job_20/ mongodb://slave15/db_8.job_20

Exception in thread "main" java.lang.ClassNotFoundException: /user/hduser/datasets/ at java.lang
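The excerpt cuts off before any resolution, but the command line itself suggests a likely cause: when a jar's manifest has no Main-Class entry, hadoop jar interprets the first argument after the jar as the driver class to run, so here the input path /user/hduser/datasets/ is being parsed as a class name. A minimal driver sketch under that assumption (WordCountDriver and the job name are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "job_20");
        job.setJarByClass(WordCountDriver.class);
        // args[0] = input directory, args[1] = output directory
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Invoked with the class spelled out, e.g. hadoop jar tmp.jar WordCountDriver /user/hduser/datasets/ /user/hduser/tmp/job_20/, the first path is no longer mistaken for a class.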

Hadoop: Does using CombineFileInputFormat for small files give a performance improvement?

為{幸葍}努か submitted on 2020-01-07 01:53:09
Question: I am new to Hadoop and am running some tests on a local machine. Many solutions exist for dealing with large numbers of small files. I am using CombinedInputFormat, which extends CombineFileInputFormat. With CombinedInputFormat, the number of mappers dropped from 100 to 25. Should I also expect a performance gain, since the number of mappers has been reduced? I ran the map-reduce job on many small files without CombinedInputFormat: 100 mappers took 10 minutes. But when the map-reduce
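Whether fewer mappers helps is mostly a question of per-task overhead: each mapper costs a JVM launch plus scheduling, while the total I/O stays the same, so packing many small files into fewer splits usually does speed up small-file jobs. For reference, a sketch using the built-in CombineTextInputFormat that newer Hadoop releases ship (rather than a hand-rolled subclass); the 128 MB split cap is an arbitrary example value:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SmallFilesJobSetup {
    public static Job configure() throws Exception {
        Job job = Job.getInstance(new Configuration(), "small-files");
        // Pack many small files into a few large combined splits.
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Cap each combined split at ~128 MB; fewer splits -> fewer mapper JVMs.
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
        return job;
    }
}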

Kylin: A Distributed Analytical Engine

时光怂恿深爱的人放手 submitted on 2020-01-07 01:15:45
Apache Kylin is an open-source distributed analytical engine that provides a SQL query interface and multidimensional analysis (OLAP) capability on top of Hadoop/Spark for extremely large datasets. Originally developed by eBay and contributed to the open-source community, it can query huge Hive tables with sub-second latency.

Cube build process:
1. Create an intermediate flat table.
2. Redistribute the intermediate table's data evenly across files (to prevent data skew).
3. Build the dimension dictionaries.
4. Build the cube.
5. Convert the result into HBase's key-value structure.
6. Convert the cube data into HFile format and load it into HBase.

Cube build algorithms:
Layer-by-layer algorithm ("by layer", expanding outward from the center): each layer's computation is based on the results of the layer above it. Every round is one MapReduce job, and the rounds run serially; an N-dimensional cube needs at least N MapReduce jobs.
Advantages: the algorithm takes full advantage of MapReduce, which handles the complex intermediate sorting and shuffle work, so the code is clear, simple, and easy to maintain; and thanks to Hadoop's maturity it is very stable, guaranteed to complete eventually even when cluster resources are tight.
Disadvantages: when the cube has many dimensions, the number of MapReduce jobs grows accordingly, and since Hadoop job scheduling consumes extra resources, especially on large clusters, the overhead of repeatedly submitting jobs becomes considerable; and because
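As a concrete illustration of the "SQL interface over Hadoop" claim, this is roughly what querying a cube looks like through Kylin's JDBC driver; the host, project, table, and credentials below are the sample-cube defaults and may differ in a real deployment:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class KylinQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.kylin.jdbc.Driver");
        // URL format: jdbc:kylin://<host>:<port>/<project>; ADMIN/KYLIN are the stock credentials.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:kylin://localhost:7070/learn_kylin", "ADMIN", "KYLIN");
             Statement stmt = conn.createStatement();
             // Aggregations like this are answered from the precomputed cube,
             // not by scanning the underlying Hive table.
             ResultSet rs = stmt.executeQuery(
                 "SELECT part_dt, SUM(price) FROM kylin_sales GROUP BY part_dt")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
            }
        }
    }
}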

Group by X OR Y in Pig

假如想象 submitted on 2020-01-06 23:43:47
Question: I am processing a large amount of data with Pig and I need to group records by one field OR the other. Be careful: this is not the classic GROUP BY X AND Y; I mean, two records must be grouped together if they have the same value for attribute X OR for attribute Y. For example, given this dataset:

1, a, 'r1'
2, b, 'r2'
3, c, 'r3'
4, a, 'r4'
3, d, 'r5'
5, c, 'r6'
5, e, 'r7'

the result of grouping by the first OR the second field should be:

{(1, a, 'r1'), (4, a, 'r4')}
{(2, b, 'r2')}
{(3, c, 'r3'), (3, d, 'r5'), (5, c, 'r6
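The accepted approach isn't visible in the truncated excerpt, but note why a plain GROUP BY can't express this: the relation is transitive. (3, c, 'r3') connects to (3, d, 'r5') through the first field and to (5, c, 'r6') through the second, which in turn connects to (5, e, 'r7'), so all four land in one group. This is a connected-components problem; a Pig/MapReduce solution typically iterates a self-join to a fixed point. As an in-memory illustration of the grouping semantics only (not a Pig solution), a union-find sketch over the example rows:

import java.util.*;

public class GroupByXorY {
    // Union-find over record indices, with path compression.
    static int[] parent;
    static int find(int i) { return parent[i] == i ? i : (parent[i] = find(parent[i])); }
    static void union(int a, int b) { parent[find(a)] = find(b); }

    public static void main(String[] args) {
        // (x, y, label) triples from the example dataset.
        String[][] rows = {
            {"1","a","r1"}, {"2","b","r2"}, {"3","c","r3"}, {"4","a","r4"},
            {"3","d","r5"}, {"5","c","r6"}, {"5","e","r7"}
        };
        parent = new int[rows.length];
        for (int i = 0; i < rows.length; i++) parent[i] = i;

        // Link any two rows that share an X value or a Y value.
        Map<String, Integer> firstByX = new HashMap<>(), firstByY = new HashMap<>();
        for (int i = 0; i < rows.length; i++) {
            Integer px = firstByX.putIfAbsent(rows[i][0], i);
            if (px != null) union(i, px);
            Integer py = firstByY.putIfAbsent(rows[i][1], i);
            if (py != null) union(i, py);
        }

        // Collect the connected components.
        Map<Integer, List<String>> groups = new TreeMap<>();
        for (int i = 0; i < rows.length; i++)
            groups.computeIfAbsent(find(i), k -> new ArrayList<>()).add(rows[i][2]);
        groups.values().forEach(System.out::println);
        // Prints: [r1, r4], [r2], [r3, r5, r6, r7]
    }
}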

MapReduce Example

左心房为你撑大大i submitted on 2020-01-06 16:58:58
WordCount mapper class

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Mapper:
 * reads the file line by line, splits each line into words on the given
 * delimiter, and then processes each word.
 *
 * The Mapper class requires four generic type parameters.
 *
 * Because serializing and deserializing Java's native types is relatively
 * inefficient, Hadoop ships its own set of writable types:
 *   long   -> LongWritable
 *   String -> Text
 *   int    -> IntWritable
 *
 * KEYIN:    the key type of the input <key, value> pair; here, the byte
 *           offset of each line within the data file
 * VALUEIN:  the value type of the input pair; here, the content of each line
 * KEYOUT:   the key type of the output pair
 * VALUEOUT: the value type of the output pair
 */
public class
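The excerpt cuts off at the class declaration. A standard completion consistent with the comments above, using the imports already listed (the class name and the whitespace delimiter are assumptions; the original may differ):

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line on whitespace and emit <word, 1> for each token.
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}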

Demystifying the Right Way to Approach Big Data: The "Troika" of Big Data Explained with Vivid Examples

安稳与你 submitted on 2020-01-06 15:25:00
Me: "It began with beauty, we met by chance, and we'll stay together until our hair turns white!"

Audience: "Uh, is today's talk about how to write poetry?!"

Me: "Don't get the wrong idea. Today's session is not about poetry but about 'Demystifying the Right Way to Approach Big Data.' Let's get to the point."

These days, when friends in tech circles get together to chat, you can hardly join the conversation without knowing a bit about big data. So that you'll have something to say at the next gathering, follow along and get acquainted with big data. But before we start, let's first look back at what we in engineering have been through over the years.

Origin: application architecture from 0 to 1. What have we been through in all these years of development?

Simplicity first. Living in the tech world, if you stop and think about it, no matter how large or complex an application system is, it really consists of three parts: a pretty website front end, an ugly admin module, and a scheduled-task module quietly doing the work.

The system we were responsible for was no exception. Initially the three modules were bundled together ("all in one") and a single Tomcat instance handled production with ease; it was very much a big ball of mud.

Then complexity grew. Because the website, admin platform, and scheduled tasks were bundled together, collaborative development was awkward and merge conflicts popped up from time to time. Upgrading the application in production also took other modules offline; for example, changing a scheduled task's configuration could make the website and admin platform temporarily unavailable. Faced with all these inconveniences, we had no choice but to break up the all-in-one ball of mud.

NullPointerException in HBase MapReduce

送分小仙女□ submitted on 2020-01-06 15:18:10
Question: I have a MapReduce app, written entirely in Java, that takes an HBase table as its source and mapreduces it into another HBase table. When I run it with hadoop jar myhbase.jar, it terminates with a NullPointerException, as below:

14/01/31 11:07:02 INFO zookeeper.ClientCnxn: Socket connection established to 127.0.0.1/127.0.0.1:2181, initiating session
14/01/31 11:07:02 INFO zookeeper.ClientCnxn: Session establishment complete on server 127.0.0.1/127.0.0.1:2181, sessionid = 0x143e677d6e30007, negotiated timeout
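The truncated trace doesn't show where the NPE originates, but for reference, a standard table-to-table job looks roughly like the sketch below (table names are placeholders, and the shape follows HBase's own CopyTable pattern); an NPE at startup often points at missing configuration, for example an hbase-site.xml that isn't on the classpath, rather than the map logic itself:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;

public class HBaseTableCopy {

    // Map-only pass: re-emit every cell of a source row as a Put on the target table.
    public static class CopyMapper extends TableMapper<ImmutableBytesWritable, Put> {
        @Override
        protected void map(ImmutableBytesWritable rowKey, Result columns, Context context)
                throws IOException, InterruptedException {
            Put put = new Put(rowKey.get());
            for (Cell cell : columns.rawCells()) {
                put.add(cell);
            }
            context.write(rowKey, put);
        }
    }

    public static void main(String[] args) throws Exception {
        // HBaseConfiguration.create() picks up hbase-site.xml (zookeeper quorum
        // etc.) from the classpath; if it is missing, failures follow quickly.
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "hbase-table-copy");
        job.setJarByClass(HBaseTableCopy.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // larger scanner caching suits MR scans
        scan.setCacheBlocks(false);  // don't pollute the region server block cache

        TableMapReduceUtil.initTableMapperJob("source_table", scan,
                CopyMapper.class, ImmutableBytesWritable.class, Put.class, job);
        TableMapReduceUtil.initTableReducerJob("target_table", null, job);
        job.setNumReduceTasks(0);    // map-only; TableOutputFormat writes the Puts
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}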

Call mapper when reducer is done

那年仲夏 submitted on 2020-01-06 14:45:33
Question: I am executing the job as:

hadoop/bin/./hadoop jar /home/hadoopuser/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar -D mapred.reduce.tasks=2 -file kmeans_mapper.py -mapper kmeans_mapper.py -file kmeans_reducer.py \
-reducer kmeans_reducer.py -input gutenberg/small_train.csv -output gutenberg/out

When the two reducers are done, I would like to do something with the results, so ideally I would like to call another file (another mapper?) which would receive the output of the reducers as
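The excerpt ends mid-sentence, but the usual answer to "run another mapper once the reducers finish" is simply a second job whose input is the first job's output directory; MapReduce has no built-in post-reduce hook. With streaming that means a second hadoop-streaming invocation with -input gutenberg/out. In Java API terms the same chaining looks like this sketch (job names and the second output path are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedJobs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job first = Job.getInstance(conf, "kmeans-pass");
        FileInputFormat.addInputPath(first, new Path("gutenberg/small_train.csv"));
        FileOutputFormat.setOutputPath(first, new Path("gutenberg/out"));
        if (!first.waitForCompletion(true)) System.exit(1); // blocks until reducers finish

        Job second = Job.getInstance(conf, "post-process");
        // The first job's reducer output becomes the second job's mapper input.
        FileInputFormat.addInputPath(second, new Path("gutenberg/out"));
        FileOutputFormat.setOutputPath(second, new Path("gutenberg/out2"));
        System.exit(second.waitForCompletion(true) ? 0 : 1);
    }
}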

MapReduce outputs lines from the input file besides the expected result

落爺英雄遲暮 submitted on 2020-01-06 11:04:11
Question: I managed to implement a MapReduce job in Java. It works for my case, but for some reason the output contains, besides the desired result, some data from the input file, and I can't figure out why. Here is the class; I left a comment in the code on the line that causes the problem. If I delete that line it doesn't work any more, but with that line in place I get the awkward output (containing data from my input plus the desired output). The problem is in the "reduce" method at the bottom; I left a comment there
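The code and the problem line are not shown in the excerpt, but one frequent cause of exactly this symptom is worth noting: if the reduce method's signature doesn't match Reducer.reduce (a classic slip is declaring Iterator instead of Iterable), it becomes an overload rather than an override, and Hadoop silently runs the default identity reduce, which copies the mapper's output straight into the result. An @Override annotation makes the compiler catch the mismatch; a correctly-overridden reduce for Text/IntWritable pairs (the types are an assumption here) looks like:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Wrong: reduce(Text, Iterator<IntWritable>, Context) would merely overload,
    // so the framework would fall back to the identity pass-through.
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}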

Resetting an Iterable Object

谁说我不能喝 submitted on 2020-01-06 08:16:10
Question: I am writing an MR job where, on the reducer side, I need to check the size of the Iterable before doing anything with it. Someone asked the same question long ago (How to find the size of an Iterable Object?) but the solutions given there do not work. Since Iterable has no size() method, please suggest how to do this. I tried the following options. I tried to get an Iterator object from the Iterable, but got the following typecast error; the same error occurs with ResettableIterator.

java.lang.ClassCastException:
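Two constraints explain why the usual tricks fail here: the reducer's Iterable can only be traversed once, and the framework reuses the same Writable instance on every iteration step. A common workaround, assuming IntWritable values and that a single key's value list fits in memory, is to cache deep copies while counting:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SizeAwareReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // The iterable is single-pass and Hadoop reuses the same IntWritable
        // object, so cache *copies*, not references to the reused instance.
        List<IntWritable> cached = new ArrayList<>();
        for (IntWritable v : values) {
            cached.add(new IntWritable(v.get())); // deep copy
        }
        int size = cached.size(); // the "size of the Iterable"
        if (size > 1) {
            for (IntWritable v : cached) {
                context.write(key, v);
            }
        }
    }
}

The memory cost is the obvious trade-off; if value lists can be huge, the alternative is a prior job (or combiner) that emits the count alongside the data.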