MapReduce

Need help in writing Map/Reduce job to find average

Question: I'm fairly new to Hadoop Map/Reduce. I'm trying to write a Map/Reduce job to find the average time taken by n processes, given an input text file like the one below:

```
ProcessName Time
process1    10
process2    20
processn    30
```

I went through a few tutorials but I'm still not able to get a thorough understanding. What should my mapper and reducer classes do for this problem? Will my output always be a text file, or is it possible to store the average directly in some sort of variable? Thanks.

Answer 1: Your Mappers read
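
The quoted answer breaks off above. For context, a minimal sketch of the usual shape of such a job (class and key names are illustrative, not from the original answer): every mapper emits the parsed time under one shared key, so a single reduce call sees all values and can compute the average. The result still lands in an output file on HDFS; to get it "into a variable", the driver has to read that file (or a counter) back after the job completes.

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class AverageJob {

    // Emits (constant key, time) so all times reach the same reduce call.
    public static class TimeMapper extends Mapper<Object, Text, Text, LongWritable> {
        private static final Text KEY = new Text("avg");

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().trim().split("\\s+");
            if (parts.length == 2 && !parts[0].equals("ProcessName")) { // skip header line
                context.write(KEY, new LongWritable(Long.parseLong(parts[1])));
            }
        }
    }

    // Sums all times under the shared key and divides by their count.
    public static class AvgReducer extends Reducer<Text, LongWritable, Text, DoubleWritable> {
        @Override
        public void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0, count = 0;
            for (LongWritable v : values) {
                sum += v.get();
                count++;
            }
            context.write(key, new DoubleWritable((double) sum / count));
        }
    }
}
```

The driver would declare the matching types (setMapOutputValueClass(LongWritable.class), setOutputValueClass(DoubleWritable.class)); the average then sits in part-r-00000, which the driver can read back if a single value is wanted in memory.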

Hadoop (java) change the type of Mapper output values

Question: I am writing a mapper function that generates keys that are some user_id, where the values are also of Text type. Here is how I do it:

```java
public static class UserMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text userid = new Text();
    private Text catid = new Text();

    /* map method */
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer
```
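
The snippet above declares IntWritable as the fourth type parameter (the map output value type), while the fields it prepares to emit (userid, catid) are Text. A minimal sketch of the usual fix, assuming the goal is to emit Text values (the input layout and class name here are hypothetical): change the generic signature, then tell the job about the new map output value class.

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Fourth type parameter is now Text, so context.write(Text, Text) is legal.
public class UserCatMapper extends Mapper<Object, Text, Text, Text> {
    private final Text userid = new Text();
    private final Text catid = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Hypothetical input layout: "userId<TAB>categoryId"
        String[] fields = value.toString().split("\t");
        if (fields.length >= 2) {
            userid.set(fields[0]);
            catid.set(fields[1]);
            context.write(userid, catid); // both key and value are Text now
        }
    }
}
```

The driver must agree with the new signature, e.g. job.setMapOutputKeyClass(Text.class) and job.setMapOutputValueClass(Text.class); otherwise the framework will fail at runtime with a type mismatch.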

Iterate through ArrayWritable - NoSuchMethodException

Question: I just started working with MapReduce, and I'm running into a weird bug that I haven't been able to answer through Google. I'm making a basic program using ArrayWritable, but when I run it, I get the following error during the Reduce phase:

```
java.lang.RuntimeException: java.lang.NoSuchMethodException: org.apache.hadoop.io.ArrayWritable.<init>()
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer
```
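
The exception says Hadoop's deserializer tried to construct an ArrayWritable reflectively and found no usable no-argument constructor (ArrayWritable's no-arg form does not know its element class). The widely used fix, sketched here under that assumption with an illustrative class name, is to subclass ArrayWritable and bake the element type into a public no-arg constructor:

```java
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.Text;

// Subclass so ReflectionUtils.newInstance() can instantiate it:
// the no-arg constructor supplies the element class.
public class TextArrayWritable extends ArrayWritable {
    public TextArrayWritable() {
        super(Text.class);
    }

    public TextArrayWritable(Text[] values) {
        super(Text.class, values);
    }
}
```

The job then declares the subclass, e.g. job.setMapOutputValueClass(TextArrayWritable.class), instead of the raw ArrayWritable.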

How to pass variable between two map reduce jobs

Question: I have chained two MapReduce jobs. Job1 has only one reducer, in which I compute a float value. I want to use this value in the reducer of Job2. This is my main method setup:

```java
public static String GlobalVriable;

public static void main(String[] args) throws Exception {
    int runs = 0;
    for (; runs < 10; runs++) {
        String inputPath = "part-r-000" + nf.format(runs);
        String outputPath = "part-r-000" + nf.format(runs + 1);
        MyProgram.MR1(inputPath);
        MyProgram.MR2(inputPath, outputPath);
    }
}
```
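
A static field like the one above will not work, because Job2's reducers run in separate JVMs on other nodes and never see the driver's statics. A common pattern, sketched below with hypothetical paths and key names, is for the driver to read Job1's single output file back from HDFS and hand the value to Job2 through its Configuration:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PassValueBetweenJobs {
    // Reads the float that Job1's lone reducer wrote, and stores it in the
    // Configuration that Job2 will be created from.
    static Configuration configForJob2(String job1OutputDir) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path result = new Path(job1OutputDir, "part-r-00000"); // single reducer => one file
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(result)))) {
            conf.setFloat("job1.result", Float.parseFloat(in.readLine().trim()));
        }
        return conf;
    }
}
```

Inside Job2's reducer, context.getConfiguration().getFloat("job1.result", 0f) retrieves the value. Counters are another common channel when the value can be accumulated as a long.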

What does “RDDs can be stored in memory” mean in Spark?

Question: The introduction to Spark says that RDDs can be stored in memory between queries without requiring replication. As I understand it, you must cache an RDD manually using .cache() or .persist(). If I do neither, as below:

```scala
val file = sc.textFile("hdfs://data/kv1.txt")
file.flatMap(line => line.split(" "))
file.count()
```

I don't persist the RDD `file` in memory or on disk. In this situation, can Spark still run faster than MapReduce?

Answer 1: What will happen is that Spark will compute, partition by partition,
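
(Note in passing that the quoted snippet discards the flatMap result: transformations return new RDDs, so file.count() counts the original lines.) For contrast, a minimal sketch of the explicitly cached case, using the Spark 2.x Java API with a hypothetical HDFS path; the second action reuses in-memory partitions instead of re-reading HDFS:

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CacheExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("cache-example");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> words = sc.textFile("hdfs:///data/kv1.txt")
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator());
        words.cache();                             // mark for in-memory storage

        long total = words.count();                // 1st action: reads HDFS, fills the cache
        long distinct = words.distinct().count();  // 2nd action: served from cached partitions

        System.out.println(total + " words, " + distinct + " distinct");
        sc.stop();
    }
}
```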

Setting Up a Hadoop Environment

1. Introduction to Hadoop

1.1 The overall Hadoop framework

Hadoop is made up of HDFS, MapReduce, HBase, Hive, ZooKeeper, and other members. Its most fundamental and important elements are HDFS, the underlying file system that stores the files of every storage node in the cluster, and the MapReduce engine that executes MapReduce programs.

1. Pig is a large-scale data analysis platform built on Hadoop; it provides a simple operating and programming interface for complex parallel computation over massive data.
2. Hive is a Hadoop-based tool that provides full SQL querying; it can translate SQL statements into MapReduce jobs for execution.
3. ZooKeeper is an efficient, scalable coordination system that stores and coordinates critical shared state.
4. HBase is an open-source distributed database built on a column-oriented storage model.
5. HDFS is a distributed file system with high fault tolerance, well suited to applications with very large data sets.
6. MapReduce is a programming model for parallel computation over large data sets.

1.2 Hadoop cluster deployment structure

1.3 Hadoop core design

1. HDFS is a highly fault-tolerant distributed file system that can be widely deployed on inexpensive PCs. It accesses application data in a streaming fashion (see the sketch below), which raises the system's data throughput and makes it a very good fit for applications with huge data sets. HDFS uses a master/slave architecture; an HDFS cluster contains one namenode
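
To make the streaming access model concrete, a minimal sketch of reading a file through Hadoop's FileSystem API; the namenode address and path are hypothetical:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCat {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical namenode address
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/data/kv1.txt"))))) {
            String line;
            while ((line = in.readLine()) != null) { // data streams in; nothing is loaded whole
                System.out.println(line);
            }
        }
    }
}
```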

Hadoop map jobs fail with com.datastax.driver.core.exceptions.NoHostAvailableException

Question: I am trying to run analytics using Hadoop MapReduce over data stored in Cassandra. For this I am using the class CqlInputFormat, available through the Maven dependency cassandra-all; we currently use version 2.0.10 of this dependency in our production environment. We are also using cassandra-driver-core version 2.1.1. Now, when I submit a simple MapReduce job to my jobtracker, all of my mapper tasks fail with the exception below. Another important thing to note here
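
NoHostAvailableException in this setup usually means the mapper JVMs cannot reach Cassandra at the contact point and port the input format was configured with, or that the embedded driver and the cluster disagree on protocol version (the 2.0.10 cassandra-all / 2.1.1 driver mix above may be worth checking). A sketch of the usual driver-side wiring for CqlInputFormat, using helper methods shipped in cassandra-all; the address, ports, keyspace, and table are placeholders:

```java
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CassandraJobSetup {
    // Wire the job's input to Cassandra; every value here is a placeholder.
    static void configureInput(Job job) {
        Configuration conf = job.getConfiguration();
        job.setInputFormatClass(CqlInputFormat.class);
        ConfigHelper.setInputInitialAddress(conf, "10.0.0.1");   // must be reachable from task nodes
        ConfigHelper.setInputRpcPort(conf, "9160");              // must match the cluster's port
        ConfigHelper.setInputColumnFamily(conf, "my_keyspace", "my_table");
        ConfigHelper.setInputPartitioner(conf, "Murmur3Partitioner");
        CqlConfigHelper.setInputCQLPageRowSize(conf, "1000");    // rows fetched per page
    }
}
```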

Introduction to MapReduce: An Overview

MapReduce Introduction: Overview

Definition: MapReduce is a programming framework for distributed computation and the core framework with which users develop "Hadoop-based data analysis applications". Its core function is to combine the user's business-logic code with the framework's built-in default components into one complete distributed program that runs concurrently on a Hadoop cluster.

Strengths and weaknesses:
Strengths: easy to program, good scalability, high fault tolerance, offline processing of massive data sets.
Weaknesses: not well suited to real-time computation, stream computation, or DAG (directed acyclic graph) computation.

Core idea: [Figure: the core idea of MapReduce; the image did not survive in the original]

MapReduce processes: a complete MapReduce program running in distributed mode involves three kinds of instance processes:
- MrAppMaster: responsible for scheduling the whole program's execution and tracking its state.
- MapTask: responsible for the entire data-processing flow of the Map phase.
- ReduceTask: responsible for gathering and processing all of the data in the Reduce phase.

Programming conventions: a user's program is divided into three parts: Mapper, Reducer, and Driver (a minimal skeleton of all three follows this section).

Mapper phase:
- A user-defined Mapper must extend the framework's Mapper parent class.
- The Mapper's input data takes the form of KV pairs.
- The Mapper's business logic is written in the map() method.
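
As a concrete instance of the three-part convention, here is a minimal word-count skeleton in the standard Hadoop API; the class names are illustrative:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

    // Mapper: extends the framework's parent class, consumes and produces
    // KV pairs, business logic lives in map().
    public static class WcMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                ctx.write(word, ONE);
            }
        }
    }

    // Reducer: aggregates every value sharing a key; logic lives in reduce().
    public static class WcReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    // Driver: describes the job and submits it to the cluster.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WcMapper.class);
        job.setReducerClass(WcReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```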

RavenDb MapReduce over subset of Data

Question: Say I have the following document structure in RavenDb:

```csharp
public class Car
{
    public string Manufacturer { get; set; }
    public int BuildYear { get; set; }
    public string Colour { get; set; }
    public string Id { get; set; }
}
```

When the user searches for all cars with colour Red and build year 2010, I want to show them a grouping by manufacturer, such as:

```
Toyota (12)
Mazda (30)
```

given that there are 12 Toyotas and 30 Mazdas that are red in colour with build year 2010. This is a simplified use case. The user can really specify

RavenDB Map/Reduce over property that is a list

Question: I'm just learning Map/Reduce and I'm missing a step. I've read this post (RavenDB Map-Reduce Example using .NET Client) but can't quite make the jump to what I need. I have an object:

```csharp
public class User : IIdentifiable
{
    public User(string username)
    {
        Id = String.Format(@"users/{0}", username);
        Favorites = new List<string>();
    }

    public IList<string> Favorites { get; protected set; }

    public string Id { get; set; }
}
```

What I want to do is Map/Reduce over the Favorites property across all Users.