MapReduce

How to specify the KeyValueTextInputFormat separator in the Hadoop 0.20 API?

ε祈祈猫儿з submitted on 2019-12-28 10:13:48
Question: In the new API (apache.hadoop.mapreduce.KeyValueTextInputFormat), how do I specify a separator (delimiter) other than tab (the default) to split the key and the value? Sample input: one,first line two,second line Output required: Key: one Value: first line Key: two Value: second line I am specifying KeyValueTextInputFormat as: Job job = new Job(conf, "Sample"); job.setInputFormatClass(KeyValueTextInputFormat.class); KeyValueTextInputFormat.addInputPath(job, new Path("/home/input.txt")); This is
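A minimal sketch, not taken from the original question or its answers, of how the separator is typically supplied through the job Configuration. The property name changed between Hadoop generations, so both spellings are set here; verify the exact name against your Hadoop version.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class SeparatorJobSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Use ',' instead of the default tab. Set the property before creating the Job,
        // because Job copies the Configuration at construction time.
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ","); // Hadoop 2.x and later
        conf.set("key.value.separator.in.input.line", ",");                            // older 0.20/1.x builds

        Job job = Job.getInstance(conf, "Sample");
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        KeyValueTextInputFormat.addInputPath(job, new Path("/home/input.txt"));
        // ... set mapper, reducer, and output types as usual, then job.waitForCompletion(true)
    }
}
```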

Large Block Size in HDFS! How is the unused space accounted for?

会有一股神秘感。 submitted on 2019-12-28 09:18:29
Question: We all know that the block size in HDFS is pretty large (64 MB or 128 MB) compared to the block size in traditional file systems. This is done in order to reduce the percentage of seek time relative to transfer time (improvements in transfer rate have been on a much larger scale than improvements in disk seek time; therefore, the goal when designing a file system is always to reduce the number of seeks relative to the amount of data to be transferred). But this comes with an
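A minimal sketch, not part of the original question, that contrasts the configured block size with the bytes a file actually occupies; the path used is purely illustrative. An HDFS block is a logical unit, so a file smaller than the block size only consumes its actual length on the datanodes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockUsageCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        FileStatus status = fs.getFileStatus(new Path("/tmp/small-file.txt"));
        System.out.println("Configured block size: " + status.getBlockSize()); // e.g. 134217728
        System.out.println("Actual file length:    " + status.getLen());       // e.g. 1048576

        // Each block location reports only the bytes the block really holds, so a
        // 1 MB file stored with a 128 MB block size still uses roughly 1 MB of disk.
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("Block holds " + loc.getLength() + " bytes");
        }
    }
}
```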

MongoDB: Terrible MapReduce Performance

那年仲夏 submitted on 2019-12-28 07:39:07
Question: I have a long history with relational databases, but I'm new to MongoDB and MapReduce, so I'm almost positive I must be doing something wrong. I'll jump right into the question; sorry if it's long. I have a database table in MySQL that tracks the number of member profile views for each day. For testing it has 10,000,000 rows. CREATE TABLE `profile_views` ( `id` int(10) unsigned NOT NULL auto_increment, `username` varchar(20) NOT NULL, `day` date NOT NULL, `views` int(10) unsigned default '0',

What is Google's Dremel? How is it different from MapReduce?

二次信任 submitted on 2019-12-28 03:26:06
Question: Google's Dremel is described here. What's the difference between Dremel and MapReduce? Answer 1: Check this article out. Dremel is what the future of Hive should (and will) be. The major issue with MapReduce and the solutions built on top of it, such as Pig and Hive, is that there is an inherent latency between submitting a job and getting the answer. Dremel takes a totally novel approach (presented in Google's 2010 paper) which ...uses a novel query execution engine based on aggregator trees...

Explode the Array of Struct in Hive

白昼怎懂夜的黑 submitted on 2019-12-28 03:26:05
Question: This is the Hive table below: CREATE EXTERNAL TABLE IF NOT EXISTS SampleTable ( USER_ID BIGINT, NEW_ITEM ARRAY<STRUCT<PRODUCT_ID: BIGINT,TIMESTAMPS:STRING>> ) And this is the data in the above table: 1015826235 [{"product_id":220003038067,"timestamps":"1340321132000"},{"product_id":300003861266,"timestamps":"1340271857000"}] Is there any way I can get the below output from HiveQL after exploding the array? **USER_ID** | **PRODUCT_ID** | **TIMESTAMPS** ------------+------------------+------

The distributed computing framework MapReduce

风格不统一 submitted on 2019-12-27 18:10:10
MapReduce overview: MapReduce originates from Google's MapReduce paper, published in December 2004; Hadoop MapReduce can be regarded as an open-source implementation of Google's MapReduce. Its strengths are that it can process massive amounts of data offline and that it is easy to develop with, because the framework encapsulates the details of distributed computing for us. It also has modest hardware requirements and can run on inexpensive machines. MapReduce has drawbacks as well; the main one is that it cannot do real-time stream processing, only offline (batch) processing. MapReduce is a programming model for parallel computation over large data sets (larger than 1 TB). Its core ideas are the concepts "Map" and "Reduce", which are borrowed from functional programming languages, along with features borrowed from vector programming languages. It makes it much easier for programmers with no background in distributed parallel programming to run their programs on a distributed system. In current implementations, the user specifies a Map function that transforms a set of key/value pairs into a new set of key/value pairs, and a Reduce function, run concurrently, that merges all of the mapped key/value pairs sharing the same key. The official MapReduce documentation is available at: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial
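As a minimal illustration of the programming model just described (not part of the original post), here is the skeleton of a Map and a Reduce function in Hadoop's new org.apache.hadoop.mapreduce API; the class names are illustrative.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: turns one input key/value pair into zero or more intermediate pairs.
class SkeletonMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(new Text(value.toString()), new IntWritable(1));
    }
}

// Reduce: receives every intermediate value that shares the same key and merges them.
class SkeletonReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```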

Preprocessing and cleaning web logs with MapReduce

ぃ、小莉子 submitted on 2019-12-27 18:09:48
Web access logs record all of a user's visit, browsing, and click behavior on a website: which link was clicked, which page the user stayed on longest, which search terms were used, the total browsing time, and so on. All of this information can be stored in the site's logs, and analyzing it yields a great deal of information that is critical to running the site; the more complete the collected data, the more accurate the analysis. The logs can be generated through several channels: (1) the web access logs recorded by the site's web server; (2) custom JavaScript embedded in the pages that captures all of the user's interactions (for example where the mouse hovers and which page components are clicked) and sends them to the backend via Ajax requests, which is the method that captures the most complete information; (3) a 1-pixel tracking image embedded in the page, whose request carries the relevant page-visit information to the backend for logging. As for the contents of the log data, in practice the following kinds of data can be collected: (1) visitor system attributes, such as operating system, browser, domain name, and access speed; (2) visit characteristics, including dwell time and the URLs clicked; (3) referral characteristics, including the type of web content, the content category, and the referring URL. Taking website click logs as an example, the click-log format looks like this: 194.237.142.21 - - [18/Sep/2013:06:49:18 +0000] "GET /wp-content/uploads/2013/07/rstudio-git3.png HTTP/1.1" 304 0 "-" "Mozilla/4.0 (compatible;)" 183.49.46.228 -
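A minimal sketch, not the original post's code, of a cleaning Mapper for the access-log format shown above; the field selection, filtering rules, and output layout are illustrative choices rather than the post's actual pipeline.

```java
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WebLogCleanMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    // ip, ident, user, [time], "request", status, bytes, "referer", "user agent"
    private static final Pattern LOG_PATTERN = Pattern.compile(
            "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"");

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Matcher m = LOG_PATTERN.matcher(value.toString());
        if (!m.find()) {
            return; // drop lines that do not match the expected format
        }
        String status = m.group(4);
        if (status.startsWith("4") || status.startsWith("5")) {
            return; // drop client/server error requests during cleaning
        }
        // Emit the cleaned record as a tab-separated line: ip, time, request, status, bytes.
        String cleaned = String.join("\t",
                m.group(1), m.group(2), m.group(3), status, m.group(5));
        context.write(new Text(cleaned), NullWritable.get());
    }
}
```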

The Road to Learning Hadoop (5): Completing WordCount with a MapReduce program

不羁的心 submitted on 2019-12-27 18:03:34
Test text data used by the program: Dear River Dear River Bear Spark Car Dear Car Bear Car Dear Car River Car Spark Spark Dear Spark. 1. Writing the main classes. (1) The Mapper class. First, the code of the custom Mapper class: public class WordCountMap extends Mapper<LongWritable, Text, Text, IntWritable> { public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { // value holds one line of the input text, e.g. "dear bear river" String[] words = value.toString().split("\t"); for (String word : words) { // emit (word, 1) for each occurrence as an intermediate result context.write(new Text(word), new IntWritable(1)); } } } This Map class is a generic type with four type parameters that specify the types of the map() function's input key, input value, output key, and output value, respectively. LongWritable
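The excerpt cuts off before the Reducer and the driver. Here is a minimal sketch of the remaining pieces, assuming the standard WordCount behavior; the class name and the argument-based input/output paths are illustrative rather than taken from the original post.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

    // Reducer: sums the 1s emitted by WordCountMap for each word.
    public static class WordCountReduce
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMap.class);      // the Mapper shown in the excerpt above
        job.setReducerClass(WordCountReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```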

Hive optimization: tuning the number of map and reduce tasks

被刻印的时光 ゝ submitted on 2019-12-27 00:43:32
1. Tuning the number of map tasks in a Hive job. (1) Normally a job produces one or more map tasks from its input directory. The main determining factors are the total number of input files, the size of the input files, and the block size configured for the cluster (currently 128 MB; it can be viewed in Hive with the command set dfs.block.size; and cannot be changed arbitrarily). (2) Examples (the split arithmetic is sketched below): (a) if the input directory contains a single file a of 780 MB, Hadoop splits it into 7 blocks (six 128 MB blocks plus one 12 MB block) and therefore launches 7 map tasks; (b) if the input directory contains three files a, b, c of 10 MB, 20 MB, and 130 MB, Hadoop splits them into 4 blocks (10 MB, 20 MB, 128 MB, 2 MB) and launches 4 map tasks. In other words, a file larger than the block size (128 MB) is split, while a file smaller than the block size is treated as a single block. (3) Are more map tasks always better? No. If a job has many small files (far smaller than the 128 MB block size), each small file is treated as its own block and handled by its own map task, and because starting and initializing a map task takes far longer than its actual processing, this wastes a great deal of resources; moreover, the number of map tasks that can run at the same time is limited. (4) Is it enough to make sure each map task processes a block of close to 128 MB? Not necessarily. For example, a 127 MB file would normally be handled by a single map task, but if that file has only one or two small fields yet tens of millions of records
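A small worked example (mine, not the post's) of the split arithmetic in example (2), assuming each file is split independently and exactly at the block boundary; note that the real FileInputFormat also applies a roughly 10% "split slop", so a 130 MB file may in practice end up in a single split.

```java
public class SplitCount {
    // Ceiling division: how many block-sized splits a file of the given size needs.
    static long splitsFor(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Example (a): one 780 MB file -> six full 128 MB blocks + one 12 MB tail = 7 maps
        System.out.println(splitsFor(780 * mb, 128 * mb));
        // Example (b): files of 10 MB, 20 MB, 130 MB -> 1 + 1 + 2 = 4 maps
        System.out.println(splitsFor(10 * mb, 128 * mb)
                + splitsFor(20 * mb, 128 * mb)
                + splitsFor(130 * mb, 128 * mb));
    }
}
```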