MapReduce

Five Core Technologies Every Big Data Developer Must Master

痴心易碎 Submitted on 2020-01-02 17:07:53
Big data technology is a vast and complex ecosystem. Its foundations span data collection, data preprocessing, distributed storage, NoSQL databases, data warehousing, machine learning, parallel computing, visualization, and other technical domains at different layers of the stack. As a starting point, a generalized big data processing framework breaks down into the following stages: data collection and preprocessing, data storage, data cleaning, data query and analysis, and data visualization.

1. Data Collection and Preprocessing

Data arrives from many sources, including mobile internet data and social network data. These structured and unstructured records start out scattered, forming so-called data silos, and in that state they carry little value on their own. Data collection writes this data into a data warehouse, consolidating the scattered pieces so they can be analyzed together. Collection covers file log collection, database log collection, ingestion from relational databases, integration with applications, and so on. While data volume is small, a scheduled script can write logs into the storage system, but as volume grows such methods offer no guarantee of data safety and become hard to operate, so a more robust solution is needed.

Flume NG is a real-time log collection system. It supports plugging custom data senders into the logging pipeline to collect data, performs simple processing on it, and writes it out to a variety of receivers (such as text files, HDFS, or HBase). Flume NG uses a three-layer architecture, an Agent layer, a Collector layer, and a Store layer, each of which can scale out horizontally. An Agent consists of a Source, a Channel, and a Sink; the source consumes (collects) data from the data source into the channel component
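To make the Agent's Source, Channel, and Sink pipeline concrete, here is a minimal sketch of a single-agent Flume NG configuration; the agent name (a1), the tailed log path, and the HDFS target are hypothetical placeholders, not taken from the text above.

    # Minimal Flume NG agent: tail a log file into HDFS (names/paths are assumptions)
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # Source: consume lines from an application log into the channel
    a1.sources.r1.type = exec
    a1.sources.r1.command = tail -F /var/log/app/app.log
    a1.sources.r1.channels = c1

    # Channel: buffer events in memory between source and sink
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 10000

    # Sink: drain the channel to the store layer (HDFS)
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/logs
    a1.sinks.k1.channel = c1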

How to count occurrences of each value in an array?

对着背影说爱祢 Submitted on 2020-01-02 15:51:36
Question: I have a database of ISSUES in MongoDB; some of the issues have comments, which is an array, and each comment has a writer. How can I count the number of comments each writer has written? I've tried

    db.test.issues.group( {
        key = "comments.username":true;
        initial: {sum:0},
        reduce: function(doc, prev) {prev.sum += 1},
    } );

but no luck :( A sample:

    {
        "_id" : ObjectId("50f48c179b04562c3ce2ce73"),
        "project" : "Ruby Driver",
        "key" : "RUBY-505",
        "title" : "GETMORE is sent to wrong server if an
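The group() call above is not valid shell syntax (key takes a document, and the = and ; do not belong there). One way to get per-writer counts is the aggregation framework; the following is a minimal sketch, assuming the collection is db.issues and each comment stores its author in comments.username:

    db.issues.aggregate([
        // emit one document per element of the comments array
        { $unwind: "$comments" },
        // group those documents by writer and count them
        { $group: { _id: "$comments.username", count: { $sum: 1 } } }
    ])

Each result document then carries the writer's name as _id and that writer's comment total as count.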

How to do a cross join / Cartesian product in RavenDB?

别来无恙 Submitted on 2020-01-02 15:03:13
Question: I have a web application that uses RavenDB on the backend and allows the user to keep track of inventory. The three entities in my domain are:

    public class Location {
        string Id;
        string Name;
    }

    public class ItemType {
        string Id;
        string Name;
    }

    public class Item {
        string Id;
        DenormalizedRef<Location> Location;
        DenormalizedRef<ItemType> ItemType;
    }

On my web app, there is a page for the user to see a summary breakdown of the inventory they have at the various locations. Specifically, it shows the

Exception in thread "main" java.lang.VerifyError: Bad type on operand stack

瘦欲@ Submitted on 2020-01-02 11:59:50
Question: This error occurred in a map-reduce program that finds the maximum temperature in a given input.txt file; the input has two columns, year and temperature.

    Exception in thread "main" java.lang.VerifyError: Bad type on operand stack
    Exception Details:
      Location:
        org/apache/hadoop/mapred/JobTrackerInstrumentation.create(Lorg/apache/hadoop/mapred/JobTracker;Lorg/apache/hadoop/mapred/JobConf;)Lorg/apache/hadoop/mapred/JobTrackerInstrumentation; @5: invokestatic
      Reason:
        Type 'org/apache/hadoop/metrics2

Get Line number in map method using FileInputFormat

生来就可爱ヽ(ⅴ<●) Submitted on 2020-01-02 10:19:37
Question: I was wondering whether it is possible to get the line number in my map method? My input file is just a single column of values, like:

    Apple
    Orange
    Banana

Is it possible to get key: 1, value: Apple; key: 2, value: Orange; ... in my map method? Using CDH3/CDH4. Changing the input data so as to use KeyValueInputFormat is not an option. Thanks ahead.

Answer 1: The default behaviour of InputFormats such as TextInputFormat is to give the byte offset of the record rather than the actual line number -
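Because TextInputFormat keys are byte offsets, one workaround (a sketch, not taken from the thread above) is to count records inside the mapper. Note the caveat: the counter only matches real line numbers when a single mapper reads the whole file, i.e. the input is not split across mappers.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class LineNumberMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        private long lineNumber = 0; // valid only if this mapper sees the whole file

        @Override
        protected void map(LongWritable byteOffset, Text value, Context context)
                throws IOException, InterruptedException {
            lineNumber++; // count records instead of using the byte-offset key
            context.write(new LongWritable(lineNumber), value);
        }
    }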

How to remove the r-00000 extension from reducer output in MapReduce

谁说胖子不能爱 Submitted on 2020-01-02 10:03:30
Question: I am able to rename my reducer output file correctly, but r-00000 is still persisting. I have used MultipleOutputs in my reducer class. Here are the details; I am not sure what I am missing or what extra I have to do.

    public class MyReducer extends Reducer<NullWritable, Text, NullWritable, Text> {
        private Logger logger = Logger.getLogger(MyReducer.class);
        private MultipleOutputs<NullWritable, Text> multipleOutputs;
        String strName = "";

        public void setup(Context context) {
            logger.info(
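A common cause of a lingering part-r-00000 file when all real output goes through MultipleOutputs is that the job's default output format still creates (empty) part files. One fix, sketched below rather than taken from the question, is to register the output format lazily in the job driver, so the default file is only created if something is actually written to it:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class Driver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "multiple-outputs-job");
            // ... set mapper, reducer, input/output paths as usual ...

            // Instead of job.setOutputFormatClass(TextOutputFormat.class):
            // the default part-r-xxxxx writer is now only instantiated on
            // first write, so reducers that write solely through
            // MultipleOutputs leave no empty part-r-00000 behind.
            LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
        }
    }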

Hadoop File Splits: CompositeInputFormat: Inner Join

与世无争的帅哥 Submitted on 2020-01-02 08:15:45
Question: I am using CompositeInputFormat to provide input to a Hadoop job. The number of splits generated is the total number of files given as input to CompositeInputFormat (for joining). The job completely ignores the block size and max split size while taking input from CompositeInputFormat. This results in long-running map tasks and makes the system slow, since the input files are larger than the block size. Is anyone aware of any way through which the number of splits can be managed
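For context, a typical map-side inner join with CompositeInputFormat is wired up as in the sketch below (the paths and the key-value input format are assumptions; this uses the old org.apache.hadoop.mapred API, where the join expression is built with CompositeInputFormat.compose). Each map task is handed one partition from each joined input, which is why the split count tracks the file/partition count rather than the block size:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.KeyValueTextInputFormat;
    import org.apache.hadoop.mapred.join.CompositeInputFormat;

    public class JoinConfig {
        public static void configure(JobConf conf) {
            // Inner join over two sorted, identically partitioned datasets.
            conf.setInputFormat(CompositeInputFormat.class);
            conf.set("mapred.join.expr", CompositeInputFormat.compose(
                    "inner", KeyValueTextInputFormat.class,
                    new Path("/input/a"), new Path("/input/b")));
        }
    }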

Hadoop mapper reading from 2 different source input files

一曲冷凌霜 Submitted on 2020-01-02 07:23:08
Question: I have a tool which chains a lot of Mappers & Reducers, and at some point I need to merge the results of previous map-reduce steps. For example, as input I have two files with data:

    /input/a.txt
    apple,10
    orange,20

    /input/b.txt
    apple;5
    orange;40

The result should be c.txt, where c.value = a.value * b.value:

    /output/c.txt
    apple,50 // 10 * 5
    orange,800 // 40 * 20

How could it be done? I've resolved this by introducing a simple Key => MyMapWritable (type=1,2, value), and merging (actually, multiplying) data
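What the asker describes is a reduce-side join: the mapper tags every record with the file it came from, and the reducer combines the values that meet at each key. Below is a minimal sketch of that idea under the assumptions of the example above (a.txt comma-delimited, b.txt semicolon-delimited); it tags records by input file name via FileSplit instead of the asker's MyMapWritable:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // Mapper: normalize both file formats to (fruit, number) pairs.
    public class MergeMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String file = ((FileSplit) context.getInputSplit()).getPath().getName();
            // a.txt uses "," as the delimiter, b.txt uses ";"
            String[] parts = line.toString().split(file.equals("a.txt") ? "," : ";");
            context.write(new Text(parts[0]), new Text(parts[1]));
        }
    }

    // Reducer: each key receives one value per source file; multiply them.
    class MergeReducer extends Reducer<Text, Text, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            long product = 1;
            for (Text v : values) {
                product *= Long.parseLong(v.toString().trim());
            }
            context.write(key, new LongWritable(product));
        }
    }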

How to suppress Hadoop logging messages on the console

老子叫甜甜 Submitted on 2020-01-02 06:19:14
Question: These are the Hadoop logging messages I was trying to suppress:

    11/10/17 19:42:23 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
    11/10/17 19:42:23 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
    11/10/17 19:42:23 INFO mapred.MapTask: soft limit at 83886080
    11/10/17 19:42:23 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
    11/10/17 19:42:23 INFO mapred.MapTask: kvstart = 26214396; length = 6553600

I suppose they are configured by log4j.properties under the conf directory
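Assuming the stock conf/log4j.properties layout the question refers to, one way to quiet these lines is to raise the log level for the chatty logger (or for all of Hadoop); a sketch:

    # Show only WARN and above from the map-task internals
    log4j.logger.org.apache.hadoop.mapred.MapTask=WARN

    # Or, more broadly, raise the level for everything under org.apache.hadoop
    log4j.logger.org.apache.hadoop=WARN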

OpenJDK Client VM - Cannot allocate memory

对着背影说爱祢 Submitted on 2020-01-02 05:28:32
Question: I am running a Hadoop map-reduce job on a cluster and I am getting this error:

    OpenJDK Client VM warning: INFO: os::commit_memory(0x79f20000, 104861696, 0) failed;
    error='Cannot allocate memory' (errno=12)

    There is insufficient memory for the Java Runtime Environment to continue.
    Native memory allocation (malloc) failed to allocate 104861696 bytes for committing reserved memory.

What should I do?

Answer 1: Make sure you have swap space on your machine:

    ubuntu@VM-ubuntu:~$ free -m
                 total       used       free     shared
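Besides checking swap, a related mitigation (an assumption on my part, not from the truncated answer) is to shrink the heap that each task JVM asks the OS to commit, so tasks fit into the memory the node actually has. In a Hadoop 1.x / MRv1 job driver that looks like:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SmallHeapDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Request a smaller per-task JVM heap so the native
            // commit_memory call succeeds on memory-constrained nodes.
            conf.set("mapred.child.java.opts", "-Xmx256m");
            Job job = new Job(conf, "small-heap-job");
            // ... set mapper, reducer, and paths as usual ...
        }
    }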