MapReduce

Massive data processing through the lens of the Hadoop framework and the MapReduce model

邮差的信 Submitted on 2019-12-08 18:20:07
Without further ado, a diagram: Map and Reduce from the JVM's perspective (the figure itself is not reproduced in this excerpt).

The Map phase begins by reading data from HDFS.

1. Question: how many Mappers are created to read the data? If there are too many Mappers, a large number of small files is produced, and because each Mapper runs in its own JVM, creating, initializing, and shutting down that many JVMs consumes a great deal of hardware resources. If there are too few Mappers, the degree of parallelism is too low, the job runs too long, and the distributed hardware cannot be fully utilized.

2. What determines the number of Mappers? Three factors: (1) the number of input files, (2) the size of the input files, and (3) the configuration parameters.

The relevant parameters are:
mapreduce.input.fileinputformat.split.minsize // minimum split size for a map task, default 0
mapreduce.input.fileinputformat.split.maxsize // maximum split size for a map task, default 256M
dfs.block.size // HDFS block size, default 64M

The split size is computed as: splitSize = Math.max(minSize, Math.min(maxSize, blockSize));

For example, under the default split settings, a file of 800M with a block size of 128M yields 7 Mappers: six process 128M each and one processes the remaining 32M. A worked sketch follows below.
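A minimal Java sketch of the split-size arithmetic and Mapper count described above; the numbers are the ones from the example, not values read from a real cluster configuration.

public class SplitSizeExample {
    public static void main(String[] args) {
        long minSize = 0L;                   // mapreduce.input.fileinputformat.split.minsize
        long maxSize = 256L * 1024 * 1024;   // mapreduce.input.fileinputformat.split.maxsize
        long blockSize = 128L * 1024 * 1024; // block size used in the example
        long fileSize = 800L * 1024 * 1024;  // the 800M input file

        // The formula quoted above: clamp the block size between min and max split size.
        long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));

        // Number of Mappers = number of splits (ceiling division).
        long mappers = (fileSize + splitSize - 1) / splitSize;
        System.out.println("splitSize=" + splitSize + " bytes, mappers=" + mappers); // prints 7 mappers
    }
}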

Interview questions on data mining, data analysis, and massive data processing (summarizing july's blog)

家住魔仙堡 Submitted on 2019-12-08 18:04:35
Background: I have an interview coming up, so I am reviewing ten questions related to massive data processing. The two blog posts below already cover them very thoroughly, but I am afraid that reading them is not the same as truly understanding them, so I want to restate the material myself.
教你如何迅速秒杀掉:99%的海量数据处理面试题
海量数据处理：十道面试题与十个海量数据处理方法总结
MapReduce技术的初步了解与学习

Interview question categories. The following six approaches cover most interview questions about massive data processing:
1. divide and conquer / hash partitioning + hash counting + heap/quick/merge sort;
2. double-layer bucket partitioning;
3. Bloom filter / Bitmap;
4. Trie / database / inverted index;
5. external sorting;
6. distributed processing with Hadoop/MapReduce.

Below I restate the solutions to the massive data processing questions from the two blog posts above.

Category 1: divide and conquer, hash counting, then sorting.

Question 1: given massive log data, extract the IP that accessed Baidu the most times on a given day.

Answer: the solution has three steps. Divide and conquer / hash partitioning: if the file is too large to read entirely into memory, we must first use a hash function to split it into several small files and then process each small file separately; note that this step is guaranteed to put identical IPs into the same file (see the sketch after this entry). Since the problem statement is brief, I will fill in the scenario myself: the log is roughly what Baidu's own web servers record for requests coming from different IPs, and it is generated per day, so we want to find the IP that accessed the site the most times within one day. The log therefore records the requesting IP and the time, but each request from the same IP is certainly a separate
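A minimal Java sketch of the "hash partition, then count, then take the maximum" recipe from Question 1. The file names, bucket count, and the assumption that the log holds one IP per line are illustrative, not from the original post.

import java.io.*;
import java.util.*;

public class TopIpExample {
    static final int BUCKETS = 1024; // number of small files; tune so each bucket fits in memory

    public static void main(String[] args) throws IOException {
        // Step 1: hash each IP into one of BUCKETS small files, so identical IPs share a file.
        PrintWriter[] buckets = new PrintWriter[BUCKETS];
        for (int i = 0; i < BUCKETS; i++) {
            buckets[i] = new PrintWriter(new FileWriter("bucket-" + i));
        }
        try (BufferedReader log = new BufferedReader(new FileReader("access.log"))) {
            String ip;
            while ((ip = log.readLine()) != null) {
                buckets[(ip.hashCode() & Integer.MAX_VALUE) % BUCKETS].println(ip);
            }
        }
        for (PrintWriter w : buckets) {
            w.close();
        }

        // Steps 2 and 3: count IPs inside each small file with a hash map, keep the global maximum.
        String bestIp = null;
        long bestCount = 0;
        for (int i = 0; i < BUCKETS; i++) {
            Map<String, Long> counts = new HashMap<>();
            try (BufferedReader in = new BufferedReader(new FileReader("bucket-" + i))) {
                String line;
                while ((line = in.readLine()) != null) {
                    counts.merge(line, 1L, Long::sum);
                }
            }
            for (Map.Entry<String, Long> e : counts.entrySet()) {
                if (e.getValue() > bestCount) {
                    bestCount = e.getValue();
                    bestIp = e.getKey();
                }
            }
        }
        System.out.println("Most frequent IP: " + bestIp + " (" + bestCount + " hits)");
    }
}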

How to filter an HBase Scan by part of the row key?

孤者浪人 Submitted on 2019-12-08 17:30:35
Question: I have an HBase table whose row keys consist of a text ID and a timestamp, like this:
...
string_id1.1470913344067
string_id1.1470913345067
string_id2.1470913344067
string_id2.1470913345067
...
How can I filter an HBase Scan (in Scala or Java) to get the rows with a given string ID and a timestamp greater than some value? Thanks

Answer 1: The fuzzy row approach is efficient for this kind of requirement, especially when the data is huge. As explained in this article, FuzzyRowFilter takes as parameters a row key and a mask
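A minimal Java sketch of the fuzzy-row idea. It assumes the ID part of the key ("string_id1.") has a fixed length followed by a 13-digit timestamp; in FuzzyRowFilter's mask, 0 means the byte must match and 1 means any byte is accepted.

import java.util.Arrays;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Pair;

public class FuzzyScanExample {
    public static Scan buildScan() {
        // Template row key: the ID prefix is fixed, the 13 timestamp digits are wildcards.
        byte[] template = Bytes.toBytes("string_id1.0000000000000");
        byte[] mask = new byte[template.length];
        for (int i = 0; i < mask.length; i++) {
            mask[i] = (byte) (i < "string_id1.".length() ? 0 : 1); // 0 = fixed, 1 = any byte
        }
        Scan scan = new Scan();
        scan.setFilter(new FuzzyRowFilter(
                Arrays.asList(new Pair<byte[], byte[]>(template, mask))));
        return scan;
    }
}

The fuzzy filter alone does not express "timestamp greater than some value"; for that part of the question one would additionally set a start row such as "string_id1." + minTimestamp, or combine filters in a FilterList.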

Spark can no longer execute jobs. Executors fail to create directory

こ雲淡風輕ζ Submitted on 2019-12-08 16:49:38
Question: We've had a small Spark cluster running for a month that has been successfully executing jobs and letting me start a spark-shell against the cluster. Whether I submit a job to the cluster or connect to it with the shell, the error is always the same.
[root@~]$ $SPARK_HOME/bin/spark-shell
Spark assembly has been built with Hive, including Datanucleus jars on classpath
14/11/10 20:43:01 INFO spark.SecurityManager: Changing view acls to: root,
14/11/10 20:43:01 INFO spark.SecurityManager

Big data (bigdata) practice exercises

試著忘記壹切 Submitted on 2019-12-08 16:39:46
1. In the root directory of the HDFS file system, recursively create the directory "1daoyun/file", upload the attached BigDataSkills.txt file into the 1daoyun/file directory, and use the relevant command to list the files under 1daoyun/file.
Answer:
[root@master MapReduce]# hadoop fs -mkdir -p /1daoyun/file
[root@master MapReduce]# hadoop fs -put BigDataSkills.txt /1daoyun/file
[root@master MapReduce]# hadoop fs -ls /1daoyun/file
Found 1 items
-rw-r--r-- 3 root hdfs 1175 2018-02-12 08:01 /1daoyun/file/BigDataSkills.txt

2. In the root directory of the HDFS file system, recursively create the directory "1daoyun/file" and upload the attached BigDataSkills.txt file into the 1daoyun/file directory, this time setting the replication factor of BigDataSkills.txt in HDFS to 2 during the upload, and use the fsck tool to check the number of replicas of its storage blocks.
Answer:
[root@master MapReduce]#

Why is the LongWritable key not used in the Mapper class?

隐身守侯 Submitted on 2019-12-08 16:03:36
Mapper: The Mapper class is a generic type, with four formal type parameters that specify the input key, input value, output key, and output value types of the map function.

public class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final int MISSING = 9999;

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String year = line.substring(15, 19);
        int airTemperature;
        if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
            airTemperature =
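On the question in the title: with the default TextInputFormat, the LongWritable key handed to map() is simply the byte offset of the current line within the input file. The example only needs the line's text, so the key is accepted but never read. A purely illustrative mapper that does use the offset might look like this:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class OffsetAwareMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key is the byte offset of this line within the file; here it is just passed through.
        context.write(key, value);
    }
}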

Pipelining Hadoop MapReduce jobs

大憨熊 Submitted on 2019-12-08 14:44:44
I have five MapReduce jobs that I currently run separately. I want to pipeline them together, so that the output of one job becomes the input of the next. At the moment I use a shell script to execute them all. Is there a way to write this in Java? Please provide an example. Thanks

Jeff Hammerbacher: You may find JobControl to be the simplest method for chaining these jobs together. For more complex workflows, I'd recommend checking out Oozie.

Hi, I had a similar requirement. One way to do this is, after submitting the first job, to execute the following:
Job job1 = new Job(getConf());
job1.waitForCompletion(true);
and then check
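A minimal sketch of chaining two jobs in Java by waiting for the first job to finish and pointing the second job at its output directory. The class name, paths, and job names are illustrative assumptions; mapper and reducer setup is elided.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoStagePipeline {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path("/data/in");
        Path intermediate = new Path("/data/stage1");
        Path output = new Path("/data/out");

        Job job1 = Job.getInstance(conf, "stage 1");
        // job1.setJarByClass(...), job1.setMapperClass(...), job1.setReducerClass(...), etc.
        FileInputFormat.addInputPath(job1, input);
        FileOutputFormat.setOutputPath(job1, intermediate);
        if (!job1.waitForCompletion(true)) {
            System.exit(1); // stop the pipeline if stage 1 fails
        }

        Job job2 = Job.getInstance(conf, "stage 2");
        // job2.setJarByClass(...), job2.setMapperClass(...), job2.setReducerClass(...), etc.
        FileInputFormat.addInputPath(job2, intermediate); // stage 1 output feeds stage 2
        FileOutputFormat.setOutputPath(job2, output);
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}

For longer chains, JobControl (with ControlledJob dependencies) or Oozie, both mentioned in the answer, express the job dependencies declaratively instead of hard-coding the order.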

Custom Partitioner: N keys to N different files

血红的双手。 Submitted on 2019-12-08 13:42:51
Question: My requirement is to write a custom partitioner. I have N keys coming from the mapper, for example ('jsa', 'msa', 'jbac'). The length is not fixed; it can in fact be any word. I need to write the partitioner in such a way that it collects all the data for the same key into the same file. The number of keys is not fixed. Thank you in advance. Thanks, Sathish.

Answer 1: So you have multiple keys being output by the mapper, and you want a different reducer for each key, with a separate file
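A minimal sketch of a custom partitioner along those lines. It assumes the number of reduce tasks is at least the number of distinct keys; if there are fewer reducers, several keys will share a file, and MultipleOutputs in the reducer is the usual alternative.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class KeyToFilePartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        // Every record with the same key hashes to the same partition, so all of its
        // data reaches the same reducer and therefore ends up in the same output file.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

The partitioner is registered with job.setPartitionerClass(KeyToFilePartitioner.class), and the number of output files is governed by job.setNumReduceTasks(...).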

ClassNotFoundException while executing a MapReduce program

这一生的挚爱 Submitted on 2019-12-08 13:36:08
Question: I was trying to execute the word count program in Eclipse, but while running it I get the following error:
log4j:ERROR Could not instantiate class [org.apache.hadoop.log.metrics.EventCounter].
java.lang.ClassNotFoundException: org.apache.hadoop.log.metrics.EventCounter
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader

How to Map() a function recursively (through nested lists) in R?

穿精又带淫゛_ Submitted on 2019-12-08 13:21:01
Question: Disclaimer: this is not a duplicate of this question: How to combine rapply() and mapply(), or how to use mapply/Map recursively? On top of that question, I am asking how to pass extra function arguments through the recursion. So I have these lists:
A = list(list(list(c(1,2,3), c(2,3,4)), list(c(1,2,3), c(2,3,4))), list(c(4,3,2), c(3,2,1)))
B = list(list(list(c(1,2,3), c(2,3,4)), list(c(1,2,3), c(2,3,4))), list(c(4,3,2), c(3,2,1)))
And I need to apply different functions to them recursively that