MapReduce

MapReduce Framework Principles, Part 1: InputFormat Data Input

Submitted by 可紊 on 2020-01-12 20:15:23
Contents: 1 Splits and the mechanism that determines MapTask parallelism; 2 FileInputFormat split source-code analysis; 3 CombineTextInputFormat split mechanism; 3.1 TextInputFormat; 3.2 KeyValueTextInputFormat; 3.3 NLineInputFormat; 3.4 KeyValueTextInputFormat; 4 Custom InputFormat; 4.1 Custom InputFormat demo; 4.2 Code implementation

1 Splits and the mechanism that determines MapTask parallelism

(1) Data block: a Block is how HDFS physically divides the data into chunks.
(2) Data split: a split only partitions the input logically; the data is not actually cut into separate pieces on disk.

Why split at the default block size? Suppose we have a 300 MB file, xx.iso.
(1) Splitting at the default block size (128 MB) gives three splits; these three blocks are stored on three different nodes, and the first two blocks are completely full.
(2) Splitting into equal parts of 100 MB each also gives three splits a, b, c, but then the layout becomes awkward: since the default HDFS block size is 128 MB, on DataNode1 one block stores split a in its 0~100 MB range…
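To make the split-size rule concrete, here is a minimal Java sketch, assuming Hadoop's standard FileInputFormat behaviour (splitSize = max(minSize, min(maxSize, blockSize))). It deliberately ignores details such as the 1.1x split slop, so treat it as an illustration rather than the real source code.

// Sketch of how FileInputFormat derives the split size; the real logic lives in
// org.apache.hadoop.mapreduce.lib.input.FileInputFormat.
public class SplitSizeDemo {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        // splitSize = max(minSize, min(maxSize, blockSize))
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;   // HDFS default block size: 128 MB
        long fileSize  = 300L * 1024 * 1024;   // the 300 MB xx.iso from the example
        long splitSize = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);

        // With default settings the split size equals the block size,
        // so the 300 MB file yields splits of 128 MB, 128 MB and 44 MB.
        for (long remaining = fileSize; remaining > 0; remaining -= Math.min(splitSize, remaining)) {
            System.out.println("split of " + Math.min(splitSize, remaining) / (1024 * 1024) + " MB");
        }
    }
}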

MapReduce Framework Principles, Part 2: The MapReduce Workflow

Submitted by 老子叫甜甜 on 2020-01-12 18:25:30
Contents: MapReduce workflow; 1. Flow diagrams; 2. Detailed walkthrough; 3. The shuffle mechanism; 3.1 In the MapTask; 3.2 In the ReduceTask; 3.3 Partition; 3.4 WritableComparable sorting; 3.5 Combiner merging; 3.6 GroupingComparator grouping (auxiliary sort)

1. Flow diagrams: MapReduce flow diagram (1), MapReduce flow diagram (2)

2. Detailed walkthrough (the MapReduce execution mechanism). What follows is only my personal understanding, written down to help remember the MapReduce workflow; there are many more details in practice, and corrections are welcome.

In fact, after the Driver calls job.waitForCompletion, the client does not submit the job to YARN right away. Before submitting, the client first obtains, via reflection, the InputFormat the job will use, derives the logical split rules from it, and records those rules in a local file. On Windows this file sits under C:\tmp\hadoop-PC_NAME\mapred\staging\PC_NAMEJOBID\.staging\job_localJOBID (it is deleted once the job finishes). (Note: the InputFormat only specifies the logical split rules; it does not perform the physical splitting…
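As a reference point for the submission step described above, here is a minimal driver sketch; WordCountMapper and WordCountReducer are hypothetical user classes and the paths are placeholders.

// Minimal driver sketch for the job-submission step described above.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // The splits are computed and written to the staging directory here,
        // before the job is handed to YARN (or to the LocalJobRunner).
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}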

Efficient way to delete multiple rows in HBase

Submitted by 爷，独闯天下 on 2020-01-12 17:23:26
Question: Is there an efficient way to delete multiple rows in HBase, or does my use case smell like it is not suitable for HBase? There is a table, say 'chart', which contains items that are in charts. Row keys are in the following format:

chart|date_reversed|ranked_attribute_value_reversed|content_id

Sometimes I want to regenerate the chart for a given date, so I want to delete all rows from 'chart|date_reversed_1' up to 'chart|date_reversed_2'. Is there a better way than to issue a Delete for each row?
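One pattern commonly suggested for this kind of key-range cleanup (a sketch under the HBase 2.x client API, not an answer quoted from the post) is to scan only the row keys in the range and send the Deletes as one batch; the table name and boundary keys below mirror the question and are placeholders.

// Sketch: scan the row-key range and issue the Deletes in one batch.
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.KeyOnlyFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class RangeDelete {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("chart"))) {

            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes("chart|date_reversed_1"))
                    .withStopRow(Bytes.toBytes("chart|date_reversed_2"));
            scan.setFilter(new KeyOnlyFilter());   // row keys are enough to build the Deletes

            List<Delete> deletes = new ArrayList<>();
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    deletes.add(new Delete(r.getRow()));
                }
            }
            table.delete(deletes);                 // batched instead of one RPC per row
        }
    }
}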

HBase : get(…) vs scan and in-memory table

Submitted by 徘徊边缘 on 2020-01-12 15:31:26
Question: I'm executing MR over HBase. The business logic in the reducer heavily accesses two tables, say T1 (40k rows) and T2 (90k rows). Currently, I'm executing the following steps:

1. In the constructor of the reducer class, doing something like this:

HBaseCRUD hbaseCRUD = new HBaseCRUD();
HTableInterface t1 = hbaseCRUD.getTable("T1", "CF1", null, "C1", "C2");
HTableInterface t2 = hbaseCRUD.getTable("T2", "CF1", null, "C1", "C2");

In the reduce(...):

String lowercase = ....;
/* Start : HBase code */
/*
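Because T1 and T2 are small (tens of thousands of rows), a frequently recommended alternative to per-record get() calls is to pre-load the lookup data into an in-memory map in the reducer's setup(). The sketch below illustrates that idea under the table and column names assumed from the question; the key normalization and value handling are hypothetical.

// Sketch: pre-load the small lookup table into a HashMap in setup(),
// so reduce() does in-memory lookups instead of per-record HBase Gets.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class LookupReducer extends Reducer<Text, Text, Text, Text> {
    private final Map<String, String> t1Cache = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table t1 = conn.getTable(TableName.valueOf("T1"));
             ResultScanner scanner = t1.getScanner(
                     new Scan().addColumn(Bytes.toBytes("CF1"), Bytes.toBytes("C1")))) {
            for (Result r : scanner) {
                t1Cache.put(Bytes.toString(r.getRow()),
                        Bytes.toString(r.getValue(Bytes.toBytes("CF1"), Bytes.toBytes("C1"))));
            }
        }
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String lookedUp = t1Cache.get(key.toString().toLowerCase());  // in-memory, no RPC
        if (lookedUp != null) {
            context.write(key, new Text(lookedUp));
        }
    }
}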

MapReduce Output ArrayWritable

Submitted by ╄→гoц情女王★ on 2020-01-12 14:29:50
Question: I'm trying to get an output from an ArrayWritable in a simple MapReduce task. I found a few questions with a similar problem, but I can't solve the problem in my own code, so I'm looking forward to your help. Thanks :)!

Input: a text file with some sentences.
Output should be: <Word, <length, number of same words in the text file>>
Example: Hello 5 2

The output that I get in my job is:

hello WordLength_V01$IntArrayWritable@221cf05
test WordLength_V01$IntArrayWritable@799e525a

I think the problem is in
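The usual explanation for output like IntArrayWritable@221cf05 is that ArrayWritable does not override toString(), so TextOutputFormat prints the default object identity. A minimal sketch of the standard workaround, a subclass with its own toString(), is shown below; the tab-separated formatting is an assumption about the desired output.

// Sketch: give the array writable a readable toString() so TextOutputFormat
// prints the values instead of the object identity.
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;

public class IntArrayWritable extends ArrayWritable {
    public IntArrayWritable() {
        super(IntWritable.class);            // no-arg constructor required by Hadoop
    }

    public IntArrayWritable(IntWritable[] values) {
        super(IntWritable.class, values);
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (Writable w : get()) {
            sb.append(w.toString()).append('\t');   // e.g. "5" and "2" -> "Hello 5 2" in the output
        }
        return sb.toString().trim();
    }
}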

Python Tutorial: map and reduce

Submitted by 随声附和 on 2020-01-12 08:18:21
1. The map() function

map() takes two arguments: a function and an Iterable. map applies the function to each element of the sequence in turn and returns the results as a new Iterator.

def f(x):
    return x*x
r = map(f, [1,2,3,4,5])
list(r)
Out: [1, 4, 9, 16, 25]

2. The reduce() function

If you want to turn the sequence [1, 2, 3, 4, 5, 6] into the integer 123456, reduce comes in handy:

from functools import reduce
def fn(x, y):
    return x*10+y
reduce(fn, [1,2,3,4,5,6])
Out: 123456

Exercise 1: Use map() to turn irregularly capitalized English names entered by users into properly capitalized names (first letter uppercase, the rest lowercase). Input: ['adam', 'LISA', 'barT'], output: ['Adam', 'Lisa', 'Bart']:

L1 = ['adam', 'LISA', 'barT']
def normalize(name):
    name = name[0].upper() + name[1:].lower()
    return name
L2 = list(map(normalize, L1))
L2
Out: ['Adam', 'Lisa', 'Bart']

Exercise 2

What is the most efficient way to do a sorted reduce in PySpark?

Submitted by 浪子不回头ぞ on 2020-01-12 07:35:28
Question: I am analyzing on-time performance records of US domestic flights from 2015. I need to group by tail number and store a date-sorted list of all the flights for each tail number in a database, to be retrieved by my application. I am not sure which of two options for achieving this is the best one.

# Load the parquet file
on_time_dataframe = sqlContext.read.parquet('../data/on_time_performance.parquet')
# Filter down to the fields we need to identify and link to a flight
flights = on_time

Understanding How MapReduce Works

Submitted by 房东的猫 on 2020-01-12 06:10:48
A first look at MapReduce

The relationship between Hadoop and MapReduce: Hadoop provides a reliable shared storage and analysis system; storage is handled by HDFS and analysis by MapReduce.

Understanding MapReduce: a MapReduce job runs in two phases, the map phase and the reduce phase, and each phase takes key/value pairs as its input and output.

A MapReduce job is the unit of work submitted by the client; it consists of the input data, the MapReduce program and the configuration information. Hadoop runs the job by dividing it into small tasks, namely map tasks and reduce tasks.

Two kinds of nodes control job execution: a jobtracker and a number of tasktrackers. The jobtracker coordinates all jobs running on the system by scheduling tasks to run on tasktrackers. Tasktrackers run the tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it on another tasktracker.

Hadoop divides the input into fixed-size pieces called input splits and creates one map task per split; that task runs the user-defined map function over every record in the split (see the sketch below).

The optimal split size equals the HDFS block size: it is the largest amount of data guaranteed to be stored on a single node. If a split spanned two blocks, part of the data would have to be transferred over the network, which is slower than running the map task entirely on local data…
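To anchor the "one map task per split, running the user-defined map function over each record" step, here is a minimal word-count-style Mapper sketch; the class name and tokenization are illustrative only.

// Minimal mapper sketch: each map task runs this map() over every record of its input split.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // One call per record (here, one line) of the split assigned to this map task.
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // key/value pairs flow on to the shuffle and reduce phase
            }
        }
    }
}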

Pass a Delete or a Put error in hbase mapreduce

Submitted by 放肆的年华 on 2020-01-11 13:34:11
Question: I am getting the error below while running MapReduce on HBase:

java.io.IOException: Pass a Delete or a Put
    at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:125)
    at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:84)
    at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:639)
    at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
    at
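This exception comes from TableOutputFormat when the value written to the context is not a Put or a Delete. Below is a minimal sketch of the usual fix, with placeholder row-key, column-family and qualifier names; the job would also need to be wired to TableOutputFormat (for example via TableMapReduceUtil.initTableReducerJob), which is not shown here.

// Sketch: with TableOutputFormat, the value written to the context must be a Put (or Delete).
import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class HBaseWriteMapper
        extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        byte[] rowKey = Bytes.toBytes(value.toString());
        Put put = new Put(rowKey);
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("1"));
        // Writing anything other than a Put/Delete here is what triggers
        // "java.io.IOException: Pass a Delete or a Put" in TableOutputFormat.
        context.write(new ImmutableBytesWritable(rowKey), put);
    }
}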