MapReduce

MapReduce Framework Principles, Part 1: InputFormat Data Input

Submitted by 可紊 on 2020-01-12 20:15:23
Contents: 1 Splits and the mechanism that determines MapTask parallelism; 2 FileInputFormat split source-code analysis; 3 CombineTextInputFormat split mechanism; 3.1 TextInputFormat; 3.2 KeyValueTextInputFormat; 3.3 NLineInputFormat; 3.4 KeyValueTextInputFormat; 4 Custom InputFormat; 4.1 Custom InputFormat demo; 4.2 Code implementation

1 Splits and the mechanism that determines MapTask parallelism

(1) Data block: a Block is how HDFS physically divides the data into chunks.
(2) Data split: a split only partitions the input logically; the data is not actually cut into separate pieces on disk.

Why split at the default block size? Suppose we have a 300 MB file, xx.iso.
(1) Splitting at the default block size (128 MB) gives three splits; these three blocks are stored on three different nodes, and the first two blocks are completely full.
(2) Splitting into equal parts of 100 MB each also gives three splits a, b, c, but then the layout becomes awkward: since the default HDFS block size is 128 MB, on DataNode1 one block stores split a in its 0~100 MB range…
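To make the split-size rule concrete, here is a minimal Java sketch, assuming Hadoop's standard FileInputFormat behaviour (splitSize = max(minSize, min(maxSize, blockSize))). It deliberately ignores details such as the 1.1x split slop, so treat it as an illustration rather than the real source code.

// Sketch of how FileInputFormat derives the split size; the real logic lives in
// org.apache.hadoop.mapreduce.lib.input.FileInputFormat.
public class SplitSizeDemo {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        // splitSize = max(minSize, min(maxSize, blockSize))
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;   // HDFS default block size: 128 MB
        long fileSize  = 300L * 1024 * 1024;   // the 300 MB xx.iso from the example
        long splitSize = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);

        // With default settings the split size equals the block size,
        // so the 300 MB file yields splits of 128 MB, 128 MB and 44 MB.
        for (long remaining = fileSize; remaining > 0; remaining -= Math.min(splitSize, remaining)) {
            System.out.println("split of " + Math.min(splitSize, remaining) / (1024 * 1024) + " MB");
        }
    }
}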

MapReduce Framework Principles, Part 2: The MapReduce Workflow

Submitted by 老子叫甜甜 on 2020-01-12 18:25:30
Contents: MapReduce workflow; 1. Flow diagrams; 2. Detailed walkthrough; 3. The shuffle mechanism; 3.1 In the MapTask; 3.2 In the ReduceTask; 3.3 Partition; 3.4 WritableComparable sorting; 3.5 Combiner merging; 3.6 GroupingComparator grouping (auxiliary sort)

1. Flow diagrams: MapReduce flow diagram (1), MapReduce flow diagram (2)

2. Detailed walkthrough (the MapReduce execution mechanism). What follows is only my personal understanding, written down to help remember the MapReduce workflow; there are many more details in practice, and corrections are welcome.

In fact, after the Driver calls job.waitForCompletion, the client does not submit the job to YARN right away. Before submitting, the client first obtains, via reflection, the InputFormat the job will use, derives the logical split rules from it, and records those rules in a local file. On Windows this file sits under C:\tmp\hadoop-PC_NAME\mapred\staging\PC_NAMEJOBID\.staging\job_localJOBID (it is deleted once the job finishes). (Note: the InputFormat only specifies the logical split rules; it does not perform the physical splitting…
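As a reference point for the submission step described above, here is a minimal driver sketch; WordCountMapper and WordCountReducer are hypothetical user classes and the paths are placeholders.

// Minimal driver sketch for the job-submission step described above.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // The splits are computed and written to the staging directory here,
        // before the job is handed to YARN (or to the LocalJobRunner).
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}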

Efficient way to delete multiple rows in HBase

Submitted by 爷，独闯天下 on 2020-01-12 17:23:26
Question: Is there an efficient way to delete multiple rows in HBase, or does my use case smell like it is not suitable for HBase? There is a table, say 'chart', which contains items that are in charts. Row keys are in the following format:

chart|date_reversed|ranked_attribute_value_reversed|content_id

Sometimes I want to regenerate the chart for a given date, so I want to delete all rows from 'chart|date_reversed_1' up to 'chart|date_reversed_2'. Is there a better way than to issue a Delete for each row?
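One pattern commonly suggested for this kind of key-range cleanup (a sketch under the HBase 2.x client API, not an answer quoted from the post) is to scan only the row keys in the range and send the Deletes as one batch; the table name and boundary keys below mirror the question and are placeholders.

// Sketch: scan the row-key range and issue the Deletes in one batch.
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.KeyOnlyFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class RangeDelete {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("chart"))) {

            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes("chart|date_reversed_1"))
                    .withStopRow(Bytes.toBytes("chart|date_reversed_2"));
            scan.setFilter(new KeyOnlyFilter());   // row keys are enough to build the Deletes

            List<Delete> deletes = new ArrayList<>();
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    deletes.add(new Delete(r.getRow()));
                }
            }
            table.delete(deletes);                 // batched instead of one RPC per row
        }
    }
}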

HBase : get(…) vs scan and in-memory table

Submitted by 徘徊边缘 on 2020-01-12 15:31:26
Question: I'm executing MR over HBase. The business logic in the reducer heavily accesses two tables, say T1 (40k rows) and T2 (90k rows). Currently, I'm executing the following steps:

1. In the constructor of the reducer class, doing something like this:

HBaseCRUD hbaseCRUD = new HBaseCRUD();
HTableInterface t1 = hbaseCRUD.getTable("T1", "CF1", null, "C1", "C2");
HTableInterface t2 = hbaseCRUD.getTable("T2", "CF1", null, "C1", "C2");

In the reduce(...):

String lowercase = ....;
/* Start : HBase code */
/*
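Because T1 and T2 are small (tens of thousands of rows), a frequently recommended alternative to per-record get() calls is to pre-load the lookup data into an in-memory map in the reducer's setup(). The sketch below illustrates that idea under the table and column names assumed from the question; the key normalization and value handling are hypothetical.

// Sketch: pre-load the small lookup table into a HashMap in setup(),
// so reduce() does in-memory lookups instead of per-record HBase Gets.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class LookupReducer extends Reducer<Text, Text, Text, Text> {
    private final Map<String, String> t1Cache = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table t1 = conn.getTable(TableName.valueOf("T1"));
             ResultScanner scanner = t1.getScanner(
                     new Scan().addColumn(Bytes.toBytes("CF1"), Bytes.toBytes("C1")))) {
            for (Result r : scanner) {
                t1Cache.put(Bytes.toString(r.getRow()),
                        Bytes.toString(r.getValue(Bytes.toBytes("CF1"), Bytes.toBytes("C1"))));
            }
        }
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String lookedUp = t1Cache.get(key.toString().toLowerCase());  // in-memory, no RPC
        if (lookedUp != null) {
            context.write(key, new Text(lookedUp));
        }
    }
}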

MapReduce Output ArrayWritable

Submitted by ╄→гoц情女王★ on 2020-01-12 14:29:50
Question: I'm trying to get an output from an ArrayWritable in a simple MapReduce task. I found a few questions with a similar problem, but I can't solve the problem in my own code, so I'm looking forward to your help. Thanks :)!

Input: a text file with some sentences.
Output should be: <Word, <length, number of same words in the text file>>
Example: Hello 5 2

The output that I get in my job is:

hello WordLength_V01$IntArrayWritable@221cf05
test WordLength_V01$IntArrayWritable@799e525a

I think the problem is in
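The usual explanation for output like IntArrayWritable@221cf05 is that ArrayWritable does not override toString(), so TextOutputFormat prints the default object identity. A minimal sketch of the standard workaround, a subclass with its own toString(), is shown below; the tab-separated formatting is an assumption about the desired output.

// Sketch: give the array writable a readable toString() so TextOutputFormat
// prints the values instead of the object identity.
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;

public class IntArrayWritable extends ArrayWritable {
    public IntArrayWritable() {
        super(IntWritable.class);            // no-arg constructor required by Hadoop
    }

    public IntArrayWritable(IntWritable[] values) {
        super(IntWritable.class, values);
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (Writable w : get()) {
            sb.append(w.toString()).append('\t');   // e.g. "5" and "2" -> "Hello 5 2" in the output
        }
        return sb.toString().trim();
    }
}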

Python Tutorial: map and reduce

Submitted by 随声附和 on 2020-01-12 08:18:21
1. The map() function

map() takes two arguments: a function and an Iterable. map applies the function to each element of the sequence in turn and returns the results as a new Iterator.

def f(x):
    return x*x
r = map(f, [1,2,3,4,5])
list(r)
Out: [1, 4, 9, 16, 25]

2. The reduce() function

If you want to turn the sequence [1, 2, 3, 4, 5, 6] into the integer 123456, reduce comes in handy:

from functools import reduce
def fn(x, y):
    return x*10+y
reduce(fn, [1,2,3,4,5,6])
Out: 123456

Exercise 1: Use map() to turn irregularly capitalized English names entered by users into properly capitalized names (first letter uppercase, the rest lowercase). Input: ['adam', 'LISA', 'barT'], output: ['Adam', 'Lisa', 'Bart']:

L1 = ['adam', 'LISA', 'barT']
def normalize(name):
    name = name[0].upper() + name[1:].lower()
    return name
L2 = list(map(normalize, L1))
L2
Out: ['Adam', 'Lisa', 'Bart']

Exercise 2

What is the most efficient way to do a sorted reduce in PySpark?

Submitted by 浪子不回头ぞ on 2020-01-12 07:35:28
Question: I am analyzing on-time performance records of US domestic flights from 2015. I need to group by tail number and store a date-sorted list of all the flights for each tail number in a database, to be retrieved by my application. I am not sure which of two options for achieving this is the best one.

# Load the parquet file
on_time_dataframe = sqlContext.read.parquet('../data/on_time_performance.parquet')
# Filter down to the fields we need to identify and link to a flight
flights = on_time

Understanding How MapReduce Works

Submitted by 房东的猫 on 2020-01-12 06:10:48
A first look at MapReduce

The relationship between Hadoop and MapReduce: Hadoop provides a reliable shared storage and analysis system; storage is handled by HDFS and analysis by MapReduce.

Understanding MapReduce: a MapReduce job runs in two phases, the map phase and the reduce phase, and each phase takes key/value pairs as its input and output.

A MapReduce job is the unit of work submitted by the client; it consists of the input data, the MapReduce program and the configuration information. Hadoop runs the job by dividing it into small tasks, namely map tasks and reduce tasks.

Two kinds of nodes control job execution: a jobtracker and a number of tasktrackers. The jobtracker coordinates all jobs running on the system by scheduling tasks to run on tasktrackers. Tasktrackers run the tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it on another tasktracker.

Hadoop divides the input into fixed-size pieces called input splits and creates one map task per split; that task runs the user-defined map function over every record in the split (see the sketch below).

The optimal split size equals the HDFS block size: it is the largest amount of data guaranteed to be stored on a single node. If a split spanned two blocks, part of the data would have to be transferred over the network, which is slower than running the map task entirely on local data…
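To anchor the "one map task per split, running the user-defined map function over each record" step, here is a minimal word-count-style Mapper sketch; the class name and tokenization are illustrative only.

// Minimal mapper sketch: each map task runs this map() over every record of its input split.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // One call per record (here, one line) of the split assigned to this map task.
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // key/value pairs flow on to the shuffle and reduce phase
            }
        }
    }
}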

Pass a Delete or a Put error in hbase mapreduce

Submitted by 放肆的年华 on 2020-01-11 13:34:11
Question: I am getting the error below while running MapReduce on HBase:

java.io.IOException: Pass a Delete or a Put
    at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:125)
    at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:84)
    at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:639)
    at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
    at
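This exception comes from TableOutputFormat when the value written to the context is not a Put or a Delete. Below is a minimal sketch of the usual fix, with placeholder row-key, column-family and qualifier names; the job would also need to be wired to TableOutputFormat (for example via TableMapReduceUtil.initTableReducerJob), which is not shown here.

// Sketch: with TableOutputFormat, the value written to the context must be a Put (or Delete).
import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class HBaseWriteMapper
        extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        byte[] rowKey = Bytes.toBytes(value.toString());
        Put put = new Put(rowKey);
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("1"));
        // Writing anything other than a Put/Delete here is what triggers
        // "java.io.IOException: Pass a Delete or a Put" in TableOutputFormat.
        context.write(new ImmutableBytesWritable(rowKey), put);
    }
}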