MapReduce

How to implement LSH with MapReduce?

此生再无相见时 submitted on 2020-01-03 03:31:09
Question: Suppose we wish to implement Locality-Sensitive Hashing (LSH) with MapReduce. Specifically, assume chunks of the signature matrix consist of columns, and elements are key-value pairs where the key is the column number and the value is the signature itself (i.e., a vector of values). (a) Show how to produce the buckets for all the bands as output of a single MapReduce process. Hint: remember that a Map function can produce several key-value pairs from a single element. (b) Show how another MapReduce
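A sketch of the idea behind part (a), not taken from the original thread: each Map call sees one column's signature and emits one pair per band, keyed by (band number, bucket), so the shuffle gathers every column that collides in the same bucket of the same band. The band count, rows per band, and hash choice below are all assumptions.

import java.util.Arrays;
import java.util.function.BiConsumer;

// Conceptual mapper for part (a); emit() stands in for context.write().
public class LshBandMapper {
    static final int BANDS = 20;        // b (assumed)
    static final int ROWS_PER_BAND = 5; // r (assumed); signature length = b * r

    // Input element: (column number, signature vector).
    static void map(int column, int[] signature, BiConsumer<String, Integer> emit) {
        for (int band = 0; band < BANDS; band++) {
            int[] slice = Arrays.copyOfRange(
                    signature, band * ROWS_PER_BAND, (band + 1) * ROWS_PER_BAND);
            int bucket = Arrays.hashCode(slice); // hash this band's piece of the column
            // Key = (band, bucket): the reducer for one key then receives exactly
            // the columns that fell into that bucket for that band.
            emit.accept(band + ":" + bucket, column);
        }
    }
}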

Date Difference less than 15 minutes in Hive

左心房为你撑大大i submitted on 2020-01-03 03:10:13
Question: Below is my query; in its last line I am trying to check whether the difference between the dates is within 15 minutes, but it fails whenever I run it. SELECT TT.BUYER_ID , COUNT(*) FROM (SELECT testingtable1.buyer_id, testingtable1.item_id, testingtable1.created_time from (select user_id, prod_and_ts.product_id as product_id, prod_and_ts.timestamps as timestamps from testingtable2 LATERAL VIEW explode(purchased_item) exploded_table as prod_and_ts where to_date(from_unixtime(cast(prod
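The query is cut off above, so only as a sketch: the usual way to express "within 15 minutes" in Hive is epoch-second arithmetic via unix_timestamp(). The table and column names below are assumptions, not the question's schema.

-- Keep rows whose two timestamps are at most 15 minutes apart
-- (created_time / timestamps assumed to be timestamps or parseable strings).
SELECT a.buyer_id, COUNT(*)
FROM some_joined_result a
WHERE abs(unix_timestamp(a.created_time) - unix_timestamp(a.timestamps)) <= 15 * 60
GROUP BY a.buyer_id;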

Connection refused to quickstart.cloudera:8020

旧城冷巷雨未停 submitted on 2020-01-03 02:41:08
Question: I'm using the Cloudera quickstart 5.5.0 VirtualBox image and trying to run this in a terminal. As you can see below, there is an exception. I've searched the internet for a solution and found something. 1) Configuring the core-site.xml file: https://datashine.wordpress.com/2014/09/06/java-net-connectexception-connection-refused-for-more-details-see-httpwiki-apache-orghadoopconnectionrefused/ But I can only open this file read-only and haven't been able to change it. It seems I need to be root or the hdfs user
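For reference, a sketch of the core-site.xml entry that usually matters here; this is the stock value on the quickstart VM, and editing /etc/hadoop/conf/core-site.xml does require root (e.g. via sudo):

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://quickstart.cloudera:8020</value>
</property>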

Joins in MapReduce

拜拜、爱过 submitted on 2020-01-02 19:19:10
I. The two ways to join in MR:

1. Reduce-side join (a common interview question). The reduce-side join is the simplest join approach. Its main idea: in the map phase, the map function reads both files, File1 and File2, and to tell the two sources' key/value pairs apart it puts a tag on every record, e.g. tag=1 for records from File1 and tag=2 for records from File2. In other words, the map phase's only real job is to tag records by source file; the shuffle phase then groups them by key automatically. In the reduce phase, the reduce function receives, for each k2, the list of v2 values (with v2 coming from both File1 and File2) and, for that key, joins the File1 data with the File2 data (a Cartesian product); that is, the actual join happens in reduce. This approach has two problems: 1) the map phase does not slim the data down at all, so the shuffle's network transfer and sorting perform poorly; 2) the reduce side computes the product of two sets, which is memory-hungry and easily causes an OOM. (A minimal sketch of the pattern follows below.) My blog post summarizing the reduce-side join: http://www.cnblogs.com/DreamDrive/p/7692042.html

2. Map-side join (another interview question). The reduce-side join exists because the map phase cannot see all the fields needed for the join; that is, the fields for one key may sit in different maps
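Here is the minimal sketch of the reduce-side join from point 1. It is not the code from the linked blog post; the tab-delimited record layout and the file-name test used for tagging are assumptions.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReduceSideJoin {

    // Map phase: tag each record with its source file, emit (joinKey, tag + payload).
    public static class TagMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String file = ((FileSplit) ctx.getInputSplit()).getPath().getName();
            String tag = file.startsWith("File1") ? "1" : "2"; // tag=1: File1, tag=2: File2
            String[] kv = line.toString().split("\t", 2);
            ctx.write(new Text(kv[0]), new Text(tag + "\t" + kv[1]));
        }
    }

    // Reduce phase: separate one key's values by tag, then emit their Cartesian product.
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            List<String> fromFile1 = new ArrayList<>();
            List<String> fromFile2 = new ArrayList<>();
            for (Text v : values) {
                String[] tv = v.toString().split("\t", 2);
                if ("1".equals(tv[0])) fromFile1.add(tv[1]);
                else fromFile2.add(tv[1]);
            }
            // This nested loop is exactly the product that eats memory on skewed keys
            // (problem 2 above).
            for (String a : fromFile1)
                for (String b : fromFile2)
                    ctx.write(key, new Text(a + "\t" + b));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "reduce-side join");
        job.setJarByClass(ReduceSideJoin.class);
        job.setMapperClass(TagMapper.class);
        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0])); // directory holding File1 and File2
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}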

MapReduce example: map-side join

江枫思渺然 submitted on 2020-01-02 19:18:19
Principle: MapReduce offers table-join operations, among them the map-side join, the reduce-side join, and the single-table (self) join; here we discuss the map-side join. A map-side join merges the data before it ever reaches the map function, which is far more efficient than a reduce-side join, because a reduce-side join pushes all of the data through the shuffle, which consumes a lot of resources. 1. When to use a map-side join: one table is very small and the other is very large. The map-side join optimizes exactly this scenario: load the small table entirely into memory and index it by the join key. The large table is the map input; every <key,value> pair fed into map() can then be conveniently joined against the small in-memory dataset. The join results are emitted by key, and after the shuffle phase the reduce side receives data that is already grouped by key and joined. To support shipping the file around, Hadoop provides the DistributedCache class, used as follows: (1) the user calls the static method DistributedCache.addCacheFile() to name the file to be copied; its argument is the file's URI (for a file on HDFS this can look like hdfs://namenode:9000/home/XXX/file, where 9000 is your own configured
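A sketch of the mapper side of this pattern. Note that the DistributedCache class itself is deprecated in newer Hadoop; this sketch uses the equivalent Job.addCacheFile() route with a "#link" fragment, and the tab-delimited record layouts are assumptions.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Driver side (elsewhere): job.addCacheFile(new URI("hdfs://namenode:9000/home/XXX/file#small"));
public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, String> smallTable = new HashMap<>(); // join key -> payload

    @Override
    protected void setup(Context ctx) throws IOException {
        // The cached file is localized on each task node; the "#small" fragment above
        // makes it visible in the task's working directory under the name "small".
        try (BufferedReader in = new BufferedReader(new FileReader("small"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] kv = line.split("\t", 2);
                smallTable.put(kv[0], kv[1]); // build the in-memory index by join key
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        String[] kv = line.toString().split("\t", 2);
        String match = smallTable.get(kv[0]); // probe the in-memory index
        if (match != null)                    // inner join: drop unmatched big-table rows
            ctx.write(new Text(kv[0]), new Text(kv[1] + "\t" + match));
    }
}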

Implementing a join of two files with MapReduce

允我心安 submitted on 2020-01-02 19:17:48
Data layout

customer table:
1 hanmeimei ShangHai 110
2 leilei BeiJing 112
3 lucy GuangZhou 119

order table:
1 1 50
2 1 200
3 3 15
4 3 350
5 3 58
6 1 42
7 1 352
8 2 1135
9 2 400
10 2 2000
11 2 300

MAPJOIN
Scenario: we simulate having one small table and one large table; customer is the small table and order is the large one.
Approach: load the smaller dataset straight into memory and index it by the join key; the large dataset is the MapTask's input, and every input to the map() method is matched directly against the in-memory data, with the join results then emitted by key. This method uses Hadoop's DistributedCache to distribute the small dataset to every compute node; every node running a map task must load that data into memory and index it by the join key.
Environment setup: since we are working locally, a local Hadoop is required:
1. Download Hadoop.
2. Unpack it to a directory (remember it, you will need it soon) and set the machine's environment variables.
If you are a beginner, create a Java project (the steps are easy to find online), then create a mapjoin package and a JoinDemo class inside it, and then, as in line 24 of the code below
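The post's own code (the "line 24" it refers to) is not included above. As a stand-in, here is a sketch of the driver for this map join, with the customer file distributed via the cache and order as the map input; the paths and the mapper class (as sketched in the previous entry) are assumptions.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JoinDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map join: customer x order");
        job.setJarByClass(JoinDemo.class);
        job.setMapperClass(MapSideJoinMapper.class); // mapper from the previous entry's sketch
        job.setNumReduceTasks(0);                    // map-only: the join finishes inside map()
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // The small table rides along in the distributed cache, readable as "small".
        job.addCacheFile(new URI("file:///tmp/customer.txt#small"));
        FileInputFormat.addInputPath(job, new Path("/tmp/order.txt")); // big table = map input
        FileOutputFormat.setOutputPath(job, new Path("/tmp/join-out"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}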

3. Testing Yarn and MapReduce on Hadoop

丶灬走出姿态 submitted on 2020-01-02 19:03:32
Testing Yarn and MapReduce on Hadoop
1. Configure Yarn
(1) Configure the ResourceManager
In a production environment, a separate machine is usually set up as the ResourceManager; here we let the Master machine stand in for it.
Edit yarn-site.xml: <?xml version="1.0"?> <!-- Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific
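The post's actual property list is cut off after the license header. As a guess at the minimum it would contain for this setup (the hostname value is an assumption matching the Master machine above):

<configuration>
  <!-- where the ResourceManager runs; here the Master machine stands in -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>Master</value>
  </property>
  <!-- required so NodeManagers can serve the MapReduce shuffle -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>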

Hadoop MapReduce job starts but can not find Map class?

孤人 submitted on 2020-01-02 18:56:33
Question: My MapReduce app counts the usage of field values in a Hive table. I managed to build and run it from Eclipse after including all jars from the /usr/lib/hadoop, /usr/lib/hive and /usr/lib/hcatalog directories. It works. After many frustrations I have also managed to compile and run it as a Maven project: <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0
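The pom is truncated above. For a job built against Hadoop and HCatalog it would plausibly carry dependencies along these lines; the artifact versions are placeholders, not taken from the question:

<dependencies>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.6.0</version>      <!-- placeholder version -->
    <scope>provided</scope>       <!-- the cluster supplies these jars at run time -->
  </dependency>
  <dependency>
    <groupId>org.apache.hive.hcatalog</groupId>
    <artifactId>hive-hcatalog-core</artifactId>
    <version>1.1.0</version>      <!-- placeholder version -->
    <scope>provided</scope>
  </dependency>
</dependencies>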

How to tell the task tracker that my mapper is still running fine, as opposed to generating a timeout?

半世苍凉 submitted on 2020-01-02 17:25:32
Question: I forgot which API/method to call, but my problem is this: my mapper will run for more than 10 minutes, and I don't want to increase the default timeout. Instead, I want my mapper to send an update ping to the task tracker when it is in the particular code path that takes more than 10 minutes. Please let me know which API/method to call. Answer 1: You can simply increment a counter and call progress. This ensures that the task sends a heartbeat back to the tasktracker so it knows the task is alive. In the new
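A minimal illustration of that answer, assuming the new (org.apache.hadoop.mapreduce) API, where the Context handed to map() exposes both counters and progress(); the work-chunking and counter names are assumptions:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class KeepAliveMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private static final int WORK_CHUNKS = 1000; // assumed partitioning of the long job

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        for (int i = 0; i < WORK_CHUNKS; i++) {
            doExpensiveWork(value, i);                         // the >10-minute code path
            ctx.getCounter("keepalive", "ticks").increment(1); // counter update reaches the framework
            ctx.progress();                                    // explicit "still alive" signal
        }
    }

    private void doExpensiveWork(Text v, int chunk) { /* placeholder for the slow step */ }
}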