MapReduce

Large data - storage and query [closed]

Submitted by 匆匆过客 on 2019-12-10 23:07:55
Question (closed 7 months ago as needing more focus; not currently accepting answers). We have a huge dataset of about 300 million records, which will be updated every 3 to 6 months. We need to query this data (continuously, in real time) to get some information. What are the options: an RDBMS (MySQL), or some other option like Hadoop? Which would be better? Answer 1: 300M records is

Cloudant index: count number of unique users per time period

Submitted by 人走茶凉 on 2019-12-10 22:12:13
Question: A very similar post was made about this issue here. In Cloudant, I have a document structure storing when users access an application, which looks like the following: {"username":"one","timestamp":"2015-10-07T15:04:46Z"} ---| same day {"username":"one","timestamp":"2015-10-07T19:22:00Z"} ---^ {"username":"one","timestamp":"2015-10-25T04:22:00Z"} {"username":"two","timestamp":"2015-10-07T19:22:00Z"} What I want is to count the number of unique users for a given time period. Ex: 2015-10-07 =
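In Cloudant this would normally be done with a map/reduce view, but the grouping logic itself can be sketched in plain Python. The sample records below are hypothetical, shaped like the documents in the question; the day is taken as the date prefix of the ISO timestamp.

```python
from collections import defaultdict

# Hypothetical access records shaped like the documents in the question.
records = [
    {"username": "one", "timestamp": "2015-10-07T15:04:46Z"},
    {"username": "one", "timestamp": "2015-10-07T19:22:00Z"},
    {"username": "one", "timestamp": "2015-10-25T04:22:00Z"},
    {"username": "two", "timestamp": "2015-10-07T19:22:00Z"},
]

def unique_users_per_day(records):
    """Group records by calendar day and count distinct usernames."""
    users_by_day = defaultdict(set)
    for rec in records:
        day = rec["timestamp"][:10]  # "YYYY-MM-DD" prefix of the ISO timestamp
        users_by_day[day].add(rec["username"])
    return {day: len(users) for day, users in users_by_day.items()}

print(unique_users_per_day(records))
# → {'2015-10-07': 2, '2015-10-25': 1}
```

The same idea maps onto a Cloudant view: emit the day as the key and deduplicate usernames in the reduce step.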

Running any Hadoop command fails after enabling security

Submitted by 末鹿安然 on 2019-12-10 21:29:48
Question: I was trying to enable Kerberos on my CDH 4.3 test bed (via Cloudera Manager). After changing authentication from Simple to Kerberos in the web UI, I'm unable to perform any Hadoop operations, as shown below. Is there any way to specify the keytab explicitly? [root@host-dn15 ~]# su - hdfs -bash-4.1$ hdfs dfs -ls / 13/09/10 08:15:35 ERROR security.UserGroupInformation: PriviledgedActionException as:hdfs (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by

Is it possible to get map-reduce progress notifications in Mongo?

Submitted by 狂风中的少年 on 2019-12-10 21:10:56
Question: Map-reduce is slow in Mongo; that is a given. So I am wondering if it is possible to receive map-reduce progress notifications. Thanks. Answer 1: I don't know of any built-in feature. You could, however, run db.currentOp() every once in a while in a separate script, read the map-reduce progress, and notify interested parties. This is an example of what I can see: > db.currentOp() { "inprog" : [ { "opid" : 249198781, "active" : true, "lockType" : "read", "waitingForLock" : false, "secs_running" : 14
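The polling approach in the answer can be sketched in Python. For long-running map-reduce operations, currentOp() documents carry a human-readable progress string in the msg field (e.g. "m/r: (1/3) emit phase ..."); the snapshot below uses hypothetical values, and in practice it would come from running db.currentOp() against the server.

```python
def mapreduce_progress(inprog):
    """Extract (opid, progress message) pairs for running map-reduce
    operations from a currentOp()-style "inprog" list."""
    report = []
    for op in inprog:
        if op.get("active") and "m/r" in op.get("msg", ""):
            report.append((op["opid"], op["msg"]))
    return report

# A currentOp()-style snapshot (hypothetical values modeled on the output above).
snapshot = [
    {"opid": 249198781, "active": True, "lockType": "read",
     "msg": "m/r: (1/3) emit phase 6524/19437 33%"},
    {"opid": 249198799, "active": False, "msg": ""},
]

print(mapreduce_progress(snapshot))
# → [(249198781, 'm/r: (1/3) emit phase 6524/19437 33%')]
```

Run on a timer, this gives the "notify concerned parties" loop the answer describes without any server-side support.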

The Execution Flow of MapReduce on YARN

Submitted by 只愿长相守 on 2019-12-10 21:07:33
1. Overview

YARN is a resource-scheduling platform responsible for providing server compute resources to application programs. It acts like a distributed operating-system platform, while computation programs such as MapReduce are like applications running on top of that operating system. YARN's architecture is shown in the figure below. As the diagram shows, it consists mainly of the following components: ResourceManager, NodeManager, ApplicationMaster, and Container.

1) ResourceManager: The core of YARN's layered structure is the ResourceManager. This entity controls the whole cluster and manages the allocation of applications onto the underlying compute resources. The ResourceManager assigns the individual resources (compute, memory, bandwidth, and so on) to the underlying NodeManagers. It also allocates resources together with the ApplicationMaster, and works with the NodeManagers to start and monitor their underlying applications. In summary, the RM has the following functions: (1) handling client requests; (2) starting and monitoring the ApplicationMaster; (3) monitoring the NodeManagers; (4) resource allocation and scheduling.

2) ApplicationMaster: The ApplicationMaster manages each application running inside YARN. It is responsible for negotiating resources from the RM and, through the NodeManagers, monitoring container execution and resource usage (allocation of CPU, memory, and so on). Overall, the AM has the following roles: (1) splitting the input data; (2

Not executing my Hadoop mapper class while parsing XML in Hadoop using XMLInputFormat

Submitted by ▼魔方 西西 on 2019-12-10 21:03:19
Question: I am new to Hadoop, using version 2.6.0, and trying to parse a complex XML file. After searching for a while, I learned that for XML parsing we need to write a custom InputFormat, namely Mahout's XMLInputFormat. I also took help from this example. But when I run my code after passing the XMLInputFormat class, it does not call my own Mapper class, and the output file has 0 data in it if I use the XMLInputFormat given in the example. Surprisingly, if I do not pass my XMLInputFormat

Apache Impala Concepts

Submitted by 女生的网名这么多〃 on 2019-12-10 20:53:22
Apache Impala

Impala basics: Impala is a high-efficiency SQL query tool provided by Cloudera, offering real-time query results. Official tests put its performance at 10 to 100 times faster than Hive, and its SQL queries are even faster than Spark SQL; it is billed as the fastest SQL query tool in big data today. Impala is an implementation modeled on Dremel from Google's "new" three papers (Caffeine, a web search engine; Pregel, distributed graph computation; Dremel, an interactive analysis tool), while the "old" three papers (BigTable, GFS, MapReduce) correspond respectively to HBase, which we are about to study, and HDFS and MapReduce, which we have already covered. Impala is based on Hive and computes in memory; it also covers data-warehouse workloads, with the advantages of real-time queries, batch processing, and high concurrency.

The relationship between Impala and Hive: Impala is a big-data analytical query engine based on Hive. It directly uses Hive's metadata database (metastore), which means all of Impala's metadata is stored in Hive's metastore, and Impala is compatible with the vast majority of Hive's SQL syntax. So to install Impala you must first install Hive, make sure Hive is installed successfully, and also start Hive's metastore service. Hive metadata contains the meta-information about databases, tables, and so on created with Hive; it is stored in a relational database such as Derby or MySQL. Clients connect to the metastore service

How to calculate count and unique count over two fields in a Mongo reduce function

Submitted by 血红的双手。 on 2019-12-10 20:39:52
Question: I have a link-tracking table that has (amongst other fields) track_redirect and track_userid. I would like to output both the total count for a given link and also the unique count, counting duplicates by user id, so we can differentiate whether someone has clicked the same link 5 times. I've tried emitting this.track_userid in both the key and the values parts, but can't get to grips with how to access them correctly in the reduce function. So if I roll back to when it actually worked, I have
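The usual pattern here is to key on the link and carry the user id inside the value, so reduce can both sum and deduplicate. A pure-Python sketch of that map/reduce shape (the click rows and field values are hypothetical, matching the two fields in the question):

```python
from collections import defaultdict

# Hypothetical click-tracking rows with the two fields from the question.
clicks = [
    {"track_redirect": "/offer", "track_userid": "u1"},
    {"track_redirect": "/offer", "track_userid": "u1"},
    {"track_redirect": "/offer", "track_userid": "u2"},
    {"track_redirect": "/home",  "track_userid": "u3"},
]

def map_fn(doc):
    # Key on the link; put the user id in the value, not the key,
    # so the reduce step can count totals AND deduplicate users.
    return doc["track_redirect"], {"count": 1, "users": {doc["track_userid"]}}

def reduce_fn(values):
    # Must be associative: merge partial counts and partial user sets.
    merged = {"count": 0, "users": set()}
    for v in values:
        merged["count"] += v["count"]
        merged["users"] |= v["users"]
    return merged

grouped = defaultdict(list)
for doc in clicks:
    key, value = map_fn(doc)
    grouped[key].append(value)

results = {k: reduce_fn(vs) for k, vs in grouped.items()}
for link, r in sorted(results.items()):
    print(link, "total:", r["count"], "unique:", len(r["users"]))
# → /home total: 1 unique: 1
# → /offer total: 3 unique: 2
```

In Mongo's JavaScript reduce the user "set" would be an object keyed by user id, merged the same way; the unique count is then the number of keys, computed in a finalize step.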

Unable to generate jar file for Hadoop

Submitted by 一笑奈何 on 2019-12-10 20:29:15
Question: I have 16 Java files and I am trying to build a JAR for the Hadoop ecosystem using the command below: javac -classpath /usr/local/hadoop/hadoop-core-1.0.3.jar:/usr/local/hadoop/lib/commons-cli-1.2.jar JsonV.java JsonV.java is the class that has the main function, and this Java file calls the other Java files. I am getting the error below; can anybody help me resolve it? JsonV.java:37: error: cannot find symbol JSONObject obj = new JSONObject(tuple[i]); ^ symbol: class JSONObject

When does an action not run on the driver in Apache Spark?

Submitted by 孤街浪徒 on 2019-12-10 20:18:39
Question: I have just started with Spark and was struggling with the concept of tasks. Can anyone please help me understand when an action (say, reduce) does not run in the driver program? From the Spark tutorial: "Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel." I'm currently experimenting with an application which reads a directory on 'n'
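The commutative/associative requirement exists because Spark reduces each partition locally on the executors and only combines the partial results afterwards; the driver never sees the raw elements. A minimal pure-Python sketch of why that matters (no Spark needed; the partitioning here is hypothetical):

```python
from functools import reduce

def add(a, b):          # commutative and associative
    return a + b

def sub(a, b):          # NOT commutative or associative
    return a - b

data = [1, 2, 3, 4, 5, 6]

# Spark-style evaluation: reduce each partition locally ("on the executors"),
# then combine the partial results ("on the driver").
partitions = [data[0:2], data[2:4], data[4:6]]

add_partials = [reduce(add, p) for p in partitions]   # [3, 7, 11]
# Partitioned and sequential results agree for a well-behaved function.
assert reduce(add, add_partials) == reduce(add, data) == 21

sub_partials = [reduce(sub, p) for p in partitions]   # [-1, -1, -1]
# With subtraction the partitioned result diverges from the sequential one.
print(reduce(sub, sub_partials), "vs", reduce(sub, data))
# → 1 vs -19
```

This is exactly why the tutorial's wording matters: a non-associative func gives partition-dependent answers, so Spark's parallel reduce would silently return the wrong result.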