MapReduce

How to use mapReduce in mongoose/MongoDB to query a subdocument?

Submitted by 只谈情不闲聊 on 2019-12-21 02:58:11
Question: I implemented a simple message system in mongoose/MongoDB; the schema is the following:

var schema = new mongoose.Schema({
    user: {type: String, required: true},
    updated: {type: Date, default: new Date()},
    msgs: [
        {
            m: String,  // the message itself
            d: Date,    // date of the message
            s: String,  // message sender
            r: Boolean  // read or not
        }
    ],
});

All the messages are stored in the msgs nested array. Now I want to query the messages from a certain sender, for example, { "_id" : ObjectId("52c7cbe6d72ecb07f9bbc148"),
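MongoDB's mapReduce is usually overkill for this; a minimal sketch of the more common approach, using the aggregation pipeline to pull one sender's messages out of the msgs array (the Message model name and the 'someSender' value are assumptions for illustration):

var Message = mongoose.model('Message', schema); // hypothetical model name

Message.aggregate([
    { $match: { 'msgs.s': 'someSender' } },  // only documents containing that sender
    { $unwind: '$msgs' },                    // one pipeline document per embedded message
    { $match: { 'msgs.s': 'someSender' } },  // keep only that sender's messages
    { $sort: { 'msgs.d': -1 } }              // newest first
], function (err, results) {
    if (err) return console.error(err);
    console.log(results); // each result carries a single msgs subdocument
});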

How to set the VCORES in Hadoop MapReduce/YARN?

Submitted by 倾然丶 夕夏残阳落幕 on 2019-12-21 02:42:27
Question: The following is my configuration:

mapred-site.xml
    map-mb: 4096, opts: -Xmx3072m
    reduce-mb: 8192, opts: -Xmx6144m

yarn-site.xml
    resource memory-mb: 40GB
    min allocation-mb: 1GB

The VCores shown for my Hadoop cluster is 8, but I don't know how that number is computed or where to configure it. I hope someone can help me.

Answer 1: Short answer: it most probably doesn't matter if you are just running Hadoop out of the box on your single-node cluster, or even on a small personal distributed cluster. You
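For reference, vcore capacity is configured per NodeManager rather than derived from memory; a hedged sketch of the relevant properties (the values are illustrative, not recommendations):

<!-- yarn-site.xml: how many vcores each NodeManager advertises -->
<property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>8</value>
</property>
<!-- yarn-site.xml: the largest vcore request the scheduler will grant -->
<property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>4</value>
</property>

<!-- mapred-site.xml: vcores requested per map/reduce task container -->
<property>
    <name>mapreduce.map.cpu.vcores</name>
    <value>1</value>
</property>
<property>
    <name>mapreduce.reduce.cpu.vcores</name>
    <value>1</value>
</property>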

Permutations with MapReduce

Submitted by 一笑奈何 on 2019-12-21 02:41:38
Question: Is there a way to generate permutations with MapReduce?

Input file:
1 title1
2 title2
3 title3

My goal:
1,2 title1,title2
1,3 title1,title3
2,3 title2,title3

Answer 1: Since a file with n inputs yields on the order of n^2 output pairs, it makes sense to have n tasks each perform n of those operations. I believe you could do this (assuming only one file): put your input file into the DistributedCache so it is accessible read-only to your mappers/reducers. Make an input split on
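A hedged sketch of that idea: each map() call sees one input line and pairs it against a cached copy of the whole file, emitting each unordered pair once (the PairMapper class name and the titles.txt cache-file name are assumptions):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper; input lines look like "1 title1".
public class PairMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final List<String[]> cached = new ArrayList<>();

    @Override
    protected void setup(Context context) throws IOException {
        // Assumes the input file was added with job.addCacheFile(...) and
        // symlinked as "titles.txt"; adjust to your actual cache setup.
        try (BufferedReader in = new BufferedReader(new FileReader("titles.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                cached.add(line.split("\\s+", 2)); // [id, title]
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] current = value.toString().split("\\s+", 2);
        for (String[] other : cached) {
            // Emit each unordered pair exactly once by pairing only with larger ids.
            if (current[0].compareTo(other[0]) < 0) {
                context.write(new Text(current[0] + "," + other[0]),
                              new Text(current[1] + "," + other[1]));
            }
        }
    }
}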

How can I partition a table with Hive?

Submitted by 走远了吗. on 2019-12-21 01:42:31
Question: I've been playing with Hive for a few days now, but I still have a hard time with partitioning. I've been recording Apache logs (combined format) in Hadoop for a few months. They are stored in row text format, partitioned by date (via Flume): /logs/yyyy/mm/dd/hh/*

Example:
/logs/2012/02/10/00/Part01xx (02/10/2012 12:00 am)
/logs/2012/02/10/00/Part02xx
/logs/2012/02/10/13/Part0xxx (02/10/2012 01:00 pm)

The date in the combined log file follows this format: [10/Feb/2012:00:00:00 -0800]. How can I
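One common pattern for a directory layout like this is an external table whose partitions are mapped onto the existing paths; a hedged HiveQL sketch (the table name and column layout are assumptions):

-- Hypothetical external table; each log line is kept as a single string here,
-- so no data is moved or rewritten when partitions are added.
CREATE EXTERNAL TABLE apache_logs (line STRING)
PARTITIONED BY (year STRING, month STRING, day STRING, hour STRING);

-- Register one existing directory per partition (repeat per hour, or generate
-- these statements with a script).
ALTER TABLE apache_logs ADD PARTITION (year='2012', month='02', day='10', hour='00')
LOCATION '/logs/2012/02/10/00';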

Programmatically reading the output of Hadoop Mapreduce Program

Submitted by 99封情书 on 2019-12-21 01:15:49
Question: This may be a basic question, but I could not find an answer for it on Google. I have a map-reduce job that creates multiple output files in its output directory. My Java application executes this job on a remote Hadoop cluster, and after the job is finished it needs to read the output programmatically using the org.apache.hadoop.fs.FileSystem API. Is that possible? The application knows the output directory, but not the names of the output files generated by the map-reduce job. It seems there is no
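A hedged sketch of the usual approach: list the output directory with the FileSystem API and open every part-* file, skipping _SUCCESS and other markers (the outputDir value is an assumption):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadJobOutput {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path outputDir = new Path("hdfs://namenode:8020/user/me/job-output"); // assumed path
        FileSystem fs = outputDir.getFileSystem(conf);

        for (FileStatus status : fs.listStatus(outputDir)) {
            // Reducer output files are conventionally named part-r-00000, part-m-00000, etc.
            if (!status.isFile() || !status.getPath().getName().startsWith("part-")) {
                continue; // skips _SUCCESS and any side files
            }
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(status.getPath())))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }
}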

Where is Sort used in MapReduce phase and why?

Submitted by 风格不统一 on 2019-12-21 01:06:39
Question: I am new to Hadoop. It is not clear to me why we need to be able to sort by keys when using Hadoop MapReduce. After the map phase, the data corresponding to each unique key needs to be distributed to some number of reducers. This can be done without having to sort it, right?

Answer 1: It is there because sorting is a neat trick to group your keys. Of course, if your job or algorithm does not need any ordering of your keys, then you will be faster grouping by some hashing trick. In Hadoop itself,
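Hadoop's Java API makes the distinction visible: the sort order of keys and the grouping of sorted keys into reduce() calls are separately pluggable. A minimal sketch of a grouping comparator, assuming composite keys of the form "user#timestamp" (the class name and key format are made up):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Sorting still orders the full "user#timestamp" keys; this comparator makes
// all keys sharing the same user part land in one reduce() call.
public class NaturalKeyGroupingComparator extends WritableComparator {
    protected NaturalKeyGroupingComparator() {
        super(Text.class, true); // true: instantiate keys so compare() sees real objects
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String userA = a.toString().split("#", 2)[0];
        String userB = b.toString().split("#", 2)[0];
        return userA.compareTo(userB);
    }
}

// Wiring in the driver:
// job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);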

A Summary of Sorting in MapReduce

Submitted by 為{幸葍}努か on 2019-12-21 00:08:11
[1] Hadoop's default sort only sorts on the key, in dictionary (lexicographic) order.

[2] Secondary sort: within one data file, sort first by key, and for equal keys then by value. The difficulty is that two columns have to be compared at the same time; the two column values of a row can be wrapped into a bean that implements the WritableComparable interface and overrides compareTo to specify the comparison rule, which yields the secondary sort (see the referenced blog post for details, and the sketch below).

[3] Global (total) sort:
1. Use a single Reducer. Advantage: simple to implement. Disadvantage: the cluster's parallelism goes unused.
2. Override the Partitioner class: send all keys in a given range to one fixed Reducer, so that keys are fully sorted within each Reducer and the Reducers are themselves ordered by index. For example, if the key is an age, the data can be spread over 10 Reducers: ages 1-10 go to Reducer 0, ages 11-20 to Reducer 1, and so on. This has two drawbacks: with a large data volume you can run out of memory (OOM), and the data can be badly skewed across Reducers.
3. The TotalOrderPartitioner class: Hadoop provides TotalOrderPartitioner to implement a global sort while avoiding the OOM and data-skew problems above. It comes with input samplers that sample part of the key space and use the sample to find the best split points between keys.
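A minimal sketch of the bean described in [2], assuming both columns are ints (the PairBean name and its fields are made up):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key: sort by first ascending, then by second ascending.
public class PairBean implements WritableComparable<PairBean> {
    private int first;   // the original key column
    private int second;  // the original value column

    public void set(int first, int second) {
        this.first = first;
        this.second = second;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
    }

    @Override
    public int compareTo(PairBean o) {
        int cmp = Integer.compare(first, o.first);                  // primary: key column
        return cmp != 0 ? cmp : Integer.compare(second, o.second);  // secondary: value column
    }
}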

What does the NameNode store?

Submitted by 安稳与你 on 2019-12-20 15:30:43
Question: In the case of the NameNode, what gets stored in main memory and what gets stored in secondary memory (hard disk)? What do we mean by "file to block mapping"? What exactly are fsimage and the edit logs?

Answer 1: "In the case of the NameNode, what gets stored in main memory and what gets stored in secondary memory (hard disk)?" The file-to-block mapping, the locations of blocks on data nodes, the set of active data nodes, and a bunch of other metadata are all stored in memory on the NameNode. When you check the NameNode status

Writing MapReduce programs with Eclipse on Linux

Submitted by 冷暖自知 on 2019-12-20 15:27:44
Set up the environment, make the configuration take effect, and start the cluster.

3. Using Eclipse:

(1) Create the project: File - New - Java Project, fill in the Project name, and choose Next. Import the JARs via Libraries - Add External JARs; under usr/local/hadoop/share/hadoop, add:

- from common: the nfs JAR and common-2.7.1.jar
- everything under common/lib
- the last three JARs under hdfs
- under mapreduce: everything from the third JAR onwards
- everything under mapreduce/lib
- under yarn: everything from the fourth JAR onwards

That completes the imports; click Finish.

(2) Write the Java application: right-click the newly created project Dedup and choose New - Class. Enter the class name under Name, then choose Finish. Clicking the created .java file opens it for editing. Fill in the code:
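The tutorial stops where the code goes in; a hedged sketch of what a minimal Dedup (duplicate-removal) job could look like, matching the project name above (the class layout and argument handling are assumptions):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Classic dedup: every input line becomes a key; the shuffle merges duplicates.
public class Dedup {

    public static class DedupMapper extends Mapper<Object, Text, Text, NullWritable> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(value, NullWritable.get()); // the line itself is the key
        }
    }

    public static class DedupReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            context.write(key, NullWritable.get()); // each distinct line is written once
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "dedup");
        job.setJarByClass(Dedup.class);
        job.setMapperClass(DedupMapper.class);
        job.setCombinerClass(DedupReducer.class);
        job.setReducerClass(DedupReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Run with an HDFS input directory and a not-yet-existing output directory; each distinct input line appears exactly once in the output.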

Using a CouchDB view, can I count groups and filter by key range at the same time?

Submitted by 坚强是说给别人听的谎言 on 2019-12-20 14:15:20
Question: I'm using CouchDB. I'd like to be able to count occurrences of values of specific fields within a date range that can be specified at query time. I seem to be able to do parts of this, but I'm having trouble understanding the best way to pull it all together. Assume documents that have a timestamp field and another field, e.g.:

{ date: '20120101-1853', author: 'bart' }
{ date: '20120102-1850', author: 'homer' }
{ date: '20120103-2359', author: 'homer' }
{ date: '20120104-1200', author: 'lisa'
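A hedged sketch of one common compromise: key the view on [author, date] with a _count reduce, then issue one range query per author at group_level=1 (the design-document and view names are made up; with a built-in reduce, a single query cannot both group by author and apply an arbitrary date range):

// _design/stats (hypothetical design document)
{
  "views": {
    "by_author_date": {
      "map": "function (doc) { if (doc.author && doc.date) { emit([doc.author, doc.date], 1); } }",
      "reduce": "_count"
    }
  }
}

// One request per author; the key range constrains the date part of the key:
// GET /db/_design/stats/_view/by_author_date?group_level=1
//       &startkey=["homer","20120101-0000"]&endkey=["homer","20120105-0000"]
// -> {"rows":[{"key":["homer"],"value":2}]}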