Hadoop

A Kafka Case Study

偶尔善良 submitted on 2021-02-20 16:43:28
Suppose we are building a multiplayer online game. Players cooperate or compete in a virtual world, and they frequently trade with one another, exchanging money, items, and so on. The developers must therefore guard against cheating, with the following rule: a transaction is flagged as suspicious if its amount is significantly higher than normal, or if the player's login IP address differs from those of the previous 20 logins. Besides flagging transactions in real time, we also want to export this data to Apache Hadoop so that data scientists can train and test their algorithms and models.

To flag events efficiently in real time, we make the most of the game servers' memory. The system consists of multiple game servers, so the design keeps each user's 20 most recent login records and 20 most recent transaction details in memory, distributed across the servers.

A game server plays two distinct roles: receiving and propagating user events, and processing transactions in real time to flag suspicious ones. To perform the second role efficiently, any given user's transaction history must reside in the memory of a single server. This means messages must be passed between servers, since the server that receives a user's event does not necessarily hold that user's transaction history. To keep the two roles loosely coupled, we use Kafka to move messages between servers.

Kafka's properties fit these requirements well: it is scalable, partitions data, offers low latency, and can serve a large number of heterogeneous consumers. In this case we define a single topic for both logins and transactions; we use one topic mainly because we want the user's login information to be available before the corresponding transaction events are processed.
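The routing described above, where every event for a given user ends up on the same server, is exactly what Kafka's key-based partitioning provides. The mechanism can be sketched in a few lines of plain Python (the partition count and hash function are illustrative assumptions, not the article's actual code):

```python
# Sketch of key-based partitioning, the mechanism Kafka uses to route
# all events carrying the same key (here: user id) to one partition,
# and therefore to one consuming server.
import hashlib

NUM_PARTITIONS = 8  # hypothetical partition count for the shared topic

def partition_for(user_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Deterministically map a user id to a partition index."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# A login event and a later trade event for the same user hash to the
# same partition, so a single server sees the user's full history.
assert partition_for("player-42") == partition_for("player-42")
```

Because both logins and trades share one topic and one keying scheme, the consumer that holds a user's last 20 logins is guaranteed to also receive that user's transactions.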

Hadoop Capacity Scheduler and Spark

╄→гoц情女王★ submitted on 2021-02-20 04:22:05
Question: If I define CapacityScheduler queues in YARN as explained here http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html, how do I make Spark use them? I want to run Spark jobs, but they should not take over the whole cluster; instead they should run in a CapacityScheduler queue with a fixed set of resources allocated to it. Is that possible, specifically on the Cloudera platform (given that Spark on Cloudera runs on YARN)? Answer 1: You should configure the
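For reference, Spark on YARN is pinned to a capacity-scheduler queue with the `--queue` flag of `spark-submit` (equivalently the `spark.yarn.queue` property). A minimal command sketch; the queue name "analytics" is an assumed queue defined in capacity-scheduler.xml, not one from the question:

```shell
# Submit to a specific CapacityScheduler queue instead of root.default.
spark-submit \
  --master yarn \
  --queue analytics \
  my_job.py
```

The job then competes only for the resources guaranteed to that queue, which is the containment behavior the question asks about.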

JQ, Hadoop: taking command from a file

房东的猫 submitted on 2021-02-19 08:33:00
Question: I have been enjoying the powerful filters provided by JQ (Doc). Twitter's public API gives nicely formatted JSON files. I have access to a large amount of them, and I have access to a Hadoop cluster. So instead of loading them in Pig using Elephantbird, I decided to try out JQ as a streaming mapper to see if it is any faster. Here is my final query: nohup hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.5.1.jar\ -files $HOME/bin/jq \ -D mapreduce.map.memory.mb=2048\ -D
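What makes this work is the Hadoop streaming contract: the mapper is any executable that reads records on stdin and writes results to stdout, which is why a jq filter can stand in for a Java mapper. A stdlib-only Python equivalent of a field-extracting filter, to illustrate the contract (the `text` field is a hypothetical example, not taken from the question):

```python
# A streaming-style mapper: consume JSON records line by line and emit
# one field per record, the same stdin-to-stdout contract that lets
# jq act as a hadoop-streaming mapper.
import json

def map_records(lines):
    """Yield the 'text' field of each JSON record, skipping bad lines."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # drop lines that are not valid JSON
        if "text" in record:
            yield record["text"]

# In a real job the mapper is driven by stdin:
#   for value in map_records(sys.stdin): print(value)
sample = ['{"text": "hello"}', '{"retweeted": true}']
print(list(map_records(sample)))  # → ['hello']
```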

How to open HDFS output file using gedit?

心不动则不痛 submitted on 2021-02-19 06:35:17
Question: I have installed and executed a MapReduce program successfully on my system (Ubuntu 14.04). I can see the output files as:
hadoopuser@arul-PC:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hadoopuser/MapReduceSample-output
Found 3 items
-rw-r--r--   1 hadoopuser supergroup      0 2014-07-09 16:10 /user/hadoopuser/MapReduceSample-output/_SUCCESS
drwxr-xr-x   - hadoopuser supergroup      0 2014-07-09 16:10 /user/hadoopuser/MapReduceSample-output/_logs
-rw-r--r--   1 hadoopuser supergroup 880838 2014-07-09 16:10
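gedit cannot read HDFS paths directly; the usual route is to copy the output file to the local filesystem first, or to print it without copying. A command sketch using the output directory above (the `part-00000` file name is the typical reducer-output default, not confirmed by the truncated listing):

```shell
# Copy the reducer output out of HDFS, then open it with a local editor.
bin/hadoop dfs -get /user/hadoopuser/MapReduceSample-output/part-00000 /tmp/output.txt
gedit /tmp/output.txt

# Or just view the file in place, without copying:
bin/hadoop dfs -cat /user/hadoopuser/MapReduceSample-output/part-00000
```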

Optimizing Hive GROUP BY when rows are sorted

北城余情 submitted on 2021-02-19 05:31:39
Question: I have the following (very simple) Hive query: select user_id, event_id, min(time) as start, max(time) as end, count(*) as total, count(interaction == 1) as clicks from events_all group by user_id, event_id; The table has the following structure:
user_id               event_id                time           interaction
Ex833Lli36nxTvGTA1Dv  juCUv6EnkVundBHSBzQevw  1430481530295  0
Ex833Lli36nxTvGTA1Dv  juCUv6EnkVundBHSBzQevw  1430481530295  1
n0w4uQhOuXymj5jLaCMQ  G+Oj6J9Q1nI1tuosq2ZM/g  1430512179696  0
n0w4uQhOuXymj5jLaCMQ  G
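One detail worth flagging in the query as written: `count(interaction == 1)` counts every row where the expression is non-NULL, not only the rows where it is true, so it will not count clicks. The common Hive idiom for a conditional count is a SUM over an IF; a corrected sketch of the same query (assuming `interaction` is a 0/1 column, as the sample rows suggest):

```sql
-- count(expr) tallies non-NULL rows, so a conditional count needs SUM/IF.
SELECT user_id,
       event_id,
       MIN(time)  AS start,
       MAX(time)  AS end,
       COUNT(*)   AS total,
       SUM(IF(interaction = 1, 1, 0)) AS clicks
FROM events_all
GROUP BY user_id, event_id;
```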

Hadoop2- YARN - ApplicationMaster UI - Connection refused issue

浪尽此生 submitted on 2021-02-19 05:26:45
Question: I'm getting the issue below while accessing the ApplicationMaster UI from the RM web UI (Hadoop 2.6.0). There is no standalone WebProxy server running; the proxy runs as part of the ResourceManager. "HTTP ERROR 500 Problem accessing /proxy/application_1431357703844_0004/. Reason: Connection refused" Log entries in the ResourceManager logs: 2015-05-11 19:25:01,837 INFO webproxy.WebAppProxyServlet (WebAppProxyServlet.java:doGet(330)) - ubuntu is accessing unchecked http://slave1:51704/ which is the app
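With the proxy embedded in the ResourceManager, "Connection refused" on `/proxy/...` generally means the RM process could not open a connection to the ApplicationMaster's advertised host and port (here `slave1:51704`), so the hostname resolution and firewall rules between the RM and the node running the AM are the first things to check. One way to isolate the proxy itself is to run it as its own daemon by setting `yarn.web-proxy.address`; the host and port below are hypothetical:

```xml
<!-- yarn-site.xml: run the web proxy as a standalone daemon
     (started with: yarn proxyserver) instead of inside the RM -->
<property>
  <name>yarn.web-proxy.address</name>
  <value>master:9046</value>
</property>
```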

Accessing HDInsight Hive with python

落花浮王杯 submitted on 2021-02-19 05:19:39
Question: We have an HDInsight cluster with some tables in Hive. I want to query these tables from Python 3.6 on a client machine (outside Azure). I have tried PyHive, pyhs2, and also impyla, but I am running into various problems with all of them. Does anybody have a working example of accessing HDInsight Hive from Python? I have very little experience with this and don't know how to configure PyHive (which seems the most promising), especially regarding authorization. With impyla: from
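For orientation, the basic PyHive connection shape is sketched below; the host, credentials, and auth mode are placeholders, and note that HDInsight in particular exposes HiveServer2 through its HTTPS gateway rather than a directly reachable Thrift port, so a plain port-10000 connection like this will likely need adapting. This is an assumption-laden sketch, not a verified HDInsight recipe:

```python
# Sketch only: a HiveServer2 connection with PyHive (pip install 'pyhive[hive]').
# Host, username, password and auth mode are hypothetical placeholders.
from pyhive import hive

conn = hive.connect(
    host="hiveserver2.example.com",  # assumes HS2 is directly reachable
    port=10000,                      # default binary Thrift port
    username="hiveuser",
    password="secret",
    auth="LDAP",                     # password auth needs LDAP or CUSTOM mode
)
cursor = conn.cursor()
cursor.execute("SELECT * FROM some_table LIMIT 10")
for row in cursor.fetchall():
    print(row)
```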
