Cloudera

Hadoop YARN job is getting stuck at map 0% and reduce 0%

烂漫一生 submitted on 2019-12-11 01:49:23
Question: I am trying to run a very simple job to test my Hadoop setup, so I tried the Word Count example, which got stuck at 0%. I then tried some other simple jobs, and each one of them got stuck the same way:

```
14/07/14 23:55:51 INFO mapreduce.Job: Running job: job_1405376352191_0003
14/07/14 23:55:57 INFO mapreduce.Job: Job job_1405376352191_0003 running in uber mode : false
14/07/14 23:55:57 INFO mapreduce.Job: map 0% reduce 0%
```

I am using Hadoop version 2.3.0-cdh5.0.2. I did quick research on Google
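A job that hangs at map 0% reduce 0% very often means YARN never allocated containers for the tasks, typically because the NodeManager's advertised memory is too small for the requested container sizes. A minimal sketch of the relevant yarn-site.xml knobs, assuming memory is the bottleneck; the values here are illustrative only and must be tuned to the node's actual RAM:

```xml
<!-- yarn-site.xml: illustrative values, assuming a small single node. -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4096</value> <!-- total memory the NodeManager may hand out -->
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>512</value>  <!-- smallest container the scheduler will grant -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>4096</value> <!-- largest single container request allowed -->
</property>
```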

fs.defaultFS only listens to localhost's port 8020

倾然丶 夕夏残阳落幕 submitted on 2019-12-11 01:14:19
Question: I have a CDH4.3 all-in-one VM up and running, and I am trying to install a Hadoop client remotely. I noticed that, without changing any default settings, my Hadoop cluster is listening on 127.0.0.1:8020.

```
[cloudera@localhost ~]$ netstat -lent | grep 8020
tcp  0  0 127.0.0.1:8020  0.0.0.0:*  LISTEN  492  100202
[cloudera@localhost ~]$ telnet ${all-in-one vm external IP} 8020
Trying ${all-in-one vm external IP}...
telnet: connect to address ${all-in-one vm external IP} Connection refused
[cloudera
```
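The usual cause is that fs.defaultFS resolves to localhost, so the NameNode binds only to the loopback interface. A minimal sketch of the fix in core-site.xml, assuming the VM's routable hostname is vm.example.com (a placeholder) and that /etc/hosts does not map that hostname back to 127.0.0.1:

```xml
<!-- core-site.xml: vm.example.com is a placeholder for the VM's routable hostname. -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://vm.example.com:8020</value>
</property>
```

After changing this, restart HDFS and re-run the netstat check to confirm port 8020 is now bound to the external address rather than 127.0.0.1.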

HBase: /hbase/meta-region-server node does not exist

旧巷老猫 submitted on 2019-12-11 00:38:23
Question: I have installed Cloudera with HDFS, MapReduce, ZooKeeper, and HBase on it: 4 nodes running these services (3 of them ZooKeeper). All were installed by the Cloudera wizard and show no configuration issues in Cloudera. On connecting from Java I get an error:

```
09:32:23.020 [main-SendThread()] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server /172.20.7.6:2181
09:32:23.020 [main] INFO org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper - Process identifier=hconnection-0x301abf87 connecting
```
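This error usually means the client reached ZooKeeper but did not find the znode it expected there. Two usual suspects: a client/server HBase version mismatch (0.96+ clients look for /hbase/meta-region-server, while 0.94-era servers publish /hbase/root-region-server), or a wrong zookeeper.znode.parent or quorum on the client side. A minimal sketch for pinning the client configuration explicitly, assuming placeholder hostnames and an HBase 0.96+ client on the classpath:

```java
// Sketch only: hostnames are placeholders; use the quorum your cluster runs.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class HBaseConnCheck {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com");
        conf.set("hbase.zookeeper.property.clientPort", "2181");
        // Must match the server side; if it does not, the client fails with
        // "/hbase/meta-region-server node does not exist".
        conf.set("zookeeper.znode.parent", "/hbase");
        System.out.println("quorum = " + conf.get("hbase.zookeeper.quorum"));
    }
}
```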

Running any Hadoop command fails after enabling security.

末鹿安然 submitted on 2019-12-10 21:29:48
Question: I was trying to enable Kerberos for my CDH 4.3 (via Cloudera Manager) test bed. After changing authentication from Simple to Kerberos in the web UI, I am unable to do any Hadoop operations, as shown below. Is there any way to specify the keytab explicitly?

```
[root@host-dn15 ~]# su - hdfs
-bash-4.1$ hdfs dfs -ls /
13/09/10 08:15:35 ERROR security.UserGroupInformation: PriviledgedActionException as:hdfs (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by
```
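Once Kerberos is enabled, every Hadoop command needs a valid ticket in the invoking user's credential cache, and a keytab lets you obtain one non-interactively with kinit -kt. A sketch, where the principal, realm, and keytab path are placeholders to be replaced with your cluster's actual values:

```
# Obtain a ticket from a keytab (principal, realm, and path are placeholders).
kinit -kt /etc/hadoop/conf/hdfs.keytab hdfs/host-dn15.example.com@EXAMPLE.COM
# Confirm the ticket was granted, then retry the failing command.
klist
hdfs dfs -ls /
```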

Comma delimited string to individual rows - Impala SQL

北慕城南 submitted on 2019-12-10 21:26:08
Question: Let's suppose we have a table:

```
Owner | Pets
------------------------------
Jack  | "dog, cat, crocodile"
Mary  | "bear, pig"
```

I want to get as a result:

```
Owner | Pets
------------------------------
Jack  | "dog"
Jack  | "cat"
Jack  | "crocodile"
Mary  | "bear"
Mary  | "pig"
```

I found some solutions to similar problems by googling, but Impala SQL does not offer the capabilities needed to apply the suggested solutions. Any help would be greatly appreciated!

Answer 1: The following works in Impala: split_part
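The answer excerpt cuts off right after split_part. A common pattern built on that function (a sketch, not the verbatim answer, assuming an Impala version that ships split_part() and a known upper bound on list length; my_table, owner, and pets are placeholder names) is to cross-join the table against a small inline numbers view and pull out the n-th element:

```sql
-- Sketch: assumes at most 3 comma-separated items per row;
-- extend the inline numbers view for longer lists.
SELECT t.owner,
       trim(split_part(t.pets, ',', n.n)) AS pet
FROM my_table t
CROSS JOIN (SELECT 1 AS n UNION ALL SELECT 2 UNION ALL SELECT 3) n
WHERE split_part(t.pets, ',', n.n) IS NOT NULL
  AND split_part(t.pets, ',', n.n) <> '';  -- drop out-of-range positions
```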

Copy Solr HDFS Data to another Cluster

十年热恋 submitted on 2019-12-10 20:07:33
Question: I have a SolrCloud (v4.10) installation that sits on top of Cloudera (CDH 5.4.2) HDFS, with 3 Solr instances each hosting a shard of each core. I am looking for a way to incrementally copy the Solr data from our production cluster to our development cluster. There are 3 cores, but I am only interested in copying one of them. I have tried to use the Solr replication backup and restore, but that doesn't seem to load anything into the dev cluster.

http://host:8983/solr/core/replication?command
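Since the index files live in HDFS, one approach worth trying (a sketch with placeholder NameNode addresses and paths, and assuming the target collection is stopped or quiesced so the index is not written mid-copy) is DistCp, whose -update flag only transfers files that changed, giving incremental behavior:

```
# Incremental copy of one core's HDFS index between clusters (paths are placeholders).
hadoop distcp -update \
  hdfs://prod-nn:8020/solr/collection1/core_node1/data \
  hdfs://dev-nn:8020/solr/collection1/core_node1/data
```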

Sqoop job fails with KiteSDK validation error for Oracle import

僤鯓⒐⒋嵵緔 submitted on 2019-12-10 19:46:39
Question: I am attempting to run a Sqoop job to load from an Oracle DB into Parquet format on a Hadoop cluster. The job is incremental. Sqoop version is 1.4.6, Oracle version is 12c, and Hadoop version is 2.6.0 (distro is Cloudera 5.5.1). The Sqoop command (this creates the job, and executes it) is:

```
$ sqoop job -fs hdfs://<HADOOPNAMENODE>:8020 \
    --create myJob \
    -- import \
    --connect jdbc:oracle:thin:@<DBHOST>:<DBPORT>/<DBNAME> \
    --username <USERNAME> \
    -P \
    --as-parquetfile \
    --table <USERNAME>.
```
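For context, a frequently cited cause of KiteSDK validation errors on Oracle imports is the schema-qualified table name: with --as-parquetfile, Sqoop 1.4.6 derives a Kite dataset name from --table, and the dot in OWNER.TABLE fails Kite's name validation. A hedged sketch of the commonly suggested workaround, unverified against this exact setup (connect as the schema owner and drop the qualifier; <TABLE> is a placeholder):

```
# Workaround sketch: drop the schema qualifier so the Kite dataset name has no dot.
sqoop import \
  --connect jdbc:oracle:thin:@<DBHOST>:<DBPORT>/<DBNAME> \
  --username <USERNAME> -P \
  --as-parquetfile \
  --table <TABLE>
```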

hadoop, python, subprocess failed with code 127

谁说胖子不能爱 submitted on 2019-12-10 15:18:04
Question: I'm trying to run a very simple task with MapReduce.

mapper.py:

```python
#!/usr/bin/env python
import sys
for line in sys.stdin:
    print line
```

My txt file:

```
qwerty
asdfgh
zxc
```

Command line to run the job:

```
hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.6.0-mr1-cdh5.8.0.jar \
  -input /user/cloudera/In/test.txt \
  -output /user/cloudera/test \
  -mapper /home/cloudera/Documents/map.py \
  -file /home/cloudera/Documents/map.py
```

Error:

```
INFO mapreduce.Job: Task Id : attempt_1490617885665
```
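Exit code 127 is the shell's "command not found": the streaming child could not execute the mapper, usually because the shebang line is broken by Windows line endings or the script is not executable. A sketch of the usual checks, using the paths from the question (dos2unix may need to be installed first):

```
# Strip CR characters that silently break the "#!/usr/bin/env python" shebang.
dos2unix /home/cloudera/Documents/map.py
# Make sure the script is executable.
chmod +x /home/cloudera/Documents/map.py
# Alternatively, bypass the shebang by naming the interpreter explicitly:
#   -mapper "python map.py"
```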

On-Premises vs Cloud: Who Will Be the Last Survivor of the Big Data Battle? — An InfoQ Interview with Guan Tao, Head of Alibaba Cloud Intelligence's General-Purpose Computing Platform

℡╲_俬逩灬. submitted on 2019-12-10 05:14:17
Abstract: Is on-premises big data service counting down to extinction? Will cloud big data services ultimately converge on multi-cloud, hybrid cloud, or a single public cloud? Is it a misconception or a fact that as cluster sizes grow, the cost of moving to the cloud becomes unbearable? InfoQ put these questions to Guan Tao, head of Alibaba Cloud Intelligence's general-purpose computing platform, in an exclusive interview. Author: Zhao Yuying. Original title: On-Premises vs Cloud: Who Will Be the Last Survivor of the Big Data Battle?

When does an enterprise decide to move to the cloud? In the past, the answer might have been: when an enterprise finds it needs to buy new hardware and commit to a new round of capital investment, it tends to consider an alternative such as the cloud, largely for cost reasons; or, when the enterprise has some elastic computing demand, a cloud platform is an excellent way to "shave the peaks" of IT resource usage.

Today, beyond "replacement" within existing technical boundaries, the answer gains another entry: "expansion" of the technical boundary. When an enterprise needs a capability such as AI or big data, but its own technical strength falls short or its core competitiveness does not lie in technology itself, it may consider moving to the cloud; this has even become an important reason many enterprises choose cloud platforms. By choosing a cloud platform, an enterprise expands its own technical boundary, and thereby gains the technical backing to expand its business boundary.

Over the past few years, cloud big data services have grown increasingly mature; on this front alone, mainstream cloud vendors now offer dozens of services, while the voice of on-premises big data services seems ever weaker, especially after the merger of Cloudera and Hortonworks. Some analysts have pointed out that the convergence of Hadoop with streaming technologies such as Spark/Flink has already happened on cloud platforms, which has left Cloudera and

Where is Mapper output saved in Hadoop?

拥有回忆 submitted on 2019-12-09 21:18:21
Question: I am interested in efficiently managing Hadoop's shuffle traffic and utilizing the network bandwidth effectively. To do this, I want to know how much shuffle traffic is generated by each DataNode. Shuffle traffic is nothing but the output of the mappers, so where is this mapper output saved? How can I get the size of the mapper output from each DataNode in real time? I appreciate your help. I have created a directory to store this mapper output, as below:

```
<property>
  <name>mapred.local.dir</name>
```
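Mapper output is intermediate data: it is spilled to local disk under mapred.local.dir on each node, not written to HDFS. There is no built-in real-time, per-DataNode shuffle gauge, but per-job totals are exposed as counters; a sketch of reading one from the CLI (the job ID is a placeholder):

```
# Total bytes emitted by all map tasks of one job (job ID is a placeholder).
hadoop job -counter <job_id> \
  org.apache.hadoop.mapreduce.TaskCounter MAP_OUTPUT_BYTES
```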