cloudera-cdh

“No common protection layer between client and server” while trying to communicate with a kerberized Hadoop cluster

爷,独闯天下 submitted on 2019-12-06 12:26:53
Question: I'm trying to communicate programmatically with a kerberized Hadoop cluster (CDH 5.3 / HDFS 2.5.0). I have a valid Kerberos token on the client side, but I'm getting the error "No common protection layer between client and server". What does this error mean, and is there a way to fix or work around it? Is this related to HDFS-5688? That ticket seems to imply that the property "hadoop.rpc.protection" must be set, presumably to "authentication" (also per e.g. this).
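The error generally means the client's hadoop.rpc.protection value does not overlap with what the NameNode advertises (one of authentication, integrity, or privacy). A minimal Scala sketch of setting the property programmatically; "privacy" is only an assumption here and must be replaced with whatever the cluster's core-site.xml actually uses:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.security.UserGroupInformation

    object HdfsClient {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()
        // Must match (or overlap with) the server-side setting;
        // "privacy" is an assumed value -- check the cluster's core-site.xml.
        conf.set("hadoop.rpc.protection", "privacy")
        conf.set("hadoop.security.authentication", "kerberos")

        UserGroupInformation.setConfiguration(conf)

        // Assumes a valid Kerberos ticket is already in the ticket cache.
        val fs = FileSystem.get(conf)
        fs.listStatus(new Path("/")).foreach(s => println(s.getPath))
      }
    }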

spark-submit yarn-cluster with --jars does not work?

久未见 submitted on 2019-12-05 18:48:52
I am trying to submit a Spark job to the CDH YARN cluster with the following commands. I have tried several combinations and none of them work. I now have all the POI jars both in my local /root directory and in HDFS under /user/root/lib, and I have tried:

    spark-submit --master yarn-cluster --class "ReadExcelSC" ./excel_sc.jar --jars /root/poi-3.12.jars, /root/poi-ooxml-3.12.jar, /root/poi-ooxml-schemas-3.12.jar
    spark-submit --master yarn-cluster --class "ReadExcelSC" ./excel_sc.jar --jars file:/root/poi-3.12.jars, file:/root/poi-ooxml-3.12.jar, file:/root/poi-ooxml-schemas-3.12
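Two likely problems with the commands as written: --jars expects a single comma-separated list with no spaces, and all spark-submit options must come before the application jar, because everything after the jar is passed to the application as arguments. A sketch of the corrected invocation under those assumptions (the jar paths are copied from the question; "poi-3.12.jars" in the original looks like a typo for "poi-3.12.jar"):

    spark-submit --master yarn-cluster --class "ReadExcelSC" \
      --jars /root/poi-3.12.jar,/root/poi-ooxml-3.12.jar,/root/poi-ooxml-schemas-3.12.jar \
      ./excel_sc.jar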

Compare data in two RDDs in Spark

半城伤御伤魂 submitted on 2019-12-05 01:57:54
Question: I am able to print the data in two RDDs with the code below: usersRDD.foreach(println) and empRDD.foreach(println). I need to compare the data in the two RDDs. How can I iterate over the field data in one RDD and compare it with the field data in the other? E.g.: iterate over the records and check whether the name and age in userRDD have a matching record in empRDD; if not, put the record in a separate RDD. I tried userRDD.subtract(empRDD), but that compares all the fields. Answer 1: You need to key the data in each RDD so that there is something to

Cloudera/CDH v6.1.x + Python HappyBase v1.1.0: TTransportException(type=4, message='TSocket read 0 bytes')

浪子不回头ぞ submitted on 2019-12-04 21:58:53
EDIT: This question and answer apply to anyone experiencing the exception stated in the subject line: TTransportException(type=4, message='TSocket read 0 bytes'), whether or not Cloudera and/or HappyBase is involved. The root issue (as it turned out) stems from the client-side protocol and/or transport format not matching what the server side is implementing, and this can happen with any client/server pairing. Mine just happened to be Cloudera and HappyBase, but yours needn't be, and you can run into this same issue. Has anyone recently tried using happybase v1.1.0 (latest)

Job keeps running in LocalJobRunner under Cloudera 5.1

陌路散爱 submitted on 2019-12-04 06:16:07
Question: Need some quick help. Our job runs fine under MapR, but when we start the same job on Cloudera 5.1 it keeps running in local mode. I am sure this is some kind of configuration issue. Which config setting is it?

    14/08/22 12:16:58 INFO mapreduce.Job: map 0% reduce 0%
    14/08/22 12:17:03 INFO mapred.LocalJobRunner: map > map
    14/08/22 12:17:06 INFO mapred.LocalJobRunner: map > map
    14/08/22 12:17:09 INFO mapred.LocalJobRunner: map > map

Thanks. Answer 1: The problem was that Cloudera 5.1 runs 'Yarn'
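A job falls back to LocalJobRunner when the client configuration leaves mapreduce.framework.name at its default of "local", typically because the cluster's mapred-site.xml and yarn-site.xml are not on the client classpath. A minimal Scala sketch of setting it programmatically; the ResourceManager address below is a placeholder, not a value from the question:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.mapreduce.Job

    object SubmitToYarn {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()
        // Without this the client defaults to "local" and uses LocalJobRunner.
        conf.set("mapreduce.framework.name", "yarn")
        // Placeholder -- point this at your actual ResourceManager.
        conf.set("yarn.resourcemanager.address", "resourcemanager.example.com:8032")

        val job = Job.getInstance(conf, "example-job")
        // ... set mapper, reducer, and input/output paths here ...
        // job.waitForCompletion(true)
      }
    }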

Getting an error while building Apache Zeppelin

空扰寡人 submitted on 2019-12-04 05:37:40
Question: I already have Hadoop set up with Cloudera. I wanted to install Zeppelin to connect to Hive and build a UI for my queries. While building Zeppelin with the following command:

    sudo mvn clean package -Pspark-1.3 -Dspark.version=1.3.0 -Dhadoop.version=2.6.0-cdh5.4.7 -Phadoop-2.6 -Pyarn -DskipTests

I get this error in the web-application module:

    [ERROR] npm ERR! Linux 3.19.0-71-generic
    [ERROR] npm ERR! argv "/home/zeppelin/incubator-zeppelin/zeppelin-web/node/node" "/home

Spark cached RDD doesn't show up on Spark History WebUI - Storage

孤人 submitted on 2019-12-04 01:56:25
Question: I am using Spark 1.4.1 on CDH 5.4.4. I call rdd.cache(), but nothing shows up in the Storage tab of the Spark History WebUI. Does anyone have the same issue? How can I fix it? Answer 1: Your RDD will only be cached once it has been evaluated; the most common way to force evaluation (and therefore populate the cache) is to call count, e.g.:

    rdd.cache() // Nothing in the Storage page yet & nothing cached
    rdd.count() // RDD evaluated, cached & visible in the Storage page

Source: https://stackoverflow.com/questions/31715698/spark

Compare data in two RDDs in Spark

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-03 17:36:51
I am able to print the data in two RDDs with the code below: usersRDD.foreach(println) and empRDD.foreach(println). I need to compare the data in the two RDDs. How can I iterate over the field data in one RDD and compare it with the field data in the other? E.g.: iterate over the records and check whether the name and age in userRDD have a matching record in empRDD; if not, put the record in a separate RDD. I tried userRDD.subtract(empRDD), but that compares all the fields. Sean Owen: You need to key the data in each RDD so that there is something to join records on. Have a look at groupBy, for example. Then you join the resulting RDDs. For each key
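Since the answer above is cut off, here is a minimal Scala sketch of the keying-then-comparing idea it describes. The field names (name, age) come from the question, but the tuple layout of the records is assumed for illustration:

    import org.apache.spark.{SparkConf, SparkContext}

    object CompareRdds {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("compare-rdds"))

        // Assumed record shape: (name, age, otherField) -- adjust to the real types.
        val usersRDD = sc.parallelize(Seq(("alice", 30, "x"), ("bob", 25, "y")))
        val empRDD   = sc.parallelize(Seq(("alice", 30, "z")))

        // Key both RDDs on the fields being compared (name, age).
        val usersByKey = usersRDD.map(u => ((u._1, u._2), u))
        val empByKey   = empRDD.map(e => ((e._1, e._2), e))

        // Users whose (name, age) has no matching record in empRDD,
        // collected into a separate RDD as the question asks.
        val unmatched = usersByKey.subtractByKey(empByKey).values

        unmatched.collect().foreach(println)
        sc.stop()
      }
    }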

Excluding the spark-core dependency in CDH

不羁岁月 submitted on 2019-12-02 08:33:22
I'm using Spark Structured Streaming to write data coming from Kafka to HBase. My cluster distribution is Hadoop 3.0.0-cdh6.2.0, and I'm using Spark 2.4.0. My code looks like this:

    val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", bootstrapServers)
      .option("subscribe", topic)
      .option("failOnDataLoss", false)
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .as(Encoders.STRING)

    df.writeStream
      .foreachBatch { (batchDF: Dataset[Row], batchId: Long) =>
        batchDF.write
          .options(Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable ->
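The title refers to excluding spark-core from a dependency. Since the question does not show its build file, here is only a rough sketch of what such an exclusion looks like in an sbt build; the connector coordinates (shc-core) and versions are assumptions, not taken from the question:

    // build.sbt -- hypothetical coordinates for illustration only.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-sql" % "2.4.0" % "provided",
      // Exclude the transitively pulled spark-core so the CDH-provided one is used.
      ("com.hortonworks" % "shc-core" % "1.1.1-2.1-s_2.11")
        .exclude("org.apache.spark", "spark-core_2.11")
    )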

Is it possible to load parquet table directly from file?

岁酱吖の submitted on 2019-12-02 04:47:43
Question: If I have a binary data file (which can be converted to CSV format), is there any way to load a Parquet table directly from it? Many tutorials show loading a CSV file into a text table and then loading the text table into a Parquet table. From an efficiency point of view, is it possible to load a Parquet table directly from a binary file like the one I already have, ideally using the CREATE EXTERNAL TABLE command? Or do I need to convert it to a CSV file first? Is there any file format restriction? Answer 1: Unfortunately it is
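Since the answer is cut off here, one common alternative to the text-table route (using Spark rather than the CREATE EXTERNAL TABLE command the question asks about) is to write the data out as Parquet directly once it is in CSV form. A minimal sketch, with the paths as placeholders:

    import org.apache.spark.sql.SparkSession

    object CsvToParquet {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("csv-to-parquet").getOrCreate()

        // Placeholder paths -- replace with the real CSV input and Parquet output locations.
        val df = spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("/data/input.csv")

        // Writes Parquet files that a Hive/Impala external table can then point at.
        df.write.parquet("/data/output_parquet")

        spark.stop()
      }
    }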