apache-spark

Spark-HBase - GCP template (3/3) - Missing libraries?

Submitted by 跟風遠走 on 2021-01-15 19:36:07

Question: I'm trying to test the Spark-HBase connector in the GCP context and tried to follow the instructions, which ask you to package the connector locally. After completing those steps, I get the following error when submitting the job on Dataproc.

Command:

(base) gcloud dataproc jobs submit spark --cluster $SPARK_CLUSTER --class com.example.bigtable.spark.shc.BigtableSource --jars target/scala-2.11/cloud-bigtable-dataproc-spark-shc-assembly-0.1.jar --region us-east1 -- $BIGTABLE_TABLE

Error
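For context, once the assembly jar and its transitive dependencies are actually on the driver and executor classpath, an SHC read is driven by a JSON catalog. Below is a minimal PySpark sketch of that catalog-based read following the shc-core documentation; the table name and column mapping are placeholders, and this only illustrates what the BigtableSource example presumably wires up, not the template's code.

import json

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shc-read-sketch").getOrCreate()

# Placeholder SHC catalog: table name, row key and column mapping are made up.
catalog = json.dumps({
    "table": {"namespace": "default", "name": "my_table"},
    "rowkey": "key",
    "columns": {
        "col0": {"cf": "rowkey", "col": "key", "type": "string"},
        "col1": {"cf": "cf1", "col": "col1", "type": "string"}
    }
})

# SHC is a Spark SQL data source; this fails with missing-class errors if
# shc-core and its HBase/Bigtable client dependencies are not on both the
# driver and executor classpaths.
df = (spark.read
      .options(catalog=catalog)
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .load())
df.show()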

How to calculate difference between dates excluding weekends in Pyspark 2.2.0

Submitted by |▌冷眼眸甩不掉的悲伤 on 2021-01-07 06:50:49

Question: I have the PySpark DataFrame below, which can be recreated with:

df = spark.createDataFrame([(1, "John Doe", "2020-11-30"), (2, "John Doe", "2020-11-27"), (3, "John Doe", "2020-11-29")], ("id", "name", "date"))

+---+--------+----------+
| id|    name|      date|
+---+--------+----------+
|  1|John Doe|2020-11-30|
|  2|John Doe|2020-11-27|
|  3|John Doe|2020-11-29|
+---+--------+----------+

I am looking to create a UDF to calculate the difference between two rows of dates (using the lag function), excluding weekends, as
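The excerpt cuts off here. For illustration only (this is not the asker's code), here is a minimal PySpark sketch of the pattern the question describes: lag over a window plus a plain Python UDF that skips Saturdays and Sundays, which avoids any API newer than Spark 2.2.0.

from datetime import datetime, timedelta

from pyspark.sql import SparkSession, functions as F, types as T
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("weekday-diff-sketch").getOrCreate()

df = spark.createDataFrame(
    [(1, "John Doe", "2020-11-30"), (2, "John Doe", "2020-11-27"), (3, "John Doe", "2020-11-29")],
    ("id", "name", "date"))

def weekday_diff(d1, d2):
    # Count Mon-Fri days in the half-open interval [earlier date, later date).
    if d1 is None or d2 is None:
        return None
    a, b = sorted(datetime.strptime(d, "%Y-%m-%d") for d in (d1, d2))
    return sum(1 for i in range((b - a).days) if (a + timedelta(days=i)).weekday() < 5)

weekday_diff_udf = F.udf(weekday_diff, T.IntegerType())

# Pair each row with the previous date for the same name, then apply the UDF.
w = Window.partitionBy("name").orderBy("date")
result = (df
          .withColumn("prev_date", F.lag("date").over(w))
          .withColumn("weekday_diff", weekday_diff_udf("prev_date", "date")))
result.show()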

How to find the argmax of a vector in PySpark ML

Submitted by ∥☆過路亽.° on 2021-01-07 05:48:27

Question: My model has output a DenseVector column, and I'd like to find the argmax. This page suggests such a function should be available, but I'm not sure what the syntax should be. Is it df.select("mycolumn").argmax()?

Answer 1: I could not find documentation for an argmax operation in Python, but you can get there by converting the vectors to arrays. For PySpark 3.0.0:

from pyspark.ml.functions import vector_to_array
tst_arr = tst_df.withColumn("arr", vector_to_array(F.col('vector_column')))
tst_max = tst_arr.withColumn
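The answer excerpt is truncated. As a hedged way to finish the idea (my own sketch, not the original answer): once the vector is an array, Spark's built-in array_max and array_position can compute the argmax without a UDF. The toy data and column names below are placeholders.

from pyspark.ml.functions import vector_to_array
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("argmax-sketch").getOrCreate()

# Toy stand-in for the model output: a DenseVector column named 'vector_column'.
tst_df = spark.createDataFrame(
    [(Vectors.dense([0.1, 0.7, 0.2]),), (Vectors.dense([0.5, 0.3, 0.2]),)],
    ["vector_column"])

tst_arr = tst_df.withColumn("arr", vector_to_array(F.col("vector_column")))

# array_position is 1-based, so subtract 1 for a conventional 0-based argmax.
# If the maximum occurs more than once, the first index is returned.
tst_argmax = tst_arr.withColumn("argmax", F.expr("array_position(arr, array_max(arr)) - 1"))
tst_argmax.show()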

Greenplum-Spark-Connector java.util.NoSuchElementException: None.get

Submitted by 孤者浪人 on 2021-01-07 04:04:43

Question: My working environment is as follows:
. Hadoop 2.7.2
. Spark 2.3.0
. Greenplum 6.8.1 <- as far as I know, this is the latest version.

I have to create a DataFrame (RDD) from a GPDB table, so I learned about the "Greenplum-Spark-Connector". The architecture sounds good, but it does not work. I tried this:

spark/bin$ spark-shell --master spark://10.40.203.99:7077 --jars /data2/install_files/greenplum-spark_2.11-1.6.2.jar,/data2/install_files/postgresql-42.2.5.jar,/data2/install_files/jetty-io-9.2.26.v20180806.jar,
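The jar list is truncated above. Since the PostgreSQL JDBC driver is already among those jars and Greenplum speaks the PostgreSQL wire protocol, one way to sanity-check connectivity independently of the Greenplum-Spark connector is a plain JDBC read. This is a hedged PySpark sketch: host, database, table and credentials are placeholders, and this path does not use the connector's parallel transfer.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gpdb-jdbc-check").getOrCreate()

# Plain JDBC read through the PostgreSQL driver; connection details are placeholders.
gpdf = (spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://gpdb-master-host:5432/mydb")
        .option("dbtable", "public.my_table")
        .option("user", "gpadmin")
        .option("password", "secret")
        .option("driver", "org.postgresql.Driver")
        .load())

gpdf.printSchema()
gpdf.show(5)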

Spark optimization - joins - very low number of tasks - OOM

Submitted by 蓝咒 on 2021-01-07 03:59:30

Question: My Spark application fails with this error:

Exit status: 143. Diagnostics: Container killed on request. Exit code is 143

This is what I get when I inspect the container log:

java.lang.OutOfMemoryError: Java heap space

My application mainly gets a table and then joins it with different tables that I read from AWS S3:

var result = readParquet(table1)
val table2 = readParquet(table2)
result = result.join(table2 , result(primaryKey) === table2(foreignKey))
val table3 = readParquet(table3)
result = result
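The Scala snippet is cut off, but the combination of very few tasks and a heap OOM during joins usually points at too little shuffle parallelism and/or one side being small enough to broadcast. As an illustration only, here is the same shape of pipeline in PySpark with those two knobs applied; paths, key names and the partition count are placeholders.

from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("join-parallelism-sketch")
         # More shuffle partitions -> more, smaller join tasks (default is 200).
         .config("spark.sql.shuffle.partitions", "400")
         .getOrCreate())

def read_parquet(path):
    return spark.read.parquet(path)

result = read_parquet("s3a://bucket/table1")
table2 = read_parquet("s3a://bucket/table2")

# If table2 fits in executor memory, hint a broadcast join so the big side
# does not need to be shuffled at all.
result = result.join(F.broadcast(table2),
                     result["primaryKey"] == table2["foreignKey"])

# Otherwise, repartitioning the big side on the join key before the join
# spreads the work over more tasks:
# result = result.repartition(400, "primaryKey").join(table2, ...)

result.write.mode("overwrite").parquet("s3a://bucket/output")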

sparklyr mutate behaviour with stringr

Submitted by 守給你的承諾、 on 2021-01-07 03:52:43

Question: I am trying to use sparklyr to process a Parquet file. The table has the structure:

type:str | type:str  | type:str
key      | requestid | operation

I am running the code:

txt %>% select(key, requestid, operation) %>% mutate(object = stringr::str_split(key, '/', simplify=TRUE) %>% dplyr::last())

where txt is a valid Spark frame. I get:

Error in stri_split_regex(string, pattern, n = n, simplify = simplify, : object 'key' not found

Traceback:
1. txt2 %>% select(key, requestid, operation) %>% mutate
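The traceback is truncated. The error typically means the stringr call is being evaluated locally in R rather than translated to Spark SQL. For reference, the Spark-side operation the mutate has to map to is just split() plus taking the last element; below is a hedged PySpark sketch of that underlying operation (sample rows and values are made up).

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("split-last-sketch").getOrCreate()

# Toy stand-in for the Parquet data: key / requestid / operation, all strings.
txt = spark.createDataFrame(
    [("bucket/prefix/object1.csv", "r1", "GET"),
     ("bucket/prefix/deeper/object2.csv", "r2", "PUT")],
    ["key", "requestid", "operation"])

# Spark SQL equivalent of "split the key on '/' and keep the last piece":
# element_at with a negative index counts from the end (Spark 2.4+).
out = txt.withColumn("object", F.element_at(F.split("key", "/"), -1))
out.show(truncate=False)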

Apache Spark SQL get_json_object java.lang.String cannot be cast to org.apache.spark.unsafe.types.UTF8String

Submitted by 别说谁变了你拦得住时间么 on 2021-01-07 03:38:30

Question: I am trying to read a JSON stream from an MQTT broker in Apache Spark with Structured Streaming, read some properties of each incoming JSON message, and output them to the console. My code looks like this:

val spark = SparkSession
  .builder()
  .appName("BahirStructuredStreaming")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

val topic = "temp"
val brokerUrl = "tcp://localhost:1883"

val lines = spark.readStream
  .format("org.apache.bahir.sql.streaming.mqtt.MQTTStreamSourceProvider")
  .option(
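The code is truncated before the MQTT options, and without the full stack trace it is hard to say where the UTF8String cast fails. For reference only, a common way to pull properties out of a JSON string column in Structured Streaming is from_json with an explicit schema; the PySpark sketch below uses the built-in socket source as a stand-in for the Bahir MQTT source, and the payload schema is an assumption.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = (SparkSession.builder
         .appName("json-parse-sketch")
         .master("local[*]")
         .getOrCreate())

# Placeholder source: the socket source stands in for the Bahir MQTT source,
# purely to provide a streaming string column named 'value'.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Assumed payload shape, e.g. {"sensor": "temp-1", "value": 21.5}
schema = StructType([
    StructField("sensor", StringType()),
    StructField("value", DoubleType()),
])

# Cast the payload to string explicitly, then parse it with from_json and an
# explicit schema instead of calling get_json_object on a non-string column.
parsed = (lines
          .select(F.from_json(F.col("value").cast("string"), schema).alias("msg"))
          .select("msg.*"))

query = (parsed.writeStream
         .format("console")
         .outputMode("append")
         .start())
query.awaitTermination()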

Getting “org.apache.zeppelin.interpreter.InterpreterException: java.io.IOException: Interpreter process is not running null”

Submitted by こ雲淡風輕ζ on 2021-01-07 02:49:56

Question: Hi, I am on Docker on Mac (Kubernetes enabled) and trying to deploy Zeppelin on Kubernetes by following https://zeppelin.apache.org/docs/0.9.0-SNAPSHOT/quickstart/kubernetes.html. After deploying the Zeppelin server on Kubernetes, I am trying to run the Spark example but get the following exception:

org.apache.zeppelin.interpreter.InterpreterException: java.io.IOException: Interpreter process is not running
null
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.open(RemoteInterpreter.java:134)
    at org.apache