pyspark

Can a PySpark Kernel (JupyterHub) run in yarn-client mode?

大兔子大兔子 submitted on 2021-02-08 10:34:00
Question: My current setup: a Spark EC2 cluster with HDFS and YARN, JupyterHub (0.7.0), and a PySpark kernel on Python 2.7. The very simple code that I am using for this question:

    rdd = sc.parallelize([1, 2])
    rdd.collect()

The PySpark kernel that works as expected in Spark standalone has the following environment variable in the kernel json file:

    "PYSPARK_SUBMIT_ARGS": "--master spark://<spark_master>:7077 pyspark-shell"

However, when I try to run in yarn-client mode it gets stuck forever, while the log…
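
For reference, a minimal sketch of the yarn-client setup such a kernel would need, expressed here as environment variables set from Python rather than in the kernel json's "env" block; the paths, executor settings and the HADOOP_CONF_DIR location are assumptions, not values from the original post.

    import os

    # Spark finds the YARN ResourceManager through the cluster's Hadoop
    # configuration directory (assumed path below).
    os.environ["HADOOP_CONF_DIR"] = "/etc/hadoop/conf"
    os.environ["SPARK_HOME"] = "/opt/spark"

    # Instead of "--master spark://<spark_master>:7077", point the shell at YARN.
    os.environ["PYSPARK_SUBMIT_ARGS"] = (
        "--master yarn --deploy-mode client "
        "--num-executors 2 --executor-memory 1g pyspark-shell"
    )

    from pyspark import SparkContext
    sc = SparkContext()   # picks up PYSPARK_SUBMIT_ARGS set above
    print(sc.parallelize([1, 2]).collect())

The same keys can go into the kernel json's "env" section; the essential change is that PYSPARK_SUBMIT_ARGS targets yarn and that HADOOP_CONF_DIR is visible to the kernel process.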

How to extract column name and column type from SQL in pyspark

孤街醉人 submitted on 2021-02-08 10:01:53
Question: The Spark SQL syntax for a CREATE query is like this:

    CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name
      [(col_name1 col_type1 [COMMENT col_comment1], ...)]
      USING datasource
      [OPTIONS (key1=val1, key2=val2, ...)]
      [PARTITIONED BY (col_name1, col_name2, ...)]
      [CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS]
      [LOCATION path]
      [COMMENT table_comment]
      [TBLPROPERTIES (key1=val1, key2=val2, ...)]
      [AS select_statement]

where [x] means x is optional. I want the output as a tuple of…
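
A minimal sketch of one way to pull (column name, column type) tuples out of such a statement with a regular expression; the extract_columns helper and the sample DDL are illustrative, and the pattern does not handle nested types such as decimal(10,2).

    import re

    def extract_columns(create_sql):
        """Return [(col_name, col_type), ...] parsed from the (...) column list
        of a CREATE TABLE statement. Simplified: assumes the column list is the
        first parenthesised block before USING."""
        match = re.search(r"\((.*?)\)\s*USING", create_sql, re.IGNORECASE | re.DOTALL)
        if not match:
            return []
        columns = []
        for col_def in match.group(1).split(","):
            parts = col_def.strip().split()
            if len(parts) >= 2:
                columns.append((parts[0], parts[1]))
        return columns

    ddl = "CREATE TABLE IF NOT EXISTS db1.t1 (id INT COMMENT 'pk', name STRING) USING parquet"
    print(extract_columns(ddl))   # [('id', 'INT'), ('name', 'STRING')]

If the table is actually created, spark.catalog.listColumns("t1", "db1") is another way to get names and types without parsing the DDL yourself.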

Convert Scala code to PySpark: Word2Vec Scala Transform Routine

泄露秘密 submitted on 2021-02-08 10:01:31
Question: I want to translate the following routine from the class Word2VecModel (https://github.com/apache/spark/blob/branch-2.3/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala) into PySpark:

    override def transform(dataset: Dataset[_]): DataFrame = {
      transformSchema(dataset.schema, logging = true)
      val vectors = wordVectors.getVectors
        .mapValues(vv => Vectors.dense(vv.map(_.toDouble)))
        .map(identity) // mapValues doesn't return a serializable map (SI-7005)
      val bVectors = dataset.sparkSession…
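
A minimal sketch of how that routine might be expressed in PySpark, assuming a fitted pyspark.ml.feature.Word2VecModel named model and a dataset with a tokenized column named "text" (both names are illustrative); this is a sketch of the same idea, not a line-for-line port of the Scala.

    from pyspark.sql import functions as F
    from pyspark.ml.linalg import Vectors, VectorUDT
    import numpy as np

    # Collect the learned word vectors to the driver and broadcast them,
    # mirroring the bVectors broadcast in the Scala code.
    vectors = {row["word"]: row["vector"].toArray()
               for row in model.getVectors().collect()}
    b_vectors = spark.sparkContext.broadcast(vectors)
    vector_size = len(next(iter(vectors.values())))

    def sentence_to_vector(words):
        # Sum the vectors of the words found in the vocabulary and scale by
        # the sentence length; an empty sentence yields the zero vector.
        total = np.zeros(vector_size)
        for w in words:
            v = b_vectors.value.get(w)
            if v is not None:
                total += v
        if words:
            total = total / len(words)
        return Vectors.dense(total)

    to_vector_udf = F.udf(sentence_to_vector, VectorUDT())
    result = dataset.withColumn("features", to_vector_udf(F.col("text")))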

Calculate UDF once

匆匆过客 submitted on 2021-02-08 10:00:12
Question: I want to have a UUID column in a PySpark dataframe that is calculated only once, so that I can select the column in a different dataframe and have the UUIDs be the same. However, the UDF for the UUID column is recalculated when I select the column. Here's what I'm trying to do:

    >>> uuid_udf = udf(lambda: str(uuid.uuid4()), StringType())
    >>> a = spark.createDataFrame([[1, 2]], ['col1', 'col2'])
    >>> a = a.withColumn('id', uuid_udf())
    >>> a.collect()
    [Row(col1=1, col2=2, id='5ac8f818-e2d8-4c50…
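
A minimal sketch of a common workaround, assuming it is acceptable to materialise the dataframe: mark the UDF as non-deterministic and cache the dataframe so the generated IDs are computed once and reused by later selections.

    import uuid
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # asNondeterministic() tells the optimizer not to treat the UDF as a pure
    # function, and cache() materialises the generated IDs so later selections
    # reuse them instead of re-running the lambda.
    uuid_udf = udf(lambda: str(uuid.uuid4()), StringType()).asNondeterministic()

    a = spark.createDataFrame([[1, 2]], ['col1', 'col2'])
    a = a.withColumn('id', uuid_udf()).cache()
    a.count()                      # force the cached values to be computed once
    b = a.select('col1', 'id')     # sees the same UUIDs as a.collect()

If the cache can be evicted, checkpointing the dataframe or writing it out and reading it back is a more robust way to freeze the generated values.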

pyspark structured streaming write to parquet in batches

牧云@^-^@ submitted on 2021-02-08 09:51:55
Question: I am doing some transformations on a Spark Structured Streaming dataframe and storing the transformed dataframe as parquet files in HDFS. Now I want the write to HDFS to happen in batches, instead of transforming the whole dataframe first and then storing it.

Answer 1: Here is a parquet sink example:

    # parquet sink example
    targetParquetHDFS = sourceTopicKAFKA
        .writeStream
        .format("parquet")       # can be "orc", "json", "csv", etc.
        .outputMode("append")    # can only be "append"…
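
A slightly fuller sketch of such a sink with the options a streaming parquet writer normally needs; the transformed_df name, the paths and the trigger interval are placeholders, not values from the original answer.

    # transformed_df is assumed to be the streaming dataframe after the
    # transformations; each trigger writes one micro-batch of parquet files.
    query = (transformed_df.writeStream
             .format("parquet")
             .outputMode("append")
             .option("path", "hdfs:///data/output/parquet")               # placeholder path
             .option("checkpointLocation", "hdfs:///data/checkpoints")    # placeholder path
             .trigger(processingTime="60 seconds")   # one micro-batch per minute
             .start())
    query.awaitTermination()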

duplicating records between date gaps within a selected time interval in a PySpark dataframe

守給你的承諾、 submitted on 2021-02-08 09:45:10
Question: I have a PySpark dataframe that keeps track of changes that occur in a product's price and status over months. This means that a new row is created only when a change occurred (in either status or price) compared to the previous month, as in the dummy data below:

    --------------------------------------------
    | product_id | status    | price | month   |
    --------------------------------------------
    | 1          | available | 5     | 2019-10 |
    | 1          | available | 8     | 2020-08 |
    --------------------------------------------…
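
A minimal sketch of one way to duplicate each record across the gap months until the next recorded change (requires Spark 2.4+ for sequence()); the dataframe name df, the column handling and the fill boundary for the newest record are assumptions for illustration.

    from pyspark.sql import functions as F, Window

    # df is assumed to hold the dummy data above, with month as a "yyyy-MM" string.
    w = Window.partitionBy("product_id").orderBy("month_date")

    filled = (df
        .withColumn("month_date", F.to_date("month", "yyyy-MM"))
        .withColumn("next_change", F.lead("month_date").over(w))
        # Fill up to the month before the next recorded change; the end of the
        # interval for the newest record (2020-12 here) is an assumption.
        .withColumn("fill_until",
                    F.coalesce(F.add_months("next_change", -1),
                               F.lit("2020-12-01").cast("date")))
        .withColumn("month_date",
                    F.explode(F.expr("sequence(month_date, fill_until, interval 1 month)")))
        .withColumn("month", F.date_format("month_date", "yyyy-MM"))
        .drop("month_date", "next_change", "fill_until"))

Each exploded month keeps the status and price of the originating row, so the 2019-10 record is repeated for 2019-11 through 2020-07 and the 2020-08 record takes over from there.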

INSERT & UPDATE MySQL table using PySpark DataFrames and JDBC

故事扮演 submitted on 2021-02-08 09:36:06
Question: I'm trying to insert and update some data in MySQL using PySpark SQL DataFrames and a JDBC connection. I've succeeded in inserting new data using SaveMode.Append. Is there a way to update the existing data and insert new data into the MySQL table from PySpark SQL? My code to insert is:

    myDataFrame.write.mode(SaveMode.Append).jdbc(JDBCurl, mySqlTable, connectionProperties)

If I change to SaveMode.Overwrite it deletes the full table and creates a new one; I'm looking for something like the "ON DUPLICATE…
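
Spark's JDBC writer itself has no upsert mode, so a common workaround is to push each partition through a plain MySQL connection and issue the upsert statement directly; a minimal sketch is below, where the table, column names, connection details and the mysql-connector-python package are assumptions for illustration.

    import mysql.connector

    def upsert_partition(rows):
        # One connection per partition; each row becomes an upsert.
        conn = mysql.connector.connect(host="mysql-host", user="user",
                                       password="secret", database="mydb")
        cursor = conn.cursor()
        sql = ("INSERT INTO my_table (id, value) VALUES (%s, %s) "
               "ON DUPLICATE KEY UPDATE value = VALUES(value)")
        for row in rows:
            cursor.execute(sql, (row["id"], row["value"]))
        conn.commit()
        cursor.close()
        conn.close()

    myDataFrame.foreachPartition(upsert_partition)

Batching the executes (cursor.executemany) or staging into a temporary table and running a single server-side upsert are common refinements when the volume is large.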