pyspark-sql

PySpark DataFrames: filter where some value is in array column

Submitted by 匆匆过客 on 2019-12-08 11:35:24

Question: I have a DataFrame in PySpark with a nested array value in one of its fields, and I would like to filter the DataFrame to rows where the array contains a certain string. I'm not seeing how to do that. The schema looks like this:

root
 |-- name: string (nullable = true)
 |-- lastName: array (nullable = true)
 |    |-- element: string (containsNull = false)

I want to return all rows where upper(name) == 'JOHN' and where the lastName column (the array) contains 'SMITH', and the equality there …
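One common way to express this filter (a sketch, not taken from the truncated question; it assumes the column names shown in the schema) is to combine upper with array_contains:

```python
from pyspark.sql import functions as F

# df has the schema shown above: name string, lastName array<string>
filtered = df.filter(
    (F.upper(F.col("name")) == "JOHN")              # case-insensitive match on name
    & F.array_contains(F.col("lastName"), "SMITH")  # membership test on the array
)
filtered.show()
```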

Run Pyspark and Kafka in Jupyter Notebook

Submitted by 筅森魡賤 on 2019-12-08 07:45:02

Question: I can run this example in the terminal. My terminal command is:

bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 examples/src/main/python/sql/streaming/structured_kafka_wordcount.py localhost:9092 subscribe test

Now I want to run it in a Jupyter Python notebook. I tried to follow this (I could run the code in the link), but in my case it failed. The following is my code:

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages org.apache.spark:spark-sql-kafka-0 …
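The excerpt cuts off inside the PYSPARK_SUBMIT_ARGS string. A commonly cited pattern for loading the Kafka package from a notebook (a sketch under that assumption, not the asker's final code) is to set the variable before any Spark session exists and end it with pyspark-shell:

```python
import os

# Must be set before a SparkSession/SparkContext is created in the notebook.
# The trailing "pyspark-shell" tells spark-submit to start a Python shell.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 pyspark-shell"
)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-notebook").getOrCreate()

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "test")
      .load())
```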

Pyspark (1.6.1) SQL.dataframe column to vector aggregation without Hive

Submitted by 冷暖自知 on 2019-12-08 06:45:24

Question: Suppose my SQL DataFrame df is like this:

| id | v1 | v2 |
|----+----+----|
|  1 |  0 |  3 |
|  1 |  0 |  3 |
|  1 |  0 |  8 |
|  4 |  1 |  2 |

I want the output to be:

| id | v1  | list(v2) |
|----+-----+----------|
|  1 | [0] | [3,3,8]  |
|  4 | [1] | [2]      |

What is the simplest way of doing this with a SQL DataFrame without Hive? 1) Apparently, with Hive support one could simply use the collect_set() and collect_list() aggregate functions, but these functions do not work in a plain Spark SqlContext. 2) An …
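The question is truncated, but one Hive-free workaround in Spark 1.6 is to drop to the RDD API, group by id, and rebuild a DataFrame. A sketch assuming the columns shown above and a plain SQLContext:

```python
# df has columns id, v1, v2; sqlContext is a plain SQLContext (no Hive).
grouped = (df.rdd
           .map(lambda r: (r.id, (r.v1, r.v2)))
           .groupByKey()
           .map(lambda kv: (kv[0],
                            sorted(set(v1 for v1, _ in kv[1])),  # distinct v1 values
                            [v2 for _, v2 in kv[1]])))           # all v2 values

result = sqlContext.createDataFrame(grouped, ["id", "v1", "list_v2"])
result.show()
```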

PySpark streaming: window and transform

Submitted by 醉酒当歌 on 2019-12-08 05:12:59

Question: I'm trying to read data from a Spark streaming source, window it by event time, and then run a custom Python function over the windowed data (it uses non-standard Python libraries). My data frame looks something like this:

| Time                    | Value |
| 2018-01-01 12:23:50.200 | 1234  |
| 2018-01-01 12:23:51.200 | 33    |
| 2018-01-01 12:23:53.200 | 998   |
| ...                     | ...   |

The windowing seems to work nicely with Spark SQL, using something like this:

windowed_df = df.groupBy(window("Time", "10 seconds")) …
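The excerpt stops at the groupBy. One pattern often suggested for running arbitrary Python over windowed streaming output (Spark 2.4+, not necessarily what the asker settled on) is foreachBatch: window each micro-batch with Spark SQL, then hand the result to ordinary Python. A sketch assuming the Time/Value columns above and a hypothetical my_custom_function:

```python
from pyspark.sql.functions import window, collect_list

def process_batch(batch_df, epoch_id):
    # Window the micro-batch by event time, then run the custom Python code
    # (non-standard libraries are fine here, since this runs as plain Python).
    windowed = (batch_df
                .groupBy(window("Time", "10 seconds"))
                .agg(collect_list("Value").alias("values")))
    for row in windowed.collect():
        my_custom_function(row["window"], row["values"])  # hypothetical helper

query = (df.writeStream
         .outputMode("update")
         .foreachBatch(process_batch)
         .start())
```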

spark streaming: select record with max timestamp for each id in dataframe (pyspark)

Submitted by 送分小仙女□ on 2019-12-08 05:11:08

Question: I have a DataFrame with this schema:

 |-- record_id: integer (nullable = true)
 |-- Data1: string (nullable = true)
 |-- Data2: string (nullable = true)
 |-- Data3: string (nullable = true)
 |-- Time: timestamp (nullable = true)

I want to retrieve the last record in the data, grouping by record_id and keeping the greatest timestamp. So, if the data is initially this:

+----------+---------+---------+---------+-----------------------+
|record_id |Data1    |Data2    |Data3    |                   Time|
+----------+---------+----- …
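For a static DataFrame the usual answer is a window function (the streaming case adds constraints the truncated question doesn't show). A minimal sketch assuming the schema above:

```python
from pyspark.sql import Window
from pyspark.sql import functions as F

# Rank rows within each record_id by descending Time and keep the newest one.
w = Window.partitionBy("record_id").orderBy(F.col("Time").desc())

latest = (df.withColumn("rn", F.row_number().over(w))
            .filter(F.col("rn") == 1)
            .drop("rn"))
latest.show(truncate=False)
```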

How to store Array or Blob in SnappyData?

Submitted by ↘锁芯ラ on 2019-12-08 04:00:38

Question: I'm trying to create a table with two columns like below:

CREATE TABLE test (col1 INT, col2 Array<Decimal>) USING column options(BUCKETS '5');

It is created successfully, but when I try to insert data into it, it does not accept any format of array. I've tried the following queries:

insert into test1 values(1, Array(Decimal("1"), Decimal("2")));
insert into test1 values(1, Array(1,2));
insert into test1 values(1, [1,2,1]);
insert into test1 values(1, "1,2,1");
insert into test1 values(1, <1 …
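The last statement is cut off. One workaround often used for complex column types is to build the rows as a Spark DataFrame with an explicit ArrayType(DecimalType) schema and insert through the DataFrame API rather than SQL literals; a hedged sketch (SnappyData's own session or write options may be needed instead of this generic call):

```python
from decimal import Decimal
from pyspark.sql.types import (StructType, StructField, IntegerType,
                               ArrayType, DecimalType)

schema = StructType([
    StructField("col1", IntegerType()),
    StructField("col2", ArrayType(DecimalType(10, 0))),
])

rows = [(1, [Decimal("1"), Decimal("2")])]
df = spark.createDataFrame(rows, schema)

# Assumption: the target column table already exists (created as above);
# the exact SnappyData write path may differ from this generic insertInto.
df.write.insertInto("test", overwrite=False)
```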

How to use foreach sink in pyspark?

Submitted by ≯℡__Kan透↙ on 2019-12-08 03:20:36

Question: How can I use foreach in Python Spark Structured Streaming to trigger operations on the output?

query = wordCounts\
    .writeStream\
    .outputMode('update')\
    .foreach(func)\
    .start()

def func():
    ops(wordCounts)

Answer 1: TL;DR It is not possible to use the foreach method in PySpark. Quoting the official documentation of Spark Structured Streaming (highlighting mine): "The foreach operation allows arbitrary operations to be computed on the output data. As of Spark 2.1, this is available only for Scala and Java." Answer 2: …
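Note that the quoted restriction is specific to Spark 2.1; since Spark 2.4 the Python API does accept foreach (a per-row function or a writer object) as well as foreachBatch. A minimal sketch, assuming a streaming wordCounts DataFrame with word and count columns:

```python
def process_row(row):
    # Arbitrary per-row side effect, e.g. push to an external system.
    print(row["word"], row["count"])  # column names assumed for illustration

query = (wordCounts.writeStream
         .outputMode("update")
         .foreach(process_row)
         .start())
```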

pyspark.sql.utils.AnalysisException: u'Path does not exist

Submitted by ℡╲_俬逩灬. on 2019-12-08 02:54:06

Question: I am running a Spark job on Amazon EMR using standard HDFS, not S3, to store my files. I have a Hive table in hdfs://user/hive/warehouse/ but it cannot be found when my Spark job runs. I configured the Spark property spark.sql.warehouse.dir to reflect my HDFS directory, and while the YARN logs do say:

17/03/28 19:54:05 INFO SharedState: Warehouse path is 'hdfs://user/hive/warehouse/'.

later on the logs say (full log at end of page):

LogType:stdout Log Upload Time:Tue Mar …
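The log excerpt is truncated, but one thing worth noting about the path itself: in hdfs://user/hive/warehouse/ the segment after hdfs:// is parsed as a namenode host, so "user" is not treated as a directory. A hedged sketch of pointing the warehouse at an absolute HDFS path instead (three slashes, or an explicit namenode authority):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-on-hdfs")
         # "hdfs:///user/hive/warehouse" resolves against the default
         # namenode from the cluster config; "hdfs://user/..." does not.
         .config("spark.sql.warehouse.dir", "hdfs:///user/hive/warehouse")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("SHOW TABLES").show()
```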

Spark SQL slow execution with resource idle

Submitted by 一笑奈何 on 2019-12-08 02:29:09

Question: I have a Spark SQL job that used to execute in under 10 minutes and now runs for about 3 hours after a cluster migration, and I need to dig into what it is actually doing. I'm new to Spark, so please bear with me if I ask something unrelated. I increased spark.executor.memory, but no luck.

Env: Azure HDInsight, Spark 2.4, on Azure Storage.
SQL: read and join some data, then write the result to a Hive metastore table.

The spark.sql script ends with the code below:

.write.mode("overwrite").saveAsTable("default.mikemiketable")
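There is not enough of the job shown here to diagnose it, but two knobs that are commonly checked first in this situation are the shuffle partition count and the parallelism of the final write. A hedged sketch with placeholder values (result_df stands in for the joined DataFrame):

```python
# Placeholder values; tune against the actual cluster rather than copying.
spark.conf.set("spark.sql.shuffle.partitions", "200")

(result_df                      # assumed name for the joined DataFrame
 .repartition(200)              # spread the final write across executors
 .write.mode("overwrite")
 .saveAsTable("default.mikemiketable"))
```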

How to check in Python if cell value of pyspark dataframe column in UDF function is none or NaN for implementing forward fill?

Submitted by 喜欢而已 on 2019-12-08 02:12:10

Question: I am basically trying to do a forward-fill imputation. Below is the code for that:

df = spark.createDataFrame([(1, 1, None), (1, 2, 5), (1, 3, None), (1, 4, None), (1, 5, 10), (1, 6, None)], ('session', "timestamp", "id"))

PRV_RANK = 0.0

def fun(rank):
    ######## How to check if None or Nan? ###############
    if rank is None or rank is NaN:
        return PRV_RANK
    else:
        PRV_RANK = rank
        return rank

fuN = F.udf(fun, IntegerType())

df.withColumn("ffill_new", fuN(df["id"])).show()

I am getting a weird error in the log.
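Two issues the question touches on, sketched here as one possible approach rather than the accepted answer: checking for both None and NaN in plain Python, and doing the forward fill with a window function instead of a stateful UDF (a UDF cannot carry PRV_RANK across rows the way the code above tries to):

```python
import math
from pyspark.sql import Window
from pyspark.sql import functions as F

def is_missing(value):
    # None and float NaN need different checks; NaN is not equal to itself.
    return value is None or (isinstance(value, float) and math.isnan(value))

# Forward fill: carry the last non-null id forward within each session.
w = (Window.partitionBy("session")
     .orderBy("timestamp")
     .rowsBetween(Window.unboundedPreceding, 0))

df_filled = df.withColumn("ffill_new", F.last("id", ignorenulls=True).over(w))
df_filled.show()
```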