pyspark-sql

PySpark DataFrames: filter where some value is in array column

Submitted by 匆匆过客 on 2019-12-08 11:35:24

Question: I have a DataFrame in PySpark with a nested array value in one of its fields, and I would like to filter the DataFrame to rows where the array contains a certain string. I'm not seeing how to do that. The schema looks like this:

root
 |-- name: string (nullable = true)
 |-- lastName: array (nullable = true)
 |    |-- element: string (containsNull = false)

I want to return all rows where upper(name) == 'JOHN' and where the lastName column (the array) contains 'SMITH', and the equality there …
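One common way to express this filter (a sketch, not taken from the truncated question; it assumes the column names shown in the schema) is to combine upper with array_contains:

```python
from pyspark.sql import functions as F

# df has the schema shown above: name string, lastName array<string>
filtered = df.filter(
    (F.upper(F.col("name")) == "JOHN")              # case-insensitive match on name
    & F.array_contains(F.col("lastName"), "SMITH")  # membership test on the array
)
filtered.show()
```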

Run Pyspark and Kafka in Jupyter Notebook

Submitted by 筅森魡賤 on 2019-12-08 07:45:02

Question: I can run this example in the terminal. My terminal command is:

bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 examples/src/main/python/sql/streaming/structured_kafka_wordcount.py localhost:9092 subscribe test

Now I want to run it in a Jupyter Python notebook. I tried to follow this (I could run the code in the link), but in my case it failed. The following is my code:

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages org.apache.spark:spark-sql-kafka-0 …
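The excerpt cuts off inside the PYSPARK_SUBMIT_ARGS string. A commonly cited pattern for loading the Kafka package from a notebook (a sketch under that assumption, not the asker's final code) is to set the variable before any Spark session exists and end it with pyspark-shell:

```python
import os

# Must be set before a SparkSession/SparkContext is created in the notebook.
# The trailing "pyspark-shell" tells spark-submit to start a Python shell.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 pyspark-shell"
)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-notebook").getOrCreate()

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "test")
      .load())
```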

Pyspark (1.6.1) SQL.dataframe column to vector aggregation without Hive

Submitted by 冷暖自知 on 2019-12-08 06:45:24

Question: Suppose my SQL DataFrame df is like this:

| id | v1 | v2 |
|----+----+----|
|  1 |  0 |  3 |
|  1 |  0 |  3 |
|  1 |  0 |  8 |
|  4 |  1 |  2 |

I want the output to be:

| id | v1  | list(v2) |
|----+-----+----------|
|  1 | [0] | [3,3,8]  |
|  4 | [1] | [2]      |

What is the simplest way of doing this with a SQL DataFrame without Hive? 1) Apparently, with Hive support one could simply use the collect_set() and collect_list() aggregate functions, but these functions do not work in a plain Spark SqlContext. 2) An …
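The question is truncated, but one Hive-free workaround in Spark 1.6 is to drop to the RDD API, group by id, and rebuild a DataFrame. A sketch assuming the columns shown above and a plain SQLContext:

```python
# df has columns id, v1, v2; sqlContext is a plain SQLContext (no Hive).
grouped = (df.rdd
           .map(lambda r: (r.id, (r.v1, r.v2)))
           .groupByKey()
           .map(lambda kv: (kv[0],
                            sorted(set(v1 for v1, _ in kv[1])),  # distinct v1 values
                            [v2 for _, v2 in kv[1]])))           # all v2 values

result = sqlContext.createDataFrame(grouped, ["id", "v1", "list_v2"])
result.show()
```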

PySpark streaming: window and transform

Submitted by 醉酒当歌 on 2019-12-08 05:12:59

Question: I'm trying to read data from a Spark streaming source, window it by event time, and then run a custom Python function over the windowed data (it uses non-standard Python libraries). My data frame looks something like this:

| Time                    | Value |
| 2018-01-01 12:23:50.200 | 1234  |
| 2018-01-01 12:23:51.200 | 33    |
| 2018-01-01 12:23:53.200 | 998   |
| ...                     | ...   |

The windowing seems to work nicely with Spark SQL, using something like this:

windowed_df = df.groupBy(window("Time", "10 seconds")) …
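The excerpt stops at the groupBy. One pattern often suggested for running arbitrary Python over windowed streaming output (Spark 2.4+, not necessarily what the asker settled on) is foreachBatch: window each micro-batch with Spark SQL, then hand the result to ordinary Python. A sketch assuming the Time/Value columns above and a hypothetical my_custom_function:

```python
from pyspark.sql.functions import window, collect_list

def process_batch(batch_df, epoch_id):
    # Window the micro-batch by event time, then run the custom Python code
    # (non-standard libraries are fine here, since this runs as plain Python).
    windowed = (batch_df
                .groupBy(window("Time", "10 seconds"))
                .agg(collect_list("Value").alias("values")))
    for row in windowed.collect():
        my_custom_function(row["window"], row["values"])  # hypothetical helper

query = (df.writeStream
         .outputMode("update")
         .foreachBatch(process_batch)
         .start())
```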

spark streaming: select record with max timestamp for each id in dataframe (pyspark)

Submitted by 送分小仙女□ on 2019-12-08 05:11:08

Question: I have a DataFrame with this schema:

 |-- record_id: integer (nullable = true)
 |-- Data1: string (nullable = true)
 |-- Data2: string (nullable = true)
 |-- Data3: string (nullable = true)
 |-- Time: timestamp (nullable = true)

I want to retrieve the last record in the data, grouping by record_id and keeping the greatest timestamp. So, if the data is initially this:

+----------+---------+---------+---------+-----------------------+
|record_id |Data1    |Data2    |Data3    |                   Time|
+----------+---------+----- …
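For a static DataFrame the usual answer is a window function (the streaming case adds constraints the truncated question doesn't show). A minimal sketch assuming the schema above:

```python
from pyspark.sql import Window
from pyspark.sql import functions as F

# Rank rows within each record_id by descending Time and keep the newest one.
w = Window.partitionBy("record_id").orderBy(F.col("Time").desc())

latest = (df.withColumn("rn", F.row_number().over(w))
            .filter(F.col("rn") == 1)
            .drop("rn"))
latest.show(truncate=False)
```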

How to store Array or Blob in SnappyData?

Submitted by ↘锁芯ラ on 2019-12-08 04:00:38

Question: I'm trying to create a table with two columns like below:

CREATE TABLE test (col1 INT, col2 Array<Decimal>) USING column options(BUCKETS '5');

It is created successfully, but when I try to insert data into it, it does not accept any format of array. I've tried the following queries:

insert into test1 values(1, Array(Decimal("1"), Decimal("2")));
insert into test1 values(1, Array(1,2));
insert into test1 values(1, [1,2,1]);
insert into test1 values(1, "1,2,1");
insert into test1 values(1, <1 …
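The last statement is cut off. One workaround often used for complex column types is to build the rows as a Spark DataFrame with an explicit ArrayType(DecimalType) schema and insert through the DataFrame API rather than SQL literals; a hedged sketch (SnappyData's own session or write options may be needed instead of this generic call):

```python
from decimal import Decimal
from pyspark.sql.types import (StructType, StructField, IntegerType,
                               ArrayType, DecimalType)

schema = StructType([
    StructField("col1", IntegerType()),
    StructField("col2", ArrayType(DecimalType(10, 0))),
])

rows = [(1, [Decimal("1"), Decimal("2")])]
df = spark.createDataFrame(rows, schema)

# Assumption: the target column table already exists (created as above);
# the exact SnappyData write path may differ from this generic insertInto.
df.write.insertInto("test", overwrite=False)
```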

How to use foreach sink in pyspark?

Submitted by ≯℡__Kan透↙ on 2019-12-08 03:20:36

Question: How can I use foreach in Python Spark Structured Streaming to trigger operations on the output?

query = wordCounts\
    .writeStream\
    .outputMode('update')\
    .foreach(func)\
    .start()

def func():
    ops(wordCounts)

Answer 1: TL;DR It is not possible to use the foreach method in PySpark. Quoting the official documentation of Spark Structured Streaming (highlighting mine): "The foreach operation allows arbitrary operations to be computed on the output data. As of Spark 2.1, this is available only for Scala and Java." Answer 2: …
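Note that the quoted restriction is specific to Spark 2.1; since Spark 2.4 the Python API does accept foreach (a per-row function or a writer object) as well as foreachBatch. A minimal sketch, assuming a streaming wordCounts DataFrame with word and count columns:

```python
def process_row(row):
    # Arbitrary per-row side effect, e.g. push to an external system.
    print(row["word"], row["count"])  # column names assumed for illustration

query = (wordCounts.writeStream
         .outputMode("update")
         .foreach(process_row)
         .start())
```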

pyspark.sql.utils.AnalysisException: u'Path does not exist

Submitted by ℡╲_俬逩灬. on 2019-12-08 02:54:06

Question: I am running a Spark job on Amazon EMR using standard HDFS, not S3, to store my files. I have a Hive table in hdfs://user/hive/warehouse/ but it cannot be found when my Spark job runs. I configured the Spark property spark.sql.warehouse.dir to reflect my HDFS directory, and while the YARN logs do say:

17/03/28 19:54:05 INFO SharedState: Warehouse path is 'hdfs://user/hive/warehouse/'.

later on the logs say (full log at end of page):

LogType:stdout Log Upload Time:Tue Mar …
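The log excerpt is truncated, but one thing worth noting about the path itself: in hdfs://user/hive/warehouse/ the segment after hdfs:// is parsed as a namenode host, so "user" is not treated as a directory. A hedged sketch of pointing the warehouse at an absolute HDFS path instead (three slashes, or an explicit namenode authority):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-on-hdfs")
         # "hdfs:///user/hive/warehouse" resolves against the default
         # namenode from the cluster config; "hdfs://user/..." does not.
         .config("spark.sql.warehouse.dir", "hdfs:///user/hive/warehouse")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("SHOW TABLES").show()
```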

Spark SQL slow execution with resource idle

Submitted by 一笑奈何 on 2019-12-08 02:29:09

Question: I have a Spark SQL job that used to execute in under 10 minutes and now runs for about 3 hours after a cluster migration, and I need to dig into what it is actually doing. I'm new to Spark, so please bear with me if I ask something unrelated. I increased spark.executor.memory, but no luck.

Env: Azure HDInsight, Spark 2.4, on Azure Storage.
SQL: read and join some data, then write the result to a Hive metastore table.

The spark.sql script ends with the code below:

.write.mode("overwrite").saveAsTable("default.mikemiketable")
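There is not enough of the job shown here to diagnose it, but two knobs that are commonly checked first in this situation are the shuffle partition count and the parallelism of the final write. A hedged sketch with placeholder values (result_df stands in for the joined DataFrame):

```python
# Placeholder values; tune against the actual cluster rather than copying.
spark.conf.set("spark.sql.shuffle.partitions", "200")

(result_df                      # assumed name for the joined DataFrame
 .repartition(200)              # spread the final write across executors
 .write.mode("overwrite")
 .saveAsTable("default.mikemiketable"))
```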

How to check in Python if cell value of pyspark dataframe column in UDF function is none or NaN for implementing forward fill?

Submitted by 喜欢而已 on 2019-12-08 02:12:10

Question: I am basically trying to do a forward-fill imputation. Below is the code for that:

df = spark.createDataFrame([(1, 1, None), (1, 2, 5), (1, 3, None), (1, 4, None), (1, 5, 10), (1, 6, None)], ('session', "timestamp", "id"))

PRV_RANK = 0.0

def fun(rank):
    ######## How to check if None or Nan? ###############
    if rank is None or rank is NaN:
        return PRV_RANK
    else:
        PRV_RANK = rank
        return rank

fuN = F.udf(fun, IntegerType())

df.withColumn("ffill_new", fuN(df["id"])).show()

I am getting a weird error in the log.
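Two issues the question touches on, sketched here as one possible approach rather than the accepted answer: checking for both None and NaN in plain Python, and doing the forward fill with a window function instead of a stateful UDF (a UDF cannot carry PRV_RANK across rows the way the code above tries to):

```python
import math
from pyspark.sql import Window
from pyspark.sql import functions as F

def is_missing(value):
    # None and float NaN need different checks; NaN is not equal to itself.
    return value is None or (isinstance(value, float) and math.isnan(value))

# Forward fill: carry the last non-null id forward within each session.
w = (Window.partitionBy("session")
     .orderBy("timestamp")
     .rowsBetween(Window.unboundedPreceding, 0))

df_filled = df.withColumn("ffill_new", F.last("id", ignorenulls=True).over(w))
df_filled.show()
```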