apache-spark-sql

Apply window function over multiple columns

痴心易碎 submitted on 2020-12-08 07:22:32
Question: I would like to apply a window function (specifically a moving average) over all columns of a dataframe. I can do it this way:

    from pyspark.sql import SparkSession, functions as func
    df = ...
    df.select([func.avg(df[col]).over(windowSpec).alias(col) for col in df.columns])

but I'm afraid this isn't very efficient. Is there a better way to do it?

Answer 1: An alternative which may be better is to create a new df where you group by the columns in the window function and apply the average on the remaining …
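For reference, here is a minimal, self-contained sketch of the list-comprehension approach described in the question. The window specification (a 3-row moving average ordered by an assumed "id" column) and the column names are hypothetical, since the original windowSpec is not shown.

```python
# Minimal sketch of the question's approach, with an assumed window spec.
from pyspark.sql import SparkSession, functions as func
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 10.0, 100.0), (2, 20.0, 200.0), (3, 30.0, 300.0), (4, 40.0, 400.0)],
    ["id", "a", "b"],
)

# Moving average over the current row and the two preceding rows.
window_spec = Window.orderBy("id").rowsBetween(-2, 0)

value_cols = [c for c in df.columns if c != "id"]
result = df.select(
    "id",
    *[func.avg(func.col(c)).over(window_spec).alias(c) for c in value_cols],
)
result.show()
```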

Spark DataFrame serialized as invalid json

一笑奈何 submitted on 2020-12-06 05:52:51
Question: TL;DR: When I dump a Spark DataFrame as JSON, I always end up with something like

    {"key1": "v11", "key2": "v21"}
    {"key1": "v12", "key2": "v22"}
    {"key1": "v13", "key2": "v23"}

which is not a single valid JSON document. I can manually edit the dumped file to get something I can parse:

    [
      {"key1": "v11", "key2": "v21"},
      {"key1": "v12", "key2": "v22"},
      {"key1": "v13", "key2": "v23"}
    ]

but I'm pretty sure I'm missing something that would let me avoid this manual edit. I just don't know what. More details: I have a …
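Spark's JSON writer emits JSON Lines (one object per line) rather than a JSON array. The sketch below, which is not taken from the truncated question or its answers, shows one way to produce a single JSON array by collecting the rows to the driver; it assumes the data fits in driver memory, and the output path is illustrative.

```python
# Sketch: turn Spark's JSON Lines output into one JSON array on the driver.
import json

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("v11", "v21"), ("v12", "v22"), ("v13", "v23")],
    ["key1", "key2"],
)

# df.toJSON() yields one JSON object string per row (JSON Lines);
# parse each line and re-serialize the whole list as a JSON array.
records = [json.loads(line) for line in df.toJSON().collect()]
with open("/tmp/output.json", "w") as f:  # hypothetical output path
    json.dump(records, f)
```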

Need to Know Partitioning Details in Dataframe Spark

喜夏-厌秋 submitted on 2020-12-06 04:37:49
Question: I am trying to read from a DB2 database based on a query. The result set of the query is about 20-40 million records. The DataFrame is partitioned on an integer column. My question is: once the data is loaded, how can I check how many records were created per partition? Basically, I want to check whether data skew is happening. How can I check the record counts per partition?

Answer 1: You can, for instance, map over the partitions and determine their sizes:

    val rdd = sc …
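Since the Scala answer is cut off, here is a hedged PySpark sketch of the same idea: count the rows in each partition to spot skew. The DataFrame is a toy stand-in for the DB2 read described in the question.

```python
# Sketch: inspect per-partition row counts on a toy DataFrame.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 1000).repartition(8, "id")

# Option 1: count rows per partition via the RDD API.
counts = (
    df.rdd.mapPartitionsWithIndex(lambda idx, it: [(idx, sum(1 for _ in it))])
    .collect()
)
print(counts)

# Option 2: group by the built-in partition id function.
df.groupBy(F.spark_partition_id().alias("partition_id")).count().show()
```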

spark dataframe conversion to rdd takes a long time

佐手、 submitted on 2020-12-06 01:37:43
Question: I'm reading a JSON file of a social network into Spark. From it I get a data frame, which I explode to get pairs. This process works perfectly. Later I want to convert this to an RDD (for use with GraphX), but the RDD creation takes a very long time.

    val social_network = spark.read.json(my/path) // 200MB
    val exploded_network = social_network.
      withColumn("follower", explode($"followers")).
      withColumn("id_follower", ($"follower").cast("long")).
      withColumn("id_account", ($"account").cast("long")). …
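For context, here is a PySpark sketch of the transformation chain in the question's Scala code, using an in-memory toy dataset instead of the real 200MB JSON file. The "account" and "followers" column names mirror the question; everything else is illustrative. Selecting only the needed columns before calling .rdd keeps the rows that have to be converted as small as possible.

```python
# Sketch of the explode/cast pipeline, then a narrow DataFrame-to-RDD step.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.getOrCreate()

social_network = spark.createDataFrame(
    [(1, [10, 11]), (2, [12])],
    ["account", "followers"],
)

exploded_network = (
    social_network
    .withColumn("follower", explode(col("followers")))
    .withColumn("id_follower", col("follower").cast("long"))
    .withColumn("id_account", col("account").cast("long"))
)

# Convert only the two edge columns to an RDD of tuples (e.g. for graph use).
edges = exploded_network.select("id_account", "id_follower").rdd.map(tuple)
print(edges.collect())
```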

String to Date migration from Spark 2.0 to 3.0 gives Fail to recognize 'EEE MMM dd HH:mm:ss zzz yyyy' pattern in the DateTimeFormatter

陌路散爱 submitted on 2020-12-03 07:37:16
Question: I have a date string from a source in the format 'Fri May 24 00:00:00 BST 2019' that I want to convert to a date and store in my dataframe as '2019-05-24', using code like the example below, which works for me under Spark 2.0:

    from pyspark.sql.functions import to_date, unix_timestamp, from_unixtime
    df = spark.createDataFrame([("Fri May 24 00:00:00 BST 2019",)], ['date_str'])
    df2 = df.select('date_str', to_date(from_unixtime(unix_timestamp('date_str', 'EEE MMM dd HH:mm:ss zzz yyyy'))).alias('date'))
    df2 …
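The snippet above is the Spark 2.x code. Under Spark 3.x, a commonly suggested workaround for this particular "Fail to recognize ... pattern" error is to switch the datetime parser back to its legacy (pre-3.0) behaviour via spark.sql.legacy.timeParserPolicy. The sketch below is illustrative and not taken from the (truncated) question or its answers; the session setup is assumed.

```python
# Sketch: restore the legacy datetime parser so the 2.x pattern still parses.
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, unix_timestamp, from_unixtime

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

df = spark.createDataFrame([("Fri May 24 00:00:00 BST 2019",)], ["date_str"])
df2 = df.select(
    "date_str",
    to_date(
        from_unixtime(unix_timestamp("date_str", "EEE MMM dd HH:mm:ss zzz yyyy"))
    ).alias("date"),
)
df2.show()
```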

Select columns in Pyspark Dataframe

寵の児 submitted on 2020-11-30 06:15:18
Question: I am looking for a way to select columns of my dataframe in PySpark. For the first row, I know I can use df.first(), but I am not sure about columns, given that they do not have column names. I have 5 columns and want to loop through each one of them.

    +--+---+---+---+---+---+---+
    |_1| _2| _3| _4| _5| _6| _7|
    +--+---+---+---+---+---+---+
    |1 |0.0|0.0|0.0|1.0|0.0|0.0|
    |2 |1.0|0.0|0.0|0.0|0.0|0.0|
    |3 |0.0|0.0|1.0|0.0|0.0|0.0|

Answer 1: Try something like this:

    df.select([c for c in df.columns if c in ['_2',' …
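Here is a small runnable sketch of the answer's pattern: select columns by their auto-generated names (_1, _2, ...) with a list comprehension, then loop over them. The data mirrors the layout shown in the question; the particular column subset is only an example.

```python
# Sketch: select auto-named columns with a list comprehension, then iterate.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0),
     (2, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0),
     (3, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0)]
)  # with no schema given, columns default to _1 .. _7

wanted = ["_2", "_3", "_4"]
subset = df.select([c for c in df.columns if c in wanted])
subset.show()

# Loop over each selected column, e.g. to inspect its first value.
for c in subset.columns:
    print(c, subset.select(c).first()[0])
```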

Covid Death Predictions gone wrong [closed]

邮差的信 submitted on 2020-11-30 02:01:04
Question: Closed. This question needs debugging details and is not currently accepting answers.

I'm attempting to write code that will predict fatalities in Toronto due to Covid-19... with no luck. I'm sure this has an easy fix that I'm overlooking, but I'm too new to Spark to know what that is... does anyone have any insight on making this code runnable? The data set is here …