apache-spark-sql

Apply window function over multiple columns

痴心易碎 submitted on 2020-12-08 07:22:32
Question: I would like to apply a window function (specifically a moving average) over all columns of a dataframe. I can do it this way:

    from pyspark.sql import SparkSession, functions as func
    df = ...
    df.select([func.avg(df[col]).over(windowSpec).alias(col) for col in df.columns])

but I'm afraid this isn't very efficient. Is there a better way to do it?

Answer 1: An alternative which may be better is to create a new df where you group by the columns in the window function and apply the average on the remaining …
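For reference, here is a minimal, self-contained sketch of the list-comprehension approach described in the question. The window specification (a 3-row moving average ordered by an assumed "id" column) and the column names are hypothetical, since the original windowSpec is not shown.

```python
# Minimal sketch of the question's approach, with an assumed window spec.
from pyspark.sql import SparkSession, functions as func
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 10.0, 100.0), (2, 20.0, 200.0), (3, 30.0, 300.0), (4, 40.0, 400.0)],
    ["id", "a", "b"],
)

# Moving average over the current row and the two preceding rows.
window_spec = Window.orderBy("id").rowsBetween(-2, 0)

value_cols = [c for c in df.columns if c != "id"]
result = df.select(
    "id",
    *[func.avg(func.col(c)).over(window_spec).alias(c) for c in value_cols],
)
result.show()
```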

Spark DataFrame serialized as invalid json

一笑奈何 submitted on 2020-12-06 05:52:51
Question: TL;DR: When I dump a Spark DataFrame as JSON, I always end up with something like

    {"key1": "v11", "key2": "v21"}
    {"key1": "v12", "key2": "v22"}
    {"key1": "v13", "key2": "v23"}

which is not a single valid JSON document. I can manually edit the dumped file to get something I can parse:

    [
      {"key1": "v11", "key2": "v21"},
      {"key1": "v12", "key2": "v22"},
      {"key1": "v13", "key2": "v23"}
    ]

but I'm pretty sure I'm missing something that would let me avoid this manual edit. I just don't know what. More details: I have a …
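Spark's JSON writer emits JSON Lines (one object per line) rather than a JSON array. The sketch below, which is not taken from the truncated question or its answers, shows one way to produce a single JSON array by collecting the rows to the driver; it assumes the data fits in driver memory, and the output path is illustrative.

```python
# Sketch: turn Spark's JSON Lines output into one JSON array on the driver.
import json

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("v11", "v21"), ("v12", "v22"), ("v13", "v23")],
    ["key1", "key2"],
)

# df.toJSON() yields one JSON object string per row (JSON Lines);
# parse each line and re-serialize the whole list as a JSON array.
records = [json.loads(line) for line in df.toJSON().collect()]
with open("/tmp/output.json", "w") as f:  # hypothetical output path
    json.dump(records, f)
```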

Need to Know Partitioning Details in Dataframe Spark

喜夏-厌秋 submitted on 2020-12-06 04:37:49
Question: I am trying to read from a DB2 database based on a query. The result set of the query is about 20-40 million records. The DataFrame is partitioned on an integer column. My question is: once the data is loaded, how can I check how many records were created per partition? Basically, I want to check whether data skew is happening. How can I check the record counts per partition?

Answer 1: You can, for instance, map over the partitions and determine their sizes:

    val rdd = sc …
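Since the Scala answer is cut off, here is a hedged PySpark sketch of the same idea: count the rows in each partition to spot skew. The DataFrame is a toy stand-in for the DB2 read described in the question.

```python
# Sketch: inspect per-partition row counts on a toy DataFrame.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 1000).repartition(8, "id")

# Option 1: count rows per partition via the RDD API.
counts = (
    df.rdd.mapPartitionsWithIndex(lambda idx, it: [(idx, sum(1 for _ in it))])
    .collect()
)
print(counts)

# Option 2: group by the built-in partition id function.
df.groupBy(F.spark_partition_id().alias("partition_id")).count().show()
```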

spark dataframe conversion to rdd takes a long time

佐手、 submitted on 2020-12-06 01:37:43
Question: I'm reading a JSON file of a social network into Spark. From it I get a data frame, which I explode to get pairs. This process works perfectly. Later I want to convert this to an RDD (for use with GraphX), but the RDD creation takes a very long time.

    val social_network = spark.read.json(my/path) // 200MB
    val exploded_network = social_network.
      withColumn("follower", explode($"followers")).
      withColumn("id_follower", ($"follower").cast("long")).
      withColumn("id_account", ($"account").cast("long")). …
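For context, here is a PySpark sketch of the transformation chain in the question's Scala code, using an in-memory toy dataset instead of the real 200MB JSON file. The "account" and "followers" column names mirror the question; everything else is illustrative. Selecting only the needed columns before calling .rdd keeps the rows that have to be converted as small as possible.

```python
# Sketch of the explode/cast pipeline, then a narrow DataFrame-to-RDD step.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.getOrCreate()

social_network = spark.createDataFrame(
    [(1, [10, 11]), (2, [12])],
    ["account", "followers"],
)

exploded_network = (
    social_network
    .withColumn("follower", explode(col("followers")))
    .withColumn("id_follower", col("follower").cast("long"))
    .withColumn("id_account", col("account").cast("long"))
)

# Convert only the two edge columns to an RDD of tuples (e.g. for graph use).
edges = exploded_network.select("id_account", "id_follower").rdd.map(tuple)
print(edges.collect())
```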

String to Date migration from Spark 2.0 to 3.0 gives Fail to recognize 'EEE MMM dd HH:mm:ss zzz yyyy' pattern in the DateTimeFormatter

陌路散爱 submitted on 2020-12-03 07:37:16
Question: I have a date string from a source in the format 'Fri May 24 00:00:00 BST 2019' that I want to convert to a date and store in my dataframe as '2019-05-24', using code like the example below, which works for me under Spark 2.0:

    from pyspark.sql.functions import to_date, unix_timestamp, from_unixtime
    df = spark.createDataFrame([("Fri May 24 00:00:00 BST 2019",)], ['date_str'])
    df2 = df.select('date_str', to_date(from_unixtime(unix_timestamp('date_str', 'EEE MMM dd HH:mm:ss zzz yyyy'))).alias('date'))
    df2 …
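The snippet above is the Spark 2.x code. Under Spark 3.x, a commonly suggested workaround for this particular "Fail to recognize ... pattern" error is to switch the datetime parser back to its legacy (pre-3.0) behaviour via spark.sql.legacy.timeParserPolicy. The sketch below is illustrative and not taken from the (truncated) question or its answers; the session setup is assumed.

```python
# Sketch: restore the legacy datetime parser so the 2.x pattern still parses.
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, unix_timestamp, from_unixtime

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

df = spark.createDataFrame([("Fri May 24 00:00:00 BST 2019",)], ["date_str"])
df2 = df.select(
    "date_str",
    to_date(
        from_unixtime(unix_timestamp("date_str", "EEE MMM dd HH:mm:ss zzz yyyy"))
    ).alias("date"),
)
df2.show()
```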

Select columns in Pyspark Dataframe

寵の児 submitted on 2020-11-30 06:15:18
Question: I am looking for a way to select columns of my dataframe in PySpark. For the first row, I know I can use df.first(), but I am not sure about columns, given that they do not have column names. I have 5 columns and want to loop through each one of them.

    +--+---+---+---+---+---+---+
    |_1| _2| _3| _4| _5| _6| _7|
    +--+---+---+---+---+---+---+
    |1 |0.0|0.0|0.0|1.0|0.0|0.0|
    |2 |1.0|0.0|0.0|0.0|0.0|0.0|
    |3 |0.0|0.0|1.0|0.0|0.0|0.0|

Answer 1: Try something like this:

    df.select([c for c in df.columns if c in ['_2',' …
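Here is a small runnable sketch of the answer's pattern: select columns by their auto-generated names (_1, _2, ...) with a list comprehension, then loop over them. The data mirrors the layout shown in the question; the particular column subset is only an example.

```python
# Sketch: select auto-named columns with a list comprehension, then iterate.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0),
     (2, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0),
     (3, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0)]
)  # with no schema given, columns default to _1 .. _7

wanted = ["_2", "_3", "_4"]
subset = df.select([c for c in df.columns if c in wanted])
subset.show()

# Loop over each selected column, e.g. to inspect its first value.
for c in subset.columns:
    print(c, subset.select(c).first()[0])
```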

Covid Death Predictions gone wrong [closed]

邮差的信 submitted on 2020-11-30 02:01:04
Question: Closed. This question needs debugging details and is not currently accepting answers.

I'm attempting to write code that will predict fatalities in Toronto due to Covid-19... with no luck. I'm sure this has an easy fix that I'm overlooking, but I'm too new to Spark to know what that is... does anyone have any insight on making this code runnable? The data set is here …