pyspark

Py4JJavaError: An error occurred while calling o57.showString. : org.apache.spark.SparkException:

那年仲夏 submitted on 2020-08-26 13:37:41
Question: I am working with PySpark connected to an AWS instance (r5d.xlarge, 4 vCPUs, 32 GiB) running a 25 GB database. When I query some tables I get the error:

Py4JJavaError: An error occurred while calling o57.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.OutOfMemoryError: GC overhead limit exceeded

I tried to find out the error for …
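
The excerpt is cut off before any resolution. "GC overhead limit exceeded" in local mode usually means the single driver JVM is running out of heap, so a common first step is to give the driver more memory and reduce the number of shuffle partitions. The snippet below is only a minimal sketch with illustrative values, not a fix taken from the original thread; note that spark.driver.memory is only honoured if it is set before the JVM starts (for example via spark-submit --driver-memory or spark-defaults.conf).

from pyspark.sql import SparkSession

# Illustrative values for a 32 GiB machine; tune to the actual workload.
spark = (
    SparkSession.builder
    .appName("memory-tuning-sketch")
    .config("spark.driver.memory", "24g")         # must be set before the JVM launches
    .config("spark.sql.shuffle.partitions", "8")  # fewer, larger partitions for a small box
    .getOrCreate()
)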

Transform nested dictionary key values to pyspark dataframe

青春壹個敷衍的年華 submitted on 2020-08-26 03:13:39
Question: I have a PySpark dataframe that looks like this: I would like to extract the nested dictionaries in the "dic" column and transform them into a PySpark dataframe, like this: Please let me know how I can achieve this. Thanks!

Answer 1:
from pyspark.sql import functions as F
df.show()  # sample dataframe
+---------+-----+
|timestmap| dic |
+---------+-----+ …
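
The answer above is truncated, so the following is only a sketch of one common approach (not necessarily the original answer), assuming "dic" holds a JSON string of nested dictionaries; the inner key names "x" and "y" are placeholders:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import MapType, StringType

spark = SparkSession.builder.appName("nested-dict-sketch").getOrCreate()
df = spark.createDataFrame(
    [("2020-08-26 03:13:39", '{"a": {"x": "1", "y": "2"}, "b": {"x": "3", "y": "4"}}')],
    ["timestmap", "dic"],
)

# Parse the JSON string into a map of maps, then explode the outer map so
# every nested dictionary becomes its own row.
schema = MapType(StringType(), MapType(StringType(), StringType()))
parsed = df.withColumn("dic", F.from_json("dic", schema))
exploded = parsed.select("timestmap", F.explode("dic").alias("key", "value"))
exploded.select(
    "timestmap",
    "key",
    F.col("value")["x"].alias("x"),
    F.col("value")["y"].alias("y"),
).show(truncate=False)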

Filter Pyspark Dataframe with udf on entire row

那年仲夏 submitted on 2020-08-25 07:33:49
Question: Is there a way to select the entire row as a column to input into a PySpark filter udf? I have a complex filtering function "my_filter" that I want to apply to the entire DataFrame:

my_filter_udf = udf(lambda r: my_filter(r), BooleanType())
new_df = df.filter(my_filter_udf(col("*")))

But col("*") throws an error because that's not a valid operation. I know that I can convert the dataframe to an RDD and then use the RDD's filter method, but I do NOT want to convert it to an RDD and then back …
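
The excerpt ends before an answer. A common workaround (shown here as a sketch, not the accepted answer) is to pack all columns into a single struct column with struct("*") and pass that to the UDF:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, struct
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.appName("row-filter-sketch").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

def my_filter(row):
    # Placeholder for the asker's complex logic; "row" behaves like a Row object.
    return row.id > 1 and row.label == "b"

my_filter_udf = udf(my_filter, BooleanType())

# struct("*") bundles every column into one struct column, so the whole row
# reaches the UDF as a single argument.
new_df = df.filter(my_filter_udf(struct("*")))
new_df.show()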

How to use foreach or foreachBatch in PySpark to write to database?

跟風遠走 submitted on 2020-08-25 07:04:12
Question: I want to do Spark Structured Streaming (Spark 2.4.x) from a Kafka source to a MariaDB with Python (PySpark). I want to use the streaming Spark dataframe, not a static or Pandas dataframe. It seems that one has to use foreach or foreachBatch, since there are no database sinks for streaming dataframes according to https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks. Here is my attempt:

from pyspark.sql import SparkSession
import pyspark.sql …
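
The asker's code is cut off above. The following is a minimal sketch of the foreachBatch pattern under assumed names (the broker address, topic "events", JDBC URL, table name, and credentials are all placeholders), and it requires the Kafka source package plus a MariaDB/MySQL JDBC driver on the classpath:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-mariadb-sketch").getOrCreate()

stream_df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
    .option("subscribe", "events")                        # placeholder topic
    .load()
    .select(col("key").cast("string"), col("value").cast("string"))
)

def write_to_mariadb(batch_df, batch_id):
    # Each micro-batch arrives as a static DataFrame, so the ordinary JDBC
    # batch writer can be reused inside foreachBatch.
    (batch_df.write
        .format("jdbc")
        .option("url", "jdbc:mariadb://localhost:3306/mydb")  # placeholder URL
        .option("dbtable", "events")                          # placeholder table
        .option("user", "user")
        .option("password", "password")
        .mode("append")
        .save())

query = stream_df.writeStream.foreachBatch(write_to_mariadb).start()
query.awaitTermination()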

How to apply the describe function after grouping a PySpark DataFrame?

大兔子大兔子 submitted on 2020-08-25 06:57:09
Question: I want to find the cleanest way to apply the describe function to a grouped DataFrame (this question could also grow into applying any DF function to a grouped DF). I tested a grouped aggregate pandas UDF with no luck. There is always the option of passing each statistic inside the agg function, but that is not the proper way. If we have a sample dataframe:

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v"))

The idea would be to do something similar to …
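
The excerpt stops before the intended output. As a baseline, the agg-based approach the asker dismisses as "not the proper way" does reproduce the statistics that describe() reports, per group; this sketch shows only that baseline, not the thread's accepted answer:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("grouped-describe-sketch").getOrCreate()
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v"))

# Same columns as describe() (count, mean, stddev, min, max), computed per id.
stats = df.groupBy("id").agg(
    F.count("v").alias("count"),
    F.mean("v").alias("mean"),
    F.stddev("v").alias("stddev"),
    F.min("v").alias("min"),
    F.max("v").alias("max"),
)
stats.show()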

accumulator in pyspark with dict as global variable

痞子三分冷 submitted on 2020-08-25 05:58:45
Question: Just for learning purposes, I tried to set a dictionary as a global variable via an accumulator. The add function works well, but when I run the code and update the dictionary in the map function, it always returns empty. Similar code using a list as a global variable works, though.

class DictParam(AccumulatorParam):
    def zero(self, value = ""):
        return dict()
    def addInPlace(self, acc1, acc2):
        acc1.update(acc2)

if __name__ == "__main__":
    sc, sqlContext = init_spark("generate_score_summary", 40)
    rdd = sc.textFile('input') …
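
The excerpt breaks off inside the asker's code. One problem already visible in the snippet is that addInPlace never returns the merged dictionary (dict.update returns None), while AccumulatorParam expects the merged value back. Below is a sketch of a working dict accumulator built around that fix; the word-tagging usage is an invented illustration, not the asker's actual job:

from pyspark import SparkContext
from pyspark.accumulators import AccumulatorParam

class DictParam(AccumulatorParam):
    def zero(self, value):
        return {}

    def addInPlace(self, acc1, acc2):
        # dict.update() returns None, so the merged dict must be returned explicitly.
        acc1.update(acc2)
        return acc1

if __name__ == "__main__":
    sc = SparkContext(appName="dict-accumulator-sketch")
    dict_acc = sc.accumulator({}, DictParam())

    def tag_line(line):
        # Workers may only add to the accumulator; reading .value is driver-only.
        for word in line.split():
            dict_acc.add({word: 1})  # update() overwrites keys rather than summing them
        return line

    sc.parallelize(["a b", "b c"]).map(tag_line).count()  # an action forces the maps to run
    print(dict_acc.value)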

How to find weighted sum on top of groupby in pyspark dataframe?

我只是一个虾纸丫 submitted on 2020-08-25 02:36:46
Question: I have a dataframe where I need to first apply a groupby and then get the weighted average, as shown in the output calculation below. What is an efficient way to do that in PySpark?

data = sc.parallelize([
    [111, 3, 0.4],
    [111, 4, 0.3],
    [222, 2, 0.2],
    [222, 3, 0.2],
    [222, 4, 0.5]]
).toDF(['id', 'val', 'weight'])
data.show()
+---+---+------+
| id|val|weight|
+---+---+------+
|111|  3|   0.4|
|111|  4|   0.3|
|222|  2|   0.2|
|222|  3|   0.2|
|222|  4|   0.5|
+---+---+------+

Output:
id   weighted_val
111  (3*0.4 + 4*0.3)/(0.4 + …
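
The expected output is cut off, but the per-id calculation shown is a weighted average, which groupBy plus a sum-ratio aggregate computes directly; the snippet below is a sketch of that approach, not necessarily the thread's answer:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("weighted-avg-sketch").getOrCreate()
data = spark.createDataFrame(
    [(111, 3, 0.4), (111, 4, 0.3), (222, 2, 0.2), (222, 3, 0.2), (222, 4, 0.5)],
    ["id", "val", "weight"],
)

# Weighted average per id: sum(val * weight) / sum(weight).
weighted = data.groupBy("id").agg(
    (F.sum(F.col("val") * F.col("weight")) / F.sum("weight")).alias("weighted_val")
)
weighted.show()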