pyspark

Py4JJavaError: An error occurred while calling o57.showString. : org.apache.spark.SparkException:

那年仲夏 submitted on 2020-08-26 13:37:41
Question: I am working with PySpark connected to an AWS instance (r5d.xlarge, 4 vCPUs, 32 GiB) running a 25 GB database. When I query some tables I get the error:

Py4JJavaError: An error occurred while calling o57.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.OutOfMemoryError: GC overhead limit exceeded

I tried to find out the error for …
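
The excerpt is cut off before any resolution. "GC overhead limit exceeded" in local mode usually means the single driver JVM is running out of heap, so a common first step is to give the driver more memory and reduce the number of shuffle partitions. The snippet below is only a minimal sketch with illustrative values, not a fix taken from the original thread; note that spark.driver.memory is only honoured if it is set before the JVM starts (for example via spark-submit --driver-memory or spark-defaults.conf).

from pyspark.sql import SparkSession

# Illustrative values for a 32 GiB machine; tune to the actual workload.
spark = (
    SparkSession.builder
    .appName("memory-tuning-sketch")
    .config("spark.driver.memory", "24g")         # must be set before the JVM launches
    .config("spark.sql.shuffle.partitions", "8")  # fewer, larger partitions for a small box
    .getOrCreate()
)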

Transform nested dictionary key values to pyspark dataframe

青春壹個敷衍的年華 submitted on 2020-08-26 03:13:39
Question: I have a PySpark dataframe that looks like this: I would like to extract the nested dictionaries in the "dic" column and transform them into a PySpark dataframe, like this: Please let me know how I can achieve this. Thanks!

Answer 1:
from pyspark.sql import functions as F
df.show()  # sample dataframe
+---------+-----+
|timestmap| dic |
+---------+-----+ …
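
The answer above is truncated, so the following is only a sketch of one common approach (not necessarily the original answer), assuming "dic" holds a JSON string of nested dictionaries; the inner key names "x" and "y" are placeholders:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import MapType, StringType

spark = SparkSession.builder.appName("nested-dict-sketch").getOrCreate()
df = spark.createDataFrame(
    [("2020-08-26 03:13:39", '{"a": {"x": "1", "y": "2"}, "b": {"x": "3", "y": "4"}}')],
    ["timestmap", "dic"],
)

# Parse the JSON string into a map of maps, then explode the outer map so
# every nested dictionary becomes its own row.
schema = MapType(StringType(), MapType(StringType(), StringType()))
parsed = df.withColumn("dic", F.from_json("dic", schema))
exploded = parsed.select("timestmap", F.explode("dic").alias("key", "value"))
exploded.select(
    "timestmap",
    "key",
    F.col("value")["x"].alias("x"),
    F.col("value")["y"].alias("y"),
).show(truncate=False)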

Filter Pyspark Dataframe with udf on entire row

那年仲夏 submitted on 2020-08-25 07:33:49
Question: Is there a way to select the entire row as a column to input into a PySpark filter udf? I have a complex filtering function "my_filter" that I want to apply to the entire DataFrame:

my_filter_udf = udf(lambda r: my_filter(r), BooleanType())
new_df = df.filter(my_filter_udf(col("*")))

But col("*") throws an error because that's not a valid operation. I know that I can convert the dataframe to an RDD and then use the RDD's filter method, but I do NOT want to convert it to an RDD and then back …
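
The excerpt ends before an answer. A common workaround (shown here as a sketch, not the accepted answer) is to pack all columns into a single struct column with struct("*") and pass that to the UDF:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, struct
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.appName("row-filter-sketch").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

def my_filter(row):
    # Placeholder for the asker's complex logic; "row" behaves like a Row object.
    return row.id > 1 and row.label == "b"

my_filter_udf = udf(my_filter, BooleanType())

# struct("*") bundles every column into one struct column, so the whole row
# reaches the UDF as a single argument.
new_df = df.filter(my_filter_udf(struct("*")))
new_df.show()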

How to use foreach or foreachBatch in PySpark to write to database?

跟風遠走 submitted on 2020-08-25 07:04:12
Question: I want to do Spark Structured Streaming (Spark 2.4.x) from a Kafka source to a MariaDB with Python (PySpark). I want to use the streaming Spark dataframe, not a static or Pandas dataframe. It seems that one has to use foreach or foreachBatch, since there are no database sinks for streaming dataframes according to https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks. Here is my attempt:

from pyspark.sql import SparkSession
import pyspark.sql …
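
The asker's code is cut off above. The following is a minimal sketch of the foreachBatch pattern under assumed names (the broker address, topic "events", JDBC URL, table name, and credentials are all placeholders), and it requires the Kafka source package plus a MariaDB/MySQL JDBC driver on the classpath:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-mariadb-sketch").getOrCreate()

stream_df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
    .option("subscribe", "events")                        # placeholder topic
    .load()
    .select(col("key").cast("string"), col("value").cast("string"))
)

def write_to_mariadb(batch_df, batch_id):
    # Each micro-batch arrives as a static DataFrame, so the ordinary JDBC
    # batch writer can be reused inside foreachBatch.
    (batch_df.write
        .format("jdbc")
        .option("url", "jdbc:mariadb://localhost:3306/mydb")  # placeholder URL
        .option("dbtable", "events")                          # placeholder table
        .option("user", "user")
        .option("password", "password")
        .mode("append")
        .save())

query = stream_df.writeStream.foreachBatch(write_to_mariadb).start()
query.awaitTermination()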

How to apply the describe function after grouping a PySpark DataFrame?

大兔子大兔子 submitted on 2020-08-25 06:57:09
Question: I want to find the cleanest way to apply the describe function to a grouped DataFrame (this question could also grow into applying any DF function to a grouped DF). I tested a grouped aggregate pandas UDF with no luck. There is always the option of passing each statistic inside the agg function, but that is not the proper way. If we have a sample dataframe:

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v"))

The idea would be to do something similar to …
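
The excerpt stops before the intended output. As a baseline, the agg-based approach the asker dismisses as "not the proper way" does reproduce the statistics that describe() reports, per group; this sketch shows only that baseline, not the thread's accepted answer:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("grouped-describe-sketch").getOrCreate()
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v"))

# Same columns as describe() (count, mean, stddev, min, max), computed per id.
stats = df.groupBy("id").agg(
    F.count("v").alias("count"),
    F.mean("v").alias("mean"),
    F.stddev("v").alias("stddev"),
    F.min("v").alias("min"),
    F.max("v").alias("max"),
)
stats.show()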

accumulator in pyspark with dict as global variable

痞子三分冷 submitted on 2020-08-25 05:58:45
Question: Just for learning purposes, I tried to set a dictionary as a global variable via an accumulator. The add function works well, but when I run the code and update the dictionary in the map function, it always returns empty. Similar code using a list as a global variable works, though.

class DictParam(AccumulatorParam):
    def zero(self, value = ""):
        return dict()
    def addInPlace(self, acc1, acc2):
        acc1.update(acc2)

if __name__ == "__main__":
    sc, sqlContext = init_spark("generate_score_summary", 40)
    rdd = sc.textFile('input') …
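
The excerpt breaks off inside the asker's code. One problem already visible in the snippet is that addInPlace never returns the merged dictionary (dict.update returns None), while AccumulatorParam expects the merged value back. Below is a sketch of a working dict accumulator built around that fix; the word-tagging usage is an invented illustration, not the asker's actual job:

from pyspark import SparkContext
from pyspark.accumulators import AccumulatorParam

class DictParam(AccumulatorParam):
    def zero(self, value):
        return {}

    def addInPlace(self, acc1, acc2):
        # dict.update() returns None, so the merged dict must be returned explicitly.
        acc1.update(acc2)
        return acc1

if __name__ == "__main__":
    sc = SparkContext(appName="dict-accumulator-sketch")
    dict_acc = sc.accumulator({}, DictParam())

    def tag_line(line):
        # Workers may only add to the accumulator; reading .value is driver-only.
        for word in line.split():
            dict_acc.add({word: 1})  # update() overwrites keys rather than summing them
        return line

    sc.parallelize(["a b", "b c"]).map(tag_line).count()  # an action forces the maps to run
    print(dict_acc.value)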

How to find weighted sum on top of groupby in pyspark dataframe?

我只是一个虾纸丫 submitted on 2020-08-25 02:36:46
Question: I have a dataframe where I need to first apply a groupby and then get the weighted average, as shown in the output calculation below. What is an efficient way to do that in PySpark?

data = sc.parallelize([
    [111, 3, 0.4],
    [111, 4, 0.3],
    [222, 2, 0.2],
    [222, 3, 0.2],
    [222, 4, 0.5]]
).toDF(['id', 'val', 'weight'])
data.show()
+---+---+------+
| id|val|weight|
+---+---+------+
|111|  3|   0.4|
|111|  4|   0.3|
|222|  2|   0.2|
|222|  3|   0.2|
|222|  4|   0.5|
+---+---+------+

Output:
id   weighted_val
111  (3*0.4 + 4*0.3)/(0.4 + …
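
The expected output is cut off, but the per-id calculation shown is a weighted average, which groupBy plus a sum-ratio aggregate computes directly; the snippet below is a sketch of that approach, not necessarily the thread's answer:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("weighted-avg-sketch").getOrCreate()
data = spark.createDataFrame(
    [(111, 3, 0.4), (111, 4, 0.3), (222, 2, 0.2), (222, 3, 0.2), (222, 4, 0.5)],
    ["id", "val", "weight"],
)

# Weighted average per id: sum(val * weight) / sum(weight).
weighted = data.groupBy("id").agg(
    (F.sum(F.col("val") * F.col("weight")) / F.sum("weight")).alias("weighted_val")
)
weighted.show()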