Question
I have a very large CSV file that has been imported as a PySpark dataframe, df. The dataframe contains many columns, including a column named ireturn. I want to compute the 0.99 and 0.01 percentiles of this column and then add two columns to df, new_col_99 and new_col_01, containing the 0.99 and 0.01 percentile respectively. I wrote the following code, which works for small dataframes, but I get errors when I apply it to my large dataframe.
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("name of the file", inferSchema=True, header=True)

# Compute each exact percentile as a single scalar on the driver
precentile_99 = df.selectExpr('percentile(val1, 0.99)').head(1)[0][0]
precentile_01 = df.selectExpr('percentile(val1, 0.01)').head(1)[0][0]

# Attach the scalars to every row as constant columns
df = df.withColumn("new_col_99", lit(precentile_99))
df = df.withColumn("new_col_01", lit(precentile_01))
As I said, it works for small dataframes but does not work for large ones.
I also replaced head with collect, and that did not work either. I get the error below:
Logging error ---
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:49850)
Traceback (most recent call last):...
Update: I have also tried the following code:
percentile = df.approxQuantile('ireturn',[0.01,0.99],0.25)
df = df.withColumn("new_col_01", lit(percentile[0]))
df = df.withColumn("new_col_99", lit(percentile[1]))
The block of code above takes about 15-20 minutes to run, but the result is wrong: the values in the ireturn column are all less than 1, yet it returns the 0.99 percentile as 6789....
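For reference, the third argument to approxQuantile is the relative error. The Spark documentation states that for N rows, target quantile p and relative error err, the returned value x satisfies floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The sketch below is plain Python, not Spark, and only evaluates that bound. The helper name approx_quantile_rank_bounds and the row count are hypothetical, chosen for illustration.

```python
import math

def approx_quantile_rank_bounds(p, relative_error, n):
    """Return the (lowest, highest) rank approxQuantile may pick for quantile p.

    Evaluates the documented guarantee
        floor((p - err) * n) <= rank(x) <= ceil((p + err) * n)
    and clamps the result to the valid range [0, n].
    """
    low = max(0, math.floor((p - relative_error) * n))
    high = min(n, math.ceil((p + relative_error) * n))
    return low, high

n = 1_000_000  # hypothetical row count, for illustration only
for err in (0.25, 0.001):
    for p in (0.01, 0.99):
        low, high = approx_quantile_rank_bounds(p, err, n)
        print(f"p={p}, relativeError={err}: rank between {low} and {high} of {n}")
```

With a relative error of 0.25, the value reported for the 0.99 quantile may sit anywhere from roughly the 74th percentile up to the maximum, so the two new columns would be very loose estimates. A smaller value such as 0.001 tightens the bound, and 0 requests exact quantiles at a potentially much higher cost.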
Source: https://stackoverflow.com/questions/54192113/how-to-add-a-column-to-a-pyspark-dataframe-which-contains-the-nth-quantile-of-an