Pyspark - Calculate RMSE between actuals and predictions for a groupby - AssertionError: all exprs should be Column

自闭症患者 2021-01-23 11:26

I have a function that calculates RMSE for the preds and actuals of an entire dataframe:

def calculate_rmse(df, actual_column, prediction_column):
    RMSE = F.u         


        
2 answers
  •  半阙折子戏
    2021-01-23 12:07

    I don't think you need a UDF for this. You should be able to take the difference between the two columns (df.withColumn('difference', col('true') - col('predicted'))), square that column (df.withColumn('squared_difference', pow(col('difference'), lit(2)))), and then aggregate the squared column with its average (df.select(avg('squared_difference'))). Putting it all together with an example:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F
    
    spark = SparkSession.builder.getOrCreate()
    
    # Small example frame: actual values in 'true', predictions in 'predicted'
    df = spark.createDataFrame([(0.0, 1.0),
                                (1.0, 2.0),
                                (3.0, 5.0),
                                (1.0, 8.0)], schema=['true', 'predicted'])
    
    # Per-row error, then its square
    df = df.withColumn('difference', F.col('true') - F.col('predicted'))
    df = df.withColumn('squared_difference', F.pow(F.col('difference'), F.lit(2)))
    # Aggregate the squared errors into a single value
    rmse = df.select(F.avg(F.col('squared_difference')).alias('rmse'))
    
    # show() prints the table itself and returns None, so no print() wrapper is needed
    df.show()
    rmse.show()
    

    Output:

    +----+---------+----------+------------------+
    |true|predicted|difference|squared_difference|
    +----+---------+----------+------------------+
    | 0.0|      1.0|      -1.0|               1.0|
    | 1.0|      2.0|      -1.0|               1.0|
    | 3.0|      5.0|      -2.0|               4.0|
    | 1.0|      8.0|      -7.0|              49.0|
    +----+---------+----------+------------------+
    
    +-----+
    | rmse|
    +-----+
    |13.75|
    +-----+
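
The 13.75 in this table is the mean of the squared differences, so a square root is still needed to turn it into an RMSE. As a quick sanity check, here is a minimal plain-Python sketch of the same arithmetic. It needs no Spark; the four rows are copied from the example DataFrame, and the variable names are illustrative only.

```python
import math

# (true, predicted) pairs copied from the example DataFrame
rows = [(0.0, 1.0), (1.0, 2.0), (3.0, 5.0), (1.0, 8.0)]

# Square each per-row difference: 1.0, 1.0, 4.0 and 49.0
squared_differences = [(true - predicted) ** 2 for true, predicted in rows]

# Mean of the squared differences (this is the 13.75 in the table)
mse = sum(squared_differences) / len(squared_differences)

# Root of that mean is the RMSE
rmse = math.sqrt(mse)

print(mse)
print(rmse)
```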
    

    Hope this helps!

    Edit

    Sorry, I forgot to take the square root of the result. The line that computes rmse becomes:

    rmse = df.select(F.sqrt(F.avg(F.col('squared_difference'))).alias('rmse'))
    

    and the output becomes:

    +------------------+
    |              rmse|
    +------------------+
    |3.7080992435478315|
    +------------------+
    
