Converting a query from SQL to PySpark

Submitted by 泪湿孤枕 on 2021-01-07 01:32:56

Question


I am trying to convert the following SQL query into pyspark:

SELECT COUNT( CASE WHEN COALESCE(data.pred,0) != 0 AND COALESCE(data.val,0) != 0 
    AND (ABS(COALESCE(data.pred,0) - COALESCE(data.val,0)) / COALESCE(data.val,0)) > 0.1
    THEN data.pred END) / COUNT(*) AS Result

The code I have in PySpark right now is this:

from pyspark.sql.functions import abs, coalesce, count, lit

aux_1 = data.select(
    count(
        (coalesce(data["pred"], lit(0)) != 0) &
        (coalesce(data["val"], lit(0)) != 0) &
        (abs(
             coalesce(data["pred"], lit(0)) -
             coalesce(data["val"], lit(0))
            ) / coalesce(data["val"], lit(0)) > 0.1
        )
    )
)
aux_2 = aux_1.select(aux_1.column_name.cast("float"))

aux_3 = aux_2.head()[0]

Deviation = (aux_3 / data.count())*100

However, this simply returns the number of rows in the "data" DataFrame, which I know isn't correct. I am very new to PySpark; can anyone help me solve this?
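For reference, here is a minimal sketch that reproduces the behaviour (the DataFrame contents are made up for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import abs, coalesce, count, lit

spark = SparkSession.builder.getOrCreate()

# Made-up sample data: only the first row deviates from val by more than 10%.
data = spark.createDataFrame(
    [(100.0, 50.0), (100.0, 99.0), (None, 80.0)],
    ["pred", "val"],
)

cond = (
    (coalesce(data["pred"], lit(0)) != 0) &
    (coalesce(data["val"], lit(0)) != 0) &
    (abs(coalesce(data["pred"], lit(0)) - coalesce(data["val"], lit(0)))
     / coalesce(data["val"], lit(0)) > 0.1)
)

# Prints 3 (the total row count) even though only one row matches.
data.select(count(cond)).show()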


Answer 1:


The problem is that count() counts non-null values, not true ones, and because of the coalesce calls your boolean condition is never null, so it counts every row. Filter on the condition instead, collect each count into an integer, and divide the numbers in Python:

from pyspark.sql.functions import abs, coalesce, lit

Result = data.filter(
    (coalesce(data["pred"], lit(0)) != 0) &
    (coalesce(data["val"], lit(0)) != 0) &
    (abs(
         coalesce(data["pred"], lit(0)) -
         coalesce(data["val"], lit(0))
        ) / coalesce(data["val"], lit(0)) > 0.1
    )
).count() / data.count()
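As a side note (a sketch, not part of the original answer): the same ratio can be computed in a single pass by averaging the condition cast to an integer, since the mean of a 0/1 column equals the fraction of rows that match:

from pyspark.sql.functions import abs, avg, coalesce, lit

cond = (
    (coalesce(data["pred"], lit(0)) != 0) &
    (coalesce(data["val"], lit(0)) != 0) &
    (abs(coalesce(data["pred"], lit(0)) - coalesce(data["val"], lit(0)))
     / coalesce(data["val"], lit(0)) > 0.1)
)

# The coalesce calls make the condition non-null on every row, so avg sees
# all rows and returns count(matching) / count(*) in a single Spark job.
Result = data.select(avg(cond.cast("int")).alias("Result")).head()[0]

This avoids scanning the data twice (once for the filtered count, once for the total).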


Source: https://stackoverflow.com/questions/65478361/converting-query-from-sql-to-pyspark
