pyspark sql query : count distinct values with conditions

Question

I have a dataframe as below :

+-----------+------------+-------------+-----------+
| id_doctor | id_patient | consumption | type_drug |
+-----------+------------+-------------+-----------+
| d1        | p1         |        12.0 | bhd       |
| d1        | p2         |        10.0 | lsd       |
| d1        | p1         |         6.0 | bhd       |
| d1        | p1         |        14.0 | carboxyl  |
| d2        | p1         |        12.0 | bhd       |
| d2        | p1         |        13.0 | bhd       |
| d2        | p2         |        12.0 | lsd       |
| d2        | p1         |         6.0 | bhd       |
| d2        | p2         |        12.0 | bhd       |
+-----------+------------+-------------+-----------+

I want to count distinct patients that take bhd with a consumption < 16.0 for each doctor.

I tried the following query but it doesn't work :

dataframe.groupBy(col("id_doctor"))
         .agg(
         countDistinct(col("id_patient")).where(col("type_drug") == "bhd" & col("consumption") < 16.0)
         )

any help ?

thanks!

Answer 1:

Just use where on your dataframe. Note that this version drops every id_doctor whose count is 0:

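from pyspark.sql.functions import col, countDistinct

# Each comparison needs its own parentheses: & binds tighter than == and <
dataframe.where(
    (col("type_drug") == "bhd") & (col("consumption") < 16.0)
).groupBy(
    col("id_doctor")
).agg(
    countDistinct(col("id_patient"))
)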

With the following syntax, you keep all the doctors, including those whose count is 0:

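import pyspark.sql.functions as F
from pyspark.sql.functions import col, countDistinct

# "fg" holds id_patient for matching rows and null otherwise;
# countDistinct ignores nulls, so non-matching doctors get 0
dataframe.withColumn(
    "fg",
    F.when(
        (col("type_drug") == "bhd")
        & (col("consumption") < 16.0),
        col("id_patient")
    )
).groupBy(
    col("id_doctor")
).agg(
    countDistinct(col("fg"))
)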
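To see what the two versions above should return, here is a minimal plain-Python sketch (no Spark involved) that mimics both approaches on the sample data from the question. The row for doctor "d3" is hypothetical and was added only to make the difference visible; the function names are illustrative as well.

```python
from collections import defaultdict

# Sample rows from the question, plus one hypothetical doctor "d3"
# (not in the original data) who has no qualifying bhd row.
rows = [
    ("d1", "p1", 12.0, "bhd"),
    ("d1", "p2", 10.0, "lsd"),
    ("d1", "p1", 6.0, "bhd"),
    ("d1", "p1", 14.0, "carboxyl"),
    ("d2", "p1", 12.0, "bhd"),
    ("d2", "p1", 13.0, "bhd"),
    ("d2", "p2", 12.0, "lsd"),
    ("d2", "p1", 6.0, "bhd"),
    ("d2", "p2", 12.0, "bhd"),
    ("d3", "p9", 5.0, "lsd"),  # hypothetical row
]


def matches(consumption, type_drug):
    """The condition from the question: bhd with consumption < 16.0."""
    return type_drug == "bhd" and consumption < 16.0


def filter_then_group(rows):
    """Mimics where(...).groupBy(...): doctors with no match disappear."""
    patients = defaultdict(set)
    for doctor, patient, consumption, drug in rows:
        if matches(consumption, drug):
            patients[doctor].add(patient)
    return {doctor: len(found) for doctor, found in patients.items()}


def group_with_conditional_value(rows):
    """Mimics countDistinct(when(cond, id_patient)): every doctor gets a
    group, and non-matching rows contribute nothing to the count."""
    patients = defaultdict(set)
    for doctor, patient, consumption, drug in rows:
        bucket = patients[doctor]  # creates the group even without a match
        if matches(consumption, drug):
            bucket.add(patient)
    return {doctor: len(found) for doctor, found in patients.items()}


print(filter_then_group(rows))             # {'d1': 1, 'd2': 2}
print(group_with_conditional_value(rows))  # {'d1': 1, 'd2': 2, 'd3': 0}
```

On the original nine rows both approaches give 1 for d1 and 2 for d2; they only differ for a doctor with no matching row.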

Answer 2:

Another solution in PySpark without adding another column:

import pyspark.sql.functions as F
dataframe.groupBy('id_doctor').agg(F.countDistinct(F.when((F.col("type_drug") == "bhd") & (F.col("consumption") < 16.0), F.col('id_patient')).otherwise(None)))
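The same conditional aggregation can also be written as a SQL query, with COUNT(DISTINCT CASE WHEN ... END). The sketch below runs such a query with Python's built-in sqlite3 module purely as a stand-in so it can be executed without a Spark session. The table name prescriptions and the alias n_patients are assumptions, not part of the question.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for a SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE prescriptions ("
    "id_doctor TEXT, id_patient TEXT, consumption REAL, type_drug TEXT)"
)
conn.executemany(
    "INSERT INTO prescriptions VALUES (?, ?, ?, ?)",
    [
        ("d1", "p1", 12.0, "bhd"),
        ("d1", "p2", 10.0, "lsd"),
        ("d1", "p1", 6.0, "bhd"),
        ("d1", "p1", 14.0, "carboxyl"),
        ("d2", "p1", 12.0, "bhd"),
        ("d2", "p1", 13.0, "bhd"),
        ("d2", "p2", 12.0, "lsd"),
        ("d2", "p1", 6.0, "bhd"),
        ("d2", "p2", 12.0, "bhd"),
    ],
)

# CASE yields NULL for non-matching rows, and COUNT(DISTINCT ...)
# skips NULLs, so every doctor keeps a row in the result.
query = """
    SELECT id_doctor,
           COUNT(DISTINCT CASE
                              WHEN type_drug = 'bhd' AND consumption < 16.0
                              THEN id_patient
                          END) AS n_patients
    FROM prescriptions
    GROUP BY id_doctor
    ORDER BY id_doctor
"""
result = conn.execute(query).fetchall()
print(result)  # [('d1', 1), ('d2', 2)]
```

In PySpark, a query of this shape would presumably be run through spark.sql after registering the dataframe as a temporary view; that part is not shown here.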

Answer 3:

And a solution without adding an extra column (Scala):

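import org.apache.spark.sql.functions.{col, countDistinct, when}

dataframe
    .groupBy("id_doctor")
    .agg(
        // when needs a value argument: id_patient for matching rows, null otherwise
        countDistinct(when(col("type_drug") === "bhd" && col("consumption") < 16.0, col("id_patient")))
    )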

Source: https://stackoverflow.com/questions/54004970/pyspark-sql-query-count-distinct-values-with-conditions

Tags

sql

pyspark