Drop a Spark DataFrame column if all of its entries are null

轮回少年 2021-01-13 19:11

Using PySpark, how can I select/keep all columns of a DataFrame that contain at least one non-null value, or equivalently, how can I remove all columns that contain no data?

8 Answers
  •  终归单人心
    2021-01-13 19:43

    One indirect way to do so is:

    import pyspark.sql.functions as func

    # Drop every column whose values are all NaN
    for col in sdf.columns:
        if sdf.filter(func.isnan(func.col(col))).count() == sdf.select(func.col(col)).count():
            sdf = sdf.drop(col)
    

    Update:
    The code above drops columns whose values are all NaN. If you are instead looking for columns whose values are all null, use:

    import pyspark.sql.functions as func

    # Drop every column whose values are all null
    for col in sdf.columns:
        if sdf.filter(func.col(col).isNull()).count() == sdf.select(func.col(col)).count():
            sdf = sdf.drop(col)
    

    I will update my answer if I find a more optimal way :-)
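
    A more scalable variant (a minimal sketch, assuming the same DataFrame `sdf`; the helper names `total_rows`, `null_counts`, and `all_null_cols` are illustrative) counts the nulls of every column in a single aggregation job instead of running one filter-and-count job per column:

    import pyspark.sql.functions as func

    # One aggregation pass: count the null entries of every column at once.
    total_rows = sdf.count()
    null_counts = sdf.select(
        [func.count(func.when(func.col(c).isNull(), c)).alias(c) for c in sdf.columns]
    ).collect()[0].asDict()

    # A column is all-null when its null count equals the total row count.
    all_null_cols = [c for c, n in null_counts.items() if n == total_rows]
    sdf = sdf.drop(*all_null_cols)

    This triggers only two Spark jobs (one count, one aggregation) regardless of the number of columns, which tends to matter on wide DataFrames.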
