Using PySpark, how can I select/keep all columns of a DataFrame that contain at least one non-null value, or equivalently, drop all columns that contain no data?
One indirect way to do so is:
import pyspark.sql.functions as func

# Drop every column in which all values are NaN.
for col in sdf.columns:
    if sdf.filter(func.isnan(func.col(col))).count() == sdf.count():
        sdf = sdf.drop(col)
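For illustration, here is a small self-contained example (the toy DataFrame and column names are made up); note that func.isnan only applies to numeric columns. The loop drops column "b", which holds nothing but NaN:

from pyspark.sql import SparkSession
import pyspark.sql.functions as func

# Toy data for illustration only: column "b" is entirely NaN.
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(1.0, float("nan")), (2.0, float("nan"))], ["a", "b"])

for col in sdf.columns:
    if sdf.filter(func.isnan(func.col(col))).count() == sdf.count():
        sdf = sdf.drop(col)

print(sdf.columns)  # ['a']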
Update:
The above code drops columns in which every value is NaN. If you are instead looking for columns that are entirely null:
import pyspark.sql.functions as func

# Drop every column in which all values are null.
for col in sdf.columns:
    if sdf.filter(func.col(col).isNull()).count() == sdf.count():
        sdf = sdf.drop(col)
I will update my answer if I find a more efficient way :-)
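In the meantime, one possible optimisation (only a sketch, not benchmarked) is to count the non-null values of every column in a single aggregation job, instead of running one count per column, and then drop the columns whose count is zero:

import pyspark.sql.functions as func

# count() skips nulls, so a per-column count of 0 means the column holds no data.
non_null_counts = sdf.select(
    [func.count(func.col(c)).alias(c) for c in sdf.columns]
).collect()[0].asDict()

empty_cols = [c for c, n in non_null_counts.items() if n == 0]
if empty_cols:
    sdf = sdf.drop(*empty_cols)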