Using PySpark, how can I select/keep all columns of a DataFrame that contain at least one non-null value, or equivalently, drop all columns that contain no data?
One indirect way to do so is:
import pyspark.sql.functions as func

# Drop every column in which all values are NaN.
for col in sdf.columns:
    if sdf.filter(func.isnan(func.col(col))).count() == sdf.count():
        sdf = sdf.drop(col)
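For illustration, here is a small self-contained example (the toy DataFrame and column names are made up); note that func.isnan only applies to numeric columns. The loop drops column "b", which holds nothing but NaN:

from pyspark.sql import SparkSession
import pyspark.sql.functions as func

# Toy data for illustration only: column "b" is entirely NaN.
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(1.0, float("nan")), (2.0, float("nan"))], ["a", "b"])

for col in sdf.columns:
    if sdf.filter(func.isnan(func.col(col))).count() == sdf.count():
        sdf = sdf.drop(col)

print(sdf.columns)  # ['a']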
Update:
The above code drops columns in which every value is NaN. If you are instead looking for columns that are entirely null:
import pyspark.sql.functions as func

# Drop every column in which all values are null.
for col in sdf.columns:
    if sdf.filter(func.col(col).isNull()).count() == sdf.count():
        sdf = sdf.drop(col)
I will update my answer if I find a more efficient way :-)
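In the meantime, one possible optimisation (only a sketch, not benchmarked) is to count the non-null values of every column in a single aggregation job, instead of running one count per column, and then drop the columns whose count is zero:

import pyspark.sql.functions as func

# count() skips nulls, so a per-column count of 0 means the column holds no data.
non_null_counts = sdf.select(
    [func.count(func.col(c)).alias(c) for c in sdf.columns]
).collect()[0].asDict()

empty_cols = [c for c, n in non_null_counts.items() if n == 0]
if empty_cols:
    sdf = sdf.drop(*empty_cols)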