Question
Using PySpark, how can I select/keep all columns of a DataFrame that contain a non-null value, or, equivalently, remove all columns that contain no data?
Edited: as per Suresh's request,
for column in media.columns:
    if media.select(media[column]).distinct().count() == 1:
        media = media.drop(media[column])
Here I assumed that if the distinct count is one, the column should contain only NaN. But I want to check whether that single value actually is NaN, and if there is any other built-in Spark function for this, let me know.
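In other words, what I want to express per column is roughly this sketch (the helper name is a placeholder of mine):

import pyspark.sql.functions as F

# F.count() ignores nulls, so a zero count means the column holds no data at all.
def is_all_null(df, column):
    return df.select(F.count(F.col(column)).alias("n")).first()["n"] == 0

# media = media.drop(*[c for c in media.columns if is_all_null(media, c)])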
Answer 1:
This is a function I have in my pipeline to remove null columns. Hope it helps!
# Function to drop the empty columns of a DF
def dropNullColumns(df):
    # A set of all the null-like values you can encounter
    null_set = {"none", "null", "nan"}
    # Iterate over each column in the DF
    for col in df.columns:
        # Get the distinct values of the column
        unique_vals = df.select(col).distinct().collect()
        # Drop the column only if its single distinct value is none/nan/null
        if len(unique_vals) == 1 and str(unique_vals[0][0]).lower() in null_set:
            print("Dropping " + col + " because of all null values.")
            df = df.drop(col)
    return df
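A quick usage sketch (the sample DataFrame and the `spark` session name are assumptions, not part of the answer):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Hypothetical example: column "b" contains only nulls and should be dropped.
schema = StructType([StructField("a", IntegerType(), True),
                     StructField("b", StringType(), True)])
df = spark.createDataFrame([(1, None), (2, None)], schema)
df = dropNullColumns(df)   # prints: Dropping b because of all null values.
df.show()                  # only column "a" remains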
Answer 2:
I tried it my way; let me know if it works.
Say I have a dataframe as below:
>>> df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 2|null|
|null| 3|null|
| 5|null|null|
+----+----+----+
>>> import pyspark.sql.functions as F
>>> df1 = df.agg(*[F.count(c).alias(c) for c in df.columns])
>>> df1.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 2| 2| 0|
+----+----+----+
>>> nonNull_cols = [c for c in df1.columns if df1[[c]].first()[c] > 0]
>>> df = df.select(*nonNull_cols)
>>> df.show()
+----+----+
|col1|col2|
+----+----+
| 1| 2|
|null| 3|
| 5|null|
+----+----+
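The same idea can also be collapsed into a single aggregation pass, since F.count() skips nulls. A minimal sketch of that variant (my own restatement, reusing df and F from above):

# Per-column non-null counts in one job; keep only columns with at least one value.
counts = df.agg(*[F.count(c).alias(c) for c in df.columns]).first().asDict()
df = df.select(*[c for c in df.columns if counts[c] > 0])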
Answer 3:
One indirect way to do so is:
import pyspark.sql.functions as func

for col in sdf.columns:
    if sdf.filter(func.isnan(func.col(col))).count() == sdf.select(func.col(col)).count():
        sdf = sdf.drop(col)
Update:
The above code drops columns in which every value is NaN. If you are looking for all-null columns instead, then:
import pyspark.sql.functions as func

for col in sdf.columns:
    if sdf.filter(func.col(col).isNull()).count() == sdf.select(func.col(col)).count():
        sdf = sdf.drop(col)
I will update my answer if I find a more optimal way :-)
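A hedged sketch of my own (not from the answer) that checks null and NaN together in a single aggregation pass, reusing sdf from above:

import pyspark.sql.functions as func

total = sdf.count()

# func.isnan() is only defined for float/double columns, so apply the NaN
# check only where it is valid and fall back to isNull() elsewhere.
def empty_count(c, dtype):
    cond = func.col(c).isNull()
    if dtype in ("float", "double"):
        cond = cond | func.isnan(func.col(c))
    return func.count(func.when(cond, func.lit(1))).alias(c)

empty_counts = sdf.agg(*[empty_count(c, t) for c, t in sdf.dtypes]).first().asDict()
sdf = sdf.select(*[c for c in sdf.columns if empty_counts[c] < total])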
Answer 4:
Or just
from pyspark.sql.functions import col

for c in df.columns:
    if df.filter(col(c).isNotNull()).count() == 0:
        df = df.drop(c)
Answer 5:
For me it worked in a slightly different way than @Suresh's answer:

import pyspark.sql.functions as func

nonNull_cols = [c for c in original_df.columns if original_df.filter(func.col(c).isNotNull()).count() > 0]
new_df = original_df.select(*nonNull_cols)
Source: https://stackoverflow.com/questions/45629781/drop-if-all-entries-in-a-spark-dataframes-specific-column-is-null