Check for empty row within spark dataframe?

Submitted by 梦想与她 on 2021-02-19 07:55:06

Question


I am looping over several CSV files and running some checks on each of them. For one particular file I am getting a NullPointerException, and I suspect that it contains some empty rows.

So I am running the following, and for some reason it gives me an OK-looking output:

import pyspark.sql.functions as sf
from pyspark.sql.types import BooleanType

# True only when every field in the row is None, i.e. the row is completely empty
check_empty = lambda row: not any([False if k is None else True for k in row])
check_empty_udf = sf.udf(check_empty, BooleanType())

# Pack all columns into a struct and keep only the rows the UDF flags as empty
df.filter(check_empty_udf(sf.struct([col for col in df.columns]))).show()

Am I missing something within the filter function, or is it simply not possible to extract empty rows from a DataFrame this way?


Answer 1:


You could use df.dropna() to drop empty rows and then compare the counts.

Something like

df_clean = df.dropna()
num_empty_rows = df.count() - df_clean.count()
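
Note that by default dropna() removes rows containing any null value. If you only want to count rows where every column is null, you can pass how='all'. A minimal sketch, assuming df is already loaded:

# how='all' drops only rows in which every column is null;
# the default how='any' would also drop rows with a single null field
df_non_empty = df.dropna(how="all")
num_fully_empty_rows = df.count() - df_non_empty.count()
print(num_fully_empty_rows)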



Answer 2:


You could use a built-in option of the CSV reader to deal with such scenarios.

val df = spark.read
     .format("csv")
     .option("header", "true")
     .option("mode", "DROPMALFORMED") // Drop empty/malformed rows
     .load("hdfs:///path/file.csv")

Check this reference - https://docs.databricks.com/spark/latest/data-sources/read-csv.html#reading-files
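
If you are reading the file with PySpark instead of Scala, a roughly equivalent read would look like this (a sketch; the HDFS path is just a placeholder):

# DROPMALFORMED silently discards rows that cannot be parsed against the schema
df = (spark.read
      .format("csv")
      .option("header", "true")
      .option("mode", "DROPMALFORMED")
      .load("hdfs:///path/file.csv"))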



Source: https://stackoverflow.com/questions/53376449/check-for-empty-row-within-spark-dataframe
