How to handle null values when writing to parquet from Spark

Submitted by 眉间皱痕 on 2019-12-06 01:29:58

Question


Until recently Parquet did not support null values, which is a questionable premise. In fact, a recent version finally added that support:

https://github.com/apache/parquet-format/blob/master/LogicalTypes.md

However, it will be a long time before Spark supports that new Parquet feature, if ever. Here is the associated JIRA (closed as Won't Fix):

https://issues.apache.org/jira/browse/SPARK-10943

So what are folks doing about null column values today when writing DataFrames out to Parquet? I can only think of very ugly hacks like writing empty strings, and I have no idea what to do with numerical values to indicate null, short of putting in some sentinel value and having my code check for it (which is inconvenient and bug-prone).


Answer 1:


You misinterpreted SPARK-10943. Spark does support writing null values to numeric columns.

The problem is that null alone carries no type information at all:

scala> spark.sql("SELECT null as comments").printSchema
root
 |-- comments: null (nullable = true)

As per the comment by Michael Armbrust, all you have to do is cast:

scala> spark.sql("""SELECT CAST(null as DOUBLE) AS comments""").printSchema
root
 |-- comments: double (nullable = true)

and the result can be safely written to Parquet.




Answer 2:


I wrote a PySpark solution for this (df is a DataFrame with columns of NullType):

from pyspark.sql.types import NullType

# collect the names of all NullType columns from the schema
# (isinstance is used because the string form of a data type can differ between Spark versions)
null_cols = [field.name for field in df.schema.fields
             if isinstance(field.dataType, NullType)]

# cast null type columns to string (or whatever you'd like)
for col_name in null_cols:
    df = df.withColumn(col_name, df[col_name].cast('string'))
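
The same idea can also be expressed as a list of CAST expressions, which ties it back to the cast shown in answer 1. What follows is only a minimal sketch: the helper name cast_null_columns, the default target type, and the set of type names are illustrative assumptions and not part of either answer. It expects (column name, type name) pairs in the shape returned by df.dtypes, and it assumes an untyped column is reported as "null" (as in the printSchema output above) or as "void", which newer Spark versions appear to use.

```python
# Sketch only: helper name, default target type and type-name set are assumptions.
NULL_TYPE_NAMES = {"null", "void"}  # assumed spellings of NullType in df.dtypes


def cast_null_columns(dtypes, target="string"):
    """Build selectExpr-style expressions that cast untyped columns to `target`.

    `dtypes` is a list of (column_name, type_name) pairs, the shape of df.dtypes.
    Columns with any other type are passed through unchanged.
    """
    exprs = []
    for name, type_name in dtypes:
        # quote the identifier; a backtick inside a name is escaped by doubling it
        quoted = "`" + name.replace("`", "``") + "`"
        if type_name in NULL_TYPE_NAMES:
            exprs.append(f"CAST({quoted} AS {target}) AS {quoted}")
        else:
            exprs.append(quoted)
    return exprs


# Intended use with a real DataFrame (not executed here):
#   df = df.selectExpr(*cast_null_columns(df.dtypes, "double"))
```

Building the expressions from plain (name, type) pairs keeps the helper free of any Spark dependency, so it can be checked without a running session; the cast itself still happens in Spark when the expressions are passed to selectExpr.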


Source: https://stackoverflow.com/questions/50160682/how-to-handle-null-values-when-writing-to-parquet-from-spark
