Spark toDF() / createDataFrame() type inference doesn't work as expected

Submitted by 社会主义新天地 on 2019-12-08 14:00:06

Question


df = sc.parallelize([('1','a'),('2','b'),('3','c')]).toDF(['should_be_int','should_be_str'])
df.printSchema()

produces

root
|-- should_be_int: string (nullable = true)
|-- should_be_str: string (nullable = true)

Notice that should_be_int has string datatype. According to the documentation (https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection):

Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the datatypes. Rows are constructed by passing a list of key/value pairs as kwargs to the Row class. The keys of this list define the column names of the table, and the types are inferred by sampling the whole dataset, similar to the inference that is performed on JSON files.
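
For reference, the reflection-based inference described there operates on the Python types of the values themselves, so it does produce numeric columns when the values are actual ints. A minimal sketch (my illustration, not from the original post):

from pyspark.sql import Row

# Values are Python ints here, so reflection infers a numeric type:
spark.createDataFrame([Row(should_be_int=1, should_be_str='a')]).printSchema()
# root
#  |-- should_be_int: long (nullable = true)
#  |-- should_be_str: string (nullable = true)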

Schema inference works as expected when reading delimited files, e.g.

spark.read.format('csv').option('inferSchema', True)...

but not when using the toDF() / createDataFrame() API calls.

This is on Spark 2.2.

Update: a more verbose explanation of why '1' in the example above is in single quotes (i.e. a string, not 1 of type 'int').

'1' is of type 'str'. This was done deliberately to demonstrate my point. As I said in the JIRA description, I want the same schema inference that works as expected when reading delimited files (as in the good old spark-csv module).

As an example, we read in fixed-width files using sc.binaryRecords(hdfsFile, recordLength) and then, after an rdd.map(), end up with a very wide modeling dataset in which all elements / "columns" are strings; a sketch of that pipeline follows below.
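
For illustration, a minimal sketch of that kind of pipeline (the file name, record length, and column offsets are hypothetical):

# Hypothetical layout: two 10-byte fields, so 20-byte records.
raw = sc.binaryRecords('hdfs:///data/fixed_width.dat', recordLength=20)

# Slice each binary record into fields; every resulting "column" is a str.
rows = raw.map(lambda rec: (rec[0:10].decode('utf-8').strip(),
                            rec[10:20].decode('utf-8').strip()))

df = rows.toDF(['col_a', 'col_b'])  # schema: all columns come out as string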

We want to engage the same spark-csv style of schema inference, so that Spark analyzes all the strings and maps them to actual data types.

We have other scenarios where we want the toDF() and/or createDataFrame() API calls to engage the same schema inference: read the whole dataset and determine that, as in the example above with '1', '2', '3', the "least common" type is 'int'. Again, this is exactly what the spark-csv logic does. Is this possible in Spark?

You can also think of this as analogous to Pandas' infer_dtype() call:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.api.types.infer_dtype.html?highlight=infer#pandas.api.types.infer_dtype
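
For reference, a quick illustration of what infer_dtype reports (my example; note that it inspects the Python values as-is, so it does not parse strings into numbers):

import pandas as pd

pd.api.types.infer_dtype([1, 2, 3])        # 'integer'
pd.api.types.infer_dtype(['1', '2', '3'])  # 'string'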

P.S. As I believe this is a bug (at the very least, the behavior doesn't match the documentation), I created a Spark JIRA too: https://issues.apache.org/jira/browse/SPARK-22505

P.P.S. This is how it works with spark-csv schema inference:

$ cat 123.txt
should_be_int,should_be_str
1,a
2,b
3,c
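
For completeness, reading that file with inferSchema enabled gives the expected types. A minimal sketch (my reconstruction of the truncated demonstration):

df = spark.read.option('header', True).option('inferSchema', True).csv('123.txt')
df.printSchema()
# root
#  |-- should_be_int: integer (nullable = true)
#  |-- should_be_str: string (nullable = true)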


Answer 1:


In Spark 2.3+ it is possible to solve this by routing the data through the CSV reader and letting its schema inference run over the serialized rows. The same approach is already available in Spark 2.2, but only in the Scala API.

For more on this, see SPARK-15463, SPARK-22112, and SPARK-22505.
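
A minimal PySpark sketch of that approach (my illustration, not code from the original answer): serialize each row back into a CSV line and feed the result to spark.read.csv(), which since Spark 2.3 also accepts an RDD of strings:

df = sc.parallelize([('1', 'a'), ('2', 'b'), ('3', 'c')]).toDF(['should_be_int', 'should_be_str'])

# Turn rows back into CSV lines, then let the CSV reader infer the types.
lines = df.rdd.map(lambda row: ','.join(row))
inferred = spark.read.option('inferSchema', True).csv(lines).toDF(*df.columns)
inferred.printSchema()
# root
#  |-- should_be_int: integer (nullable = true)
#  |-- should_be_str: string (nullable = true)

Note that the naive ','.join only holds up when the values contain no commas or quotes; real data would need proper CSV escaping.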



Source: https://stackoverflow.com/questions/47259621/spark-todf-createdataframe-type-inference-doesnt-work-as-expected
