Question
df = sc.parallelize([('1','a'),('2','b'),('3','c')]).toDF(['should_be_int','should_be_str'])
df.printSchema()
produces
root
|-- should_be_int: string (nullable = true)
|-- should_be_str: string (nullable = true)
Notice that should_be_int has a string datatype, even though, according to the documentation:
https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection
Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the datatypes. Rows are constructed by passing a list of key/value pairs as kwargs to the Row class. The keys of this list define the column names of the table, and the types are inferred by sampling the whole dataset, similar to the inference that is performed on JSON files.
Schema inference works as expected when reading delimited files, e.g.
spark.read.format('csv').option('inferSchema', True)...
but not when using the toDF() / createDataFrame() API calls.
Spark 2.2.
Update: a more verbose explanation of why '1' in the example above is in single quotes (a string, not 1 of type 'int').
'1' is of type 'str', deliberately, to demonstrate my point. As I said in the JIRA description, I want the same schema inference that works as expected when reading delimited files (as in the good old spark-csv module).
As an example, we read in fixed-width files using sc.binaryRecords(hdfsFile, recordLength), and after an rdd.map() we basically get a very wide modeling dataset in which every element / "column" is a string. We want to engage the same spark-csv style of schema inference, so that Spark analyzes all the strings in each column and comes up with the actual data types, as sketched below.
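A minimal sketch of that pipeline, where the file path, record length and field widths are made-up assumptions:

rdd = sc.binaryRecords('/data/fixed_width.dat', recordLength=20)  # hypothetical file, 20-byte records
# slice each record into two 10-byte fields and strip padding
rows = rdd.map(lambda rec: (rec[0:10].decode('ascii').strip(),
                            rec[10:20].decode('ascii').strip()))
df = rows.toDF(['col1', 'col2'])  # every column ends up as string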
We have other scenarios where we want the toDF() and/or createDataFrame() API calls to engage the same schema inference: read the whole dataset and see that, as in the example above, the "least common" type of '1', '2', '3' is 'int' - again, exactly what the spark-csv logic does (a sketch of that kind of inference follows). Is this possible in Spark?
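Outside Spark, the per-column, whole-dataset inference being asked for could look like this plain-Python sketch (the function name and type-promotion rules are my own illustration, not Spark's):

def least_common_type(values):
    # values: every string observed in one column
    def kind(v):
        try:
            int(v)
            return 'int'
        except ValueError:
            pass
        try:
            float(v)
            return 'double'
        except ValueError:
            return 'string'
    kinds = {kind(v) for v in values}
    if kinds == {'int'}:
        return 'int'
    if kinds <= {'int', 'double'}:
        return 'double'
    return 'string'

least_common_type(['1', '2', '3'])    # 'int'
least_common_type(['1', '2.5', '3'])  # 'double'
least_common_type(['1', 'a', '3'])    # 'string'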
You can also think of this as Pandas' infer_dtype() call:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.api.types.infer_dtype.html?highlight=infer#pandas.api.types.infer_dtype
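Note that infer_dtype only reports the type of the values it is given; a quick illustration:

from pandas.api.types import infer_dtype
infer_dtype(['1', '2', '3'])   # 'string' - the values are str
infer_dtype([1, 2, 3])         # 'integer'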
P.S. As I believe this is a bug (at least the behavior doesn't match the documentation), I created a Spark JIRA too - https://issues.apache.org/jira/browse/SPARK-22505
P.P.S. This is how it works with spark-csv schema inference:
$ cat 123.txt
should_be_int,should_be_str
1,a
2,b
3,c
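Reading that file with inferSchema enabled (the built-in CSV reader is the successor of spark-csv) should infer the numeric column; a sketch of the expected result:

spark.read.option('header', True).option('inferSchema', True).csv('123.txt').printSchema()
root
 |-- should_be_int: integer (nullable = true)
 |-- should_be_str: string (nullable = true)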
Answer 1:
In Spark 2.3+ it is possible to solve this by feeding the data as CSV strings back through the CSV reader, which applies the same schema inference (a sketch follows). It is already available in Spark 2.2, but only in the Scala API. For more on this, read SPARK-15463, SPARK-22112 and SPARK-22505.
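A minimal sketch of that approach in PySpark 2.3+, assuming SPARK-22112's support for spark.read.csv() on an RDD of strings:

rdd = sc.parallelize(['1,a', '2,b', '3,c'])  # CSV rows as plain strings
df = spark.read.option('inferSchema', True).csv(rdd).toDF('should_be_int', 'should_be_str')
df.printSchema()
root
 |-- should_be_int: integer (nullable = true)
 |-- should_be_str: string (nullable = true)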
Source: https://stackoverflow.com/questions/47259621/spark-todf-createdataframe-type-inference-doesnt-work-as-expected