Spark data type guesser UDAF


Does Spark have something like this already built-in?

Partially. There are some tools in the Spark ecosystem that perform schema inference, such as spark-csv or pyspark-csv, and category inference (categorical vs. numerical), such as VectorIndexer.
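
For category inference, here is a minimal Scala sketch of VectorIndexer usage (the toy data and column names are made up for illustration; it assumes Spark 2.x+ with spark-mllib on the classpath):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.{VectorAssembler, VectorIndexer}

val spark = SparkSession.builder().appName("category-inference").getOrCreate()
import spark.implicits._

// Toy data: "deptCode" has few distinct values, "salary" is continuous.
val df = Seq((1, 3500.0), (2, 4200.0), (1, 3900.0), (3, 5100.0))
  .toDF("deptCode", "salary")

// VectorIndexer operates on a single vector column, so assemble one first.
val assembled = new VectorAssembler()
  .setInputCols(Array("deptCode", "salary"))
  .setOutputCol("features")
  .transform(df)

// Features with <= maxCategories distinct values are flagged as categorical
// and re-encoded as category indices; the rest are left as numerical.
val model = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(3)
  .fit(assembled)   // fitting requires a pass over the data

model.transform(assembled).show(truncate = false)
```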

So far so good. The problem is that schema inference has limited applicability, is not an easy task in general, can introduce hard-to-diagnose problems, and can be quite expensive:

  1. There are not many formats usable with Spark that require schema inference in the first place. In practice this is limited to different variants of CSV and fixed-width formatted data.
  2. Depending on the data representation, it can be impossible to determine the correct data type, or the inferred type can lead to information loss (see the sketch after this list):

    • interpreting numeric data as float or double can lead to an unacceptable loss of precision, especially when working with financial data
    • date and number formats can differ by locale
    • some common identifiers can look numeric while having an internal structure that can be lost in conversion
  3. Automatic schema inference can mask problems with the input data, and unless it is supported by additional tools that highlight possible issues, it can be dangerous. Moreover, any mistakes made during data loading and cleaning can propagate through the complete data processing pipeline.

    Arguably, we should develop a good understanding of the input data before we even start to think about its possible representation and encoding.

  4. Schema inference and/or category inference may require a full data scan and/or large lookup tables. Both can be expensive, or even infeasible, on large datasets.
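
As an illustration of point 2, here is a sketch of how inference can silently drop information (the file name and values are hypothetical):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("inference-pitfalls").getOrCreate()

// Hypothetical ids.csv:
//   zip,amount
//   01234,12345678901234567.89
val inferred = spark.read
  .option("header", "true")
  .option("inferSchema", "true")   // costs an extra pass over the data
  .csv("ids.csv")

inferred.printSchema()
// "zip" comes back as an integer type: the leading zero of "01234" is gone.
// "amount" comes back as double: digits beyond ~15-17 significant figures are lost.

// An explicit schema keeps identifiers as strings and amounts exact:
val schema = StructType(Seq(
  StructField("zip", StringType),
  StructField("amount", DecimalType(38, 2))
))
val exact = spark.read.option("header", "true").schema(schema).csv("ids.csv")
exact.printSchema()
```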

Edit:

It looks like schema inference capabilities for CSV files have been added directly to Spark SQL. See CSVInferSchema.
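
With Spark 2.0+, this is exposed through the DataFrameReader (a minimal sketch; the path is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("builtin-csv-inference").getOrCreate()

// Backed internally by CSVInferSchema; enabling inference triggers an
// additional pass over the input.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("path/to/data.csv")   // placeholder path

df.printSchema()
```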
