Spark data type guesser UDAF


Does Spark have something like this already built-in?

Partially. There are some tools in the Spark ecosystem that perform schema inference, such as spark-csv or pyspark-csv, and category inference (categorical vs. numerical), such as VectorIndexer.
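
For category inference, here is a minimal Scala sketch of VectorIndexer usage (the toy data and column names are made up for illustration; it assumes Spark 2.x+ with spark-mllib on the classpath):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.{VectorAssembler, VectorIndexer}

val spark = SparkSession.builder().appName("category-inference").getOrCreate()
import spark.implicits._

// Toy data: "deptCode" has few distinct values, "salary" is continuous.
val df = Seq((1, 3500.0), (2, 4200.0), (1, 3900.0), (3, 5100.0))
  .toDF("deptCode", "salary")

// VectorIndexer operates on a single vector column, so assemble one first.
val assembled = new VectorAssembler()
  .setInputCols(Array("deptCode", "salary"))
  .setOutputCol("features")
  .transform(df)

// Features with <= maxCategories distinct values are flagged as categorical
// and re-encoded as category indices; the rest are left as numerical.
val model = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(3)
  .fit(assembled)   // fitting requires a pass over the data

model.transform(assembled).show(truncate = false)
```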

So far so good. The problem is that schema inference has limited applicability, is not an easy task in general, can introduce hard-to-diagnose problems, and can be quite expensive:

  1. There are not many formats usable with Spark that require schema inference in the first place. In practice this is limited to different variants of CSV and fixed-width formatted data.
  2. Depending on the data representation, it can be impossible to determine the correct data type, or the inferred type can lead to information loss (see the sketch after this list):

    • interpreting numeric data as float or double can lead to an unacceptable loss of precision, especially when working with financial data
    • date and number formats can differ by locale
    • some common identifiers can look numeric while having an internal structure that can be lost in conversion
  3. Automatic schema inference can mask problems with the input data, and unless it is supported by additional tools that highlight possible issues, it can be dangerous. Moreover, any mistakes made during data loading and cleaning can propagate through the complete data processing pipeline.

    Arguably, we should develop a good understanding of the input data before we even start to think about its possible representation and encoding.

  4. Schema inference and/or category inference may require a full data scan and/or large lookup tables. Both can be expensive, or even infeasible, on large datasets.
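
As an illustration of point 2, here is a sketch of how inference can silently drop information (the file name and values are hypothetical):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("inference-pitfalls").getOrCreate()

// Hypothetical ids.csv:
//   zip,amount
//   01234,12345678901234567.89
val inferred = spark.read
  .option("header", "true")
  .option("inferSchema", "true")   // costs an extra pass over the data
  .csv("ids.csv")

inferred.printSchema()
// "zip" comes back as an integer type: the leading zero of "01234" is gone.
// "amount" comes back as double: digits beyond ~15-17 significant figures are lost.

// An explicit schema keeps identifiers as strings and amounts exact:
val schema = StructType(Seq(
  StructField("zip", StringType),
  StructField("amount", DecimalType(38, 2))
))
val exact = spark.read.option("header", "true").schema(schema).csv("ids.csv")
exact.printSchema()
```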

Edit:

It looks like schema inference capabilities for CSV files have been added directly to Spark SQL. See CSVInferSchema.
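
With Spark 2.0+, this is exposed through the DataFrameReader (a minimal sketch; the path is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("builtin-csv-inference").getOrCreate()

// Backed internally by CSVInferSchema; enabling inference triggers an
// additional pass over the input.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("path/to/data.csv")   // placeholder path

df.printSchema()
```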
