inferSchema in spark-csv package

Submitted by 隐身守侯 on 2019-12-04 05:57:24
zero323

2015-07-30

The latest version is actually 1.1.0, but it doesn't really matter since it looks like inferSchema is not included in the latest release.

2015-08-17

The latest version of the package is now 1.2.0 (published on 2015-08-06) and schema inference works as expected:

scala> df.printSchema
root
 |-- Name: string (nullable = true)
 |-- Department: string (nullable = true)
 |-- years_of_experience: integer (nullable = true)
 |-- DOB: string (nullable = true)
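
For reference, a minimal sketch of how such a DataFrame might be loaded. It assumes a Spark 1.4+ spark-shell started with `--packages com.databricks:spark-csv_2.10:1.2.0`, so that `sqlContext` already exists. The sample row and the temporary file are hypothetical and only mirror the column names shown above.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

// Hypothetical sample data; only the column names come from the schema above.
val csvText =
  "Name,Department,years_of_experience,DOB\n" +
  "Alice,Sales,5,1990-01-15\n"
val path = Files.createTempFile("employees", ".csv")
Files.write(path, csvText.getBytes(StandardCharsets.UTF_8))

// sqlContext is provided by spark-shell (assumption, see above).
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")       // first line holds the column names
  .option("inferSchema", "true")  // sample the data to guess column types
  .load(path.toString)

df.printSchema
```

Without the `inferSchema` option, every column is read as a string.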

Regarding automatic date parsing, I doubt it will ever happen, or at least not without providing additional metadata.

Even if all fields follow some date-like format, it is impossible to say whether a given field should be interpreted as a date. So the choice is between no automatic date inference and a spreadsheet-like mess, not to mention issues with time zones, for example.
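
To illustrate the ambiguity, here is a small sketch in plain Scala using `java.time` (no Spark involved). The field value is a made-up example. The same string parses cleanly under two common patterns and yields two different dates, so a parser cannot pick the right one without extra metadata.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Hypothetical field value, as it might appear in a CSV column.
val raw = "01/02/2015"

// Both patterns accept the value, but they disagree on what it means.
val dayFirst = LocalDate.parse(raw, DateTimeFormatter.ofPattern("dd/MM/yyyy"))
val monthFirst = LocalDate.parse(raw, DateTimeFormatter.ofPattern("MM/dd/yyyy"))

println(dayFirst)   // 2015-02-01
println(monthFirst) // 2015-01-02
```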

Finally, you can easily parse the date string manually:

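// Register the DataFrame so that SQL can refer to it as "df" (Spark 1.x API)
df.registerTempTable("df")

sqlContext
  .sql("SELECT *, DATE(dob) AS dob_d FROM df")
  .drop("DOB")
  .printSchema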

root
 |-- Name: string (nullable = true)
 |-- Department: string (nullable = true)
 |-- years_of_experience: integer (nullable = true)
 |-- dob_d: date (nullable = true)

so it is really not a serious issue.

2017-12-20

The built-in CSV parser, available since Spark 2.0, supports schema inference for dates and timestamps. It uses two options:

  • timestampFormat with default yyyy-MM-dd'T'HH:mm:ss.SSSXXX
  • dateFormat with default yyyy-MM-dd
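
A minimal sketch of passing these two options to the built-in reader, assuming Spark 2.x on the classpath. The local master, the app name, the sample row, and the custom timestamp pattern are all hypothetical choices for illustration. Which types are actually inferred depends on the Spark version, so no output is shown.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Hypothetical local session; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("csv-infer-sketch")
  .getOrCreate()

// Hypothetical sample data with a date-like and a timestamp-like column.
val csvText =
  "Name,DOB,last_login\n" +
  "Alice,1990-01-15,2017-12-20 10:30:00\n"
val path = Files.createTempFile("people", ".csv")
Files.write(path, csvText.getBytes(StandardCharsets.UTF_8))

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss") // overrides the default pattern
  .option("dateFormat", "yyyy-MM-dd")               // same as the default, shown for clarity
  .csv(path.toString)

df.printSchema
```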

See also How to force inferSchema for CSV to consider integers as dates (with "dateFormat" option)?
