Reading CSV into a Spark DataFrame with timestamp and date types

南笙 2021-02-18 16:42

It's CDH with Spark 1.6.

I am trying to import this hypothetical CSV into an Apache Spark DataFrame:

$ hadoop fs -cat test.csv
a,b,c,201         


        
2 Answers
  • 2021-02-18 16:50

    It's not really elegant, but you can convert from timestamp to date like this (see the last line):

    import org.apache.spark.sql.functions.expr

    // Let spark-csv infer the column types, then turn the inferred timestamp column into a date
    val textData = sqlContext.read.format("com.databricks.spark.csv")
        .option("header", "false")
        .option("delimiter", ",")
        .option("dateFormat", "yyyy-MM-dd")
        .option("inferSchema", "true")
        .option("nullValue", "null")
        .load("test.csv")
        .withColumn("C4", expr("to_date(C4)"))
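
    The same conversion can also be written with the column API instead of a SQL expression string. This is only a sketch for spark-shell on Spark 1.6; the helper name timestampToDate is made up for illustration, and it assumes the column was inferred as a timestamp:

    ```scala
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, to_date}

    // Hypothetical helper: replaces the named timestamp column with its date part.
    // withColumn with an existing column name overwrites that column.
    def timestampToDate(df: DataFrame, column: String): DataFrame =
      df.withColumn(column, to_date(col(column)))
    ```

    Applied to the DataFrame loaded above, this would be called as timestampToDate(textData, "C4").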
    
  • 2021-02-18 16:55

    With the inferSchema option, non-trivial cases will probably not produce the expected result, as you can see in InferSchema.scala:

    if (field == null || field.isEmpty || field == nullValue) {
      typeSoFar
    } else {
      typeSoFar match {
        case NullType => tryParseInteger(field)
        case IntegerType => tryParseInteger(field)
        case LongType => tryParseLong(field)
        case DoubleType => tryParseDouble(field)
        case TimestampType => tryParseTimestamp(field)
        case BooleanType => tryParseBoolean(field)
        case StringType => StringType
        case other: DataType =>
          throw new UnsupportedOperationException(s"Unexpected data type $other")
      }
    }

    Inference only ever tries to match a column against a timestamp type, never a date type, so there is no out-of-the-box solution for this case. In my experience, the easier solution is to define the schema directly with the types you need. That also keeps inference from choosing a type that fits only the rows it evaluated rather than the entire data set. Your final schema is an efficient solution.
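
    As a rough sketch of the explicit-schema approach for spark-shell on Spark 1.6 with spark-csv: the column names and the three StringType columns below are hypothetical, since the sample row in the question is cut off, and the yyyy-MM-dd pattern is carried over from the other answer rather than confirmed against the data.

    ```scala
    import org.apache.spark.sql.{DataFrame, SQLContext}
    import org.apache.spark.sql.types.{DateType, StringType, StructField, StructType}

    // Hypothetical column names and types; adjust them to the real file.
    val customSchema = StructType(Seq(
      StructField("col_a", StringType, nullable = true),
      StructField("col_b", StringType, nullable = true),
      StructField("col_c", StringType, nullable = true),
      StructField("event_date", DateType, nullable = true)
    ))

    // Reads the CSV with the declared types; inferSchema is deliberately not set.
    def readWithSchema(sqlContext: SQLContext, path: String): DataFrame =
      sqlContext.read
        .format("com.databricks.spark.csv")
        .option("header", "false")
        .option("delimiter", ",")
        .option("dateFormat", "yyyy-MM-dd") // assumed pattern, taken from the other answer
        .option("nullValue", "null")
        .schema(customSchema)
        .load(path)
    ```

    Calling readWithSchema(sqlContext, "test.csv").printSchema() should then report the last column as a date, provided the values in the file actually match the pattern.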
