Better way to convert a string field into timestamp in Spark

Asked by 独厮守ぢ on 2020-11-27 16:29

I have a CSV in which one field is a datetime in a specific format. I cannot import it directly into my DataFrame because it needs to be a timestamp, so I import it as a string and

7 Answers
  •  Answered by 独厮守ぢ on 2020-11-27 17:08

    I would move the getTimestamp method you wrote into the RDD's mapPartitions and reuse a single GenericMutableRow across the rows of each partition's iterator:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.catalyst.expressions.GenericMutableRow

    val strRdd = sc.textFile("hdfs://path/to/csv-file")
    val rowRdd: RDD[Row] = strRdd.map(_.split('\t')).mapPartitions { iter =>
      new Iterator[Row] {
        // Allocate one mutable row per partition and reuse it for every record,
        // instead of creating a new Row object for each line.
        val row = new GenericMutableRow(4)
        var current: Array[String] = _
    
        def hasNext = iter.hasNext
        def next() = {
          current = iter.next()
          row(0) = current(0)
          row(1) = current(1)
          row(2) = current(2)
    
          // Parse the datetime column; store SQL NULL when parsing fails.
          val ts = getTimestamp(current(3))
          if (ts != null) {
            row.update(3, ts)
          } else {
            row.setNullAt(3)
          }
          row
        }
      }
    }
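
    The getTimestamp helper referenced above comes from the question and is not reproduced in this answer. Purely as a point of reference, here is a minimal sketch of such a helper, assuming a SimpleDateFormat pattern; the "dd/MM/yyyy HH:mm:ss" pattern and the null-on-parse-failure behaviour are assumptions, not part of the original code:

    import java.sql.Timestamp
    import java.text.SimpleDateFormat

    // Hypothetical sketch only: the real getTimestamp is defined in the question.
    // The "dd/MM/yyyy HH:mm:ss" pattern is an assumed example; returning null on
    // a parse failure matches the null check done in next() above.
    def getTimestamp(s: String): Timestamp = {
      if (s == null || s.trim.isEmpty) {
        null
      } else {
        val format = new SimpleDateFormat("dd/MM/yyyy HH:mm:ss")
        try {
          new Timestamp(format.parse(s).getTime)
        } catch {
          case _: java.text.ParseException => null
        }
      }
    }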
    

    You should still use the schema to generate the DataFrame:

    val df = sqlContext.createDataFrame(rowRdd, tableSchema)
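
    Here tableSchema is the schema already defined in the question. For readers without that context, a minimal sketch of what such a schema could look like, with made-up column names; only the fourth (timestamp) column is implied by the code above:

    import org.apache.spark.sql.types._

    // Hypothetical schema: the column names and the three StringType columns are
    // assumptions; only the fourth (timestamp) column is implied by the answer.
    val tableSchema = StructType(Seq(
      StructField("col1", StringType, nullable = true),
      StructField("col2", StringType, nullable = true),
      StructField("col3", StringType, nullable = true),
      StructField("eventTime", TimestampType, nullable = true)
    ))

    The TimestampType field is what matches the java.sql.Timestamp values produced by getTimestamp.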
    

    Examples of using GenericMutableRow inside an iterator implementation can be found in the Aggregate operator, InMemoryColumnarTableScan, ParquetTableOperations, etc.
