Scala & Spark: Cast multiple columns at once

Backend · Unresolved · 4 answers · 883 views
小鲜肉 asked 2020-12-30 15:05

Since the VectorAssembler crashes if a passed column has any type other than NumericType or BooleanType, and I'm dealing with a lot of T…

4 Answers
  • 2020-12-30 15:22

    Based on the comments (thanks!) I came up with the following code (no error handling implemented):

    def castAllTypedColumnsTo(df: DataFrame,
       sourceType: DataType, targetType: DataType): DataFrame = {

          val columnsToBeCasted = df.schema
             .filter(s => s.dataType == sourceType)

          //if (columnsToBeCasted.length > 0) {
          //   println(s"Found ${columnsToBeCasted.length} columns " +
          //      s"(${columnsToBeCasted.map(s => s.name).mkString(",")})" +
          //      s" - casting to ${targetType.typeName.capitalize}Type")
          //}

          // castColumnTo is the single-column helper from the question;
          // note: pass targetType here, not a hard-coded LongType
          columnsToBeCasted.foldLeft(df) { (foldedDf, col) =>
             castColumnTo(foldedDf, col.name, targetType)
          }
    }
    

    Thanks for the inspiring comments. foldLeft replaces the for loop that would otherwise mutate a `var` DataFrame on each iteration.
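    The foldLeft pattern can be illustrated without Spark. Below is a minimal sketch in plain Scala, with a hypothetical `Table` type standing in for `DataFrame` and a hypothetical `retype` helper standing in for `castColumnTo`; it shows how foldLeft threads the accumulated result through each step instead of mutating a `var`:

    ```scala
    // Hypothetical stand-in for a DataFrame: just column name -> type name.
    case class Table(columns: Map[String, String])

    // Hypothetical stand-in for castColumnTo: retype one column.
    def retype(t: Table, name: String, newType: String): Table =
      Table(t.columns.updated(name, newType))

    val start = Table(Map("a" -> "IntegerType", "b" -> "IntegerType", "c" -> "StringType"))

    // Select the columns matching the source type, as the answer does with df.schema.filter.
    val toCast = start.columns.collect { case (n, "IntegerType") => n }.toList

    // Same effect as: var t = start; for (n <- toCast) t = retype(t, n, "LongType")
    val result = toCast.foldLeft(start)((acc, n) => retype(acc, n, "LongType"))
    ```

    Each step receives the table produced by the previous step, so no mutable variable is needed.
    
    
    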

  • 2020-12-30 15:27
    from pyspark.sql.types import StringType, FloatType, IntegerType

    # note: "DRPOPFORMED" in the original is a typo for "DROPMALFORMED"
    FastDf = spark.read.csv("Something.csv", header=False, mode="DROPMALFORMED")
    OldTypes = [field.dataType for field in FastDf.schema.fields]
    NewTypes = [StringType(), FloatType(), FloatType(), IntegerType()]
    OldColnames = FastDf.columns
    NewColnames = ['S_tring', 'F_loat', 'F_loat2', 'I_nteger']

    # select each column by ordinal, cast it, and give it its new name
    FastDfSchema = FastDf.select(*(
        FastDf[colnumber].cast(NewTypes[colnumber]).alias(NewColnames[colnumber])
        for colnumber in range(len(NewTypes))
    ))
    

    I know it's PySpark, but the logic might be handy.

  • 2020-12-30 15:35

    I was translating a Scala program to Python and found a neat answer to this problem. The columns are named V1–V28, Time, Amount, and Class. (I am not a Scala pro.) The solution looks like this:

    // cast all the columns to Double type
    val df = raw.select(
      ((1 to 28).map(i => "V" + i) ++ Array("Time", "Amount", "Class"))
        .map(s => col(s).cast("Double")): _*
    )
    

    The link: https://github.com/intel-analytics/analytics-zoo/blob/master/apps/fraudDetection/Fraud%20Detction.ipynb
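    The column-name sequence built inline above can be checked in plain Scala (no Spark needed); the `names` value here is exactly the list of columns the `select` receives before each one is cast:

    ```scala
    // "V1" through "V28", followed by the three fixed column names
    val names = (1 to 28).map(i => "V" + i) ++ Seq("Time", "Amount", "Class")
    ```

    Generating the numbered names with a range keeps the 28 repeated columns out of the source text.
    
    
    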

  • 2020-12-30 15:38

    Casting all matching columns with an idiomatic approach in Scala:

    def castAllTypedColumnsTo(df: DataFrame, sourceType: DataType, targetType: DataType): DataFrame = {
       df.schema.filter(_.dataType == sourceType).foldLeft(df) {
          case (acc, col) => acc.withColumn(col.name, df(col.name).cast(targetType))
       }
    }
    