Column name with dot spark

天命终不由人 2020-12-10 10:48

I am trying to take columns from a DataFrame and convert them to an RDD[Vector].

The problem is that I have columns with a "dot" in their n

2 Answers
  • 2020-12-10 11:11

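    The problem here is the VectorAssembler implementation, not the columns per se. You can, for example, skip the header so that Spark assigns its default column names. The snippet below does this by treating lines that start with a double quote as comments, which drops the header when it begins with a quoted name: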

    val df = spark.read.format("csv")
      .options(Map("inferSchema" -> "true", "comment" -> "\""))
      .load(path)
    
    new VectorAssembler()
      .setInputCols(df.columns)
      .setOutputCol("vs")
      .transform(df)
    

    or rename the columns before passing them to VectorAssembler:

    val renamed =  df.toDF(df.columns.map(_.replace(".", "_")): _*)
    
    new VectorAssembler()
      .setInputCols(renamed.columns)
      .setOutputCol("vs")
      .transform(renamed)
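    
    One caveat with renaming: replacing "." with "_" can make two different columns end up with the same name (for example "a.b" and "a_b"). A minimal sketch of a guard against that, in plain Scala; `ColumnNames` and `sanitize` are hypothetical names for illustration and not part of Spark:

    ```scala
    // Hypothetical helper (not a Spark API): replace dots in column names and
    // fail fast if two distinct names collapse into the same sanitized name.
    object ColumnNames {
      def sanitize(columns: Seq[String], replacement: String = "_"): Seq[String] = {
        val renamed = columns.map(_.replace(".", replacement))
        // Collect every sanitized name that occurs more than once.
        val collisions = renamed.groupBy(identity).collect {
          case (name, group) if group.size > 1 => name
        }
        require(
          collisions.isEmpty,
          s"Renaming produces duplicate column names: ${collisions.mkString(", ")}"
        )
        renamed
      }
    }
    ```

    Assuming a DataFrame `df` as above, the result could be passed on with `df.toDF(ColumnNames.sanitize(df.columns.toSeq): _*)`.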
    

    Finally, the best approach is to provide the schema explicitly:

    import org.apache.spark.sql.types._
    
    val schema = StructType((0 until 4).map(i => StructField(s"_$i", DoubleType)))
    
    val dfExplicit = spark.read.format("csv")
      .options(Map("header" -> "true"))
      .schema(schema)
      .load(path)
    
    new VectorAssembler()
      .setInputCols(dfExplicit.columns)
      .setOutputCol("vs")
      .transform(dfExplicit)
    
  • 2020-12-10 11:27

    If your problem is the dot (.) in the column name, you can enclose the column name in backticks (`):

    df.select("`col0.1`")
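
    When there are many such columns, the quoting can be applied programmatically. A small sketch in plain Scala; `Quoting.quote` is a hypothetical helper, and doubling an embedded backtick reflects my understanding of how Spark SQL escapes backticks inside a quoted identifier:

    ```scala
    // Hypothetical helper (not a Spark API): wrap a column name in backticks so
    // that a dot is read as part of the name rather than as field access.
    object Quoting {
      // An embedded backtick is doubled (assumed Spark SQL escaping rule).
      def quote(name: String): String =
        "`" + name.replace("`", "``") + "`"
    }
    ```

    With a DataFrame `df`, this could be used as `df.select(Quoting.quote("col0.1"))`, or mapped over `df.columns` to quote every name before selecting.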
