Cast multiples columns in a DataFrame

Submitted by 孤者浪人 on 2020-01-06 06:11:44

Question


I'm on Databricks working on a classification problem. I have a DataFrame with 2000+ columns, and I want to cast every column that will become a feature to double.

    val array45 = data.columns.drop(1)

    for (element <- array45) {
      // withColumn returns a new DataFrame; the result is discarded here
      data.withColumn(element, data(element).cast("double"))
    }
    data.printSchema()

The cast to double works, but I'm not saving the result back into the DataFrame data: withColumn returns a new DataFrame rather than modifying data in place. If I create a new DataFrame inside the loop, it no longer exists outside the loop. I do not want to use a UDF.

How can I solve this?

EDIT: Thanks to both of you for your answers! I don't know why, but the answers from Shaido and Raul take a long time to compute. I think that comes from Databricks.


Answer 1:


You can simply write a function that casts a column to DoubleType and use that function in the select method.

The function:

    import org.apache.spark.sql.Column
    import org.apache.spark.sql.types._

    def func(column: Column) = column.cast(DoubleType)

Then use the function in select:

    val array45 = data.columns.drop(1)

    import org.apache.spark.sql.functions._
    data.select(array45.map(name => func(col(name))): _*).show(false)

I hope this answer is helpful.
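The map-then-expand shape above can be illustrated without Spark. This is a plain-Scala sketch (the names toDouble, raw, and total are illustrative, not Spark API): a casting function is mapped over a sequence, and the resulting collection is passed to a varargs method with the same `: _*` expansion syntax that select uses.

    ```scala
    // Cast each element by mapping a function over the collection,
    // mirroring array45.map(name => func(col(name))) in the answer.
    def toDouble(s: String): Double = s.toDouble

    val raw = Seq("1", "2.5", "3")
    val casted = raw.map(toDouble)

    // A varargs method, consumed with `: _*` just like select(...: _*).
    def total(xs: Double*): Double = xs.sum
    val t = total(casted: _*)
    ```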




Answer 2:


You can assign the new DataFrame to a var at every iteration, keeping the most recent one at all times.

var finalData = data.cache()
for (element <- array45) {
  finalData = finalData.withColumn(element, finalData(element).cast("double"))
}
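The idea behind the var is that each reassignment rebinds the name to the new value, so the final value keeps every change. A plain-Scala sketch of the same pattern, using an immutable Map in place of a DataFrame (purely illustrative):

    ```scala
    // Rebind the var each iteration so the final value accumulates
    // all updates, like finalData = finalData.withColumn(...) above.
    var types = Map("f1" -> "string", "f2" -> "string")
    for (c <- types.keys.toSeq) {
      types = types.updated(c, "double")
    }
    ```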



Answer 3:


Let me suggest using foldLeft:

    val array45 = data.columns.drop(1)

    // Thread the DataFrame through foldLeft so every cast is kept
    val newData = array45.foldLeft(data) { (acc, c) =>
      acc.withColumn(c, acc(c).cast("double"))
    }

    newData.printSchema()
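The foldLeft pattern can be illustrated without Spark. In this plain-Scala sketch (the Map stands in for a DataFrame's schema; all names are illustrative), the accumulator threads the updated value through every step, so no transformation is lost, which is exactly what the loop in the question failed to do.

    ```scala
    // A toy "schema": column name -> type, like a 2000-column DataFrame.
    val schema = Map("id" -> "string", "f1" -> "string", "f2" -> "string")

    // Skip the first (label) column, like data.columns.drop(1).
    val featureCols = Seq("f1", "f2")

    // foldLeft threads the accumulator through each update,
    // like acc.withColumn(c, acc(c).cast("double")) in the answer.
    val newSchema = featureCols.foldLeft(schema) { (acc, c) =>
      acc.updated(c, "double")
    }
    ```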

Hope this helps!



Source: https://stackoverflow.com/questions/46419530/cast-multiples-columns-in-a-dataframe
