Cast multiples columns in a DataFrame

Submitted by 孤者浪人 on 2020-01-06 06:11:44

Question


I'm on Databricks working on a classification problem. I have a DataFrame with 2000+ columns, and I want to cast every column that will become a feature to double.

    val array45 = data.columns.drop(1)

    for (element <- array45) {
      // withColumn returns a new DataFrame; the result is discarded here
      data.withColumn(element, data(element).cast("double"))
    }
    data.printSchema()

The cast to double works, but I'm not saving the result back into the DataFrame data: withColumn returns a new DataFrame rather than modifying data in place. If I create a new DataFrame inside the loop, it no longer exists outside the loop. I do not want to use a UDF.

How can I solve this?

EDIT: Thanks to both of you for your answers! I don't know why, but the answers from Shaido and Raul take a long time to compute. I think that comes from Databricks.


Answer 1:


You can simply write a function that casts a column to DoubleType and use that function in the select method.

The function:

    import org.apache.spark.sql.Column
    import org.apache.spark.sql.types._

    def func(column: Column) = column.cast(DoubleType)

Then use the function in select:

    val array45 = data.columns.drop(1)

    import org.apache.spark.sql.functions._
    data.select(array45.map(name => func(col(name))): _*).show(false)

I hope this answer is helpful.
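The map-then-expand shape above can be illustrated without Spark. This is a plain-Scala sketch (the names toDouble, raw, and total are illustrative, not Spark API): a casting function is mapped over a sequence, and the resulting collection is passed to a varargs method with the same `: _*` expansion syntax that select uses.

    ```scala
    // Cast each element by mapping a function over the collection,
    // mirroring array45.map(name => func(col(name))) in the answer.
    def toDouble(s: String): Double = s.toDouble

    val raw = Seq("1", "2.5", "3")
    val casted = raw.map(toDouble)

    // A varargs method, consumed with `: _*` just like select(...: _*).
    def total(xs: Double*): Double = xs.sum
    val t = total(casted: _*)
    ```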




Answer 2:


You can assign the new DataFrame to a var at every iteration, keeping the most recent one at all times.

var finalData = data.cache()
for (element <- array45) {
  finalData = finalData.withColumn(element, finalData(element).cast("double"))
}
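The idea behind the var is that each reassignment rebinds the name to the new value, so the final value keeps every change. A plain-Scala sketch of the same pattern, using an immutable Map in place of a DataFrame (purely illustrative):

    ```scala
    // Rebind the var each iteration so the final value accumulates
    // all updates, like finalData = finalData.withColumn(...) above.
    var types = Map("f1" -> "string", "f2" -> "string")
    for (c <- types.keys.toSeq) {
      types = types.updated(c, "double")
    }
    ```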



Answer 3:


Let me suggest using foldLeft:

    val array45 = data.columns.drop(1)

    // Thread the DataFrame through foldLeft so every cast is kept
    val newData = array45.foldLeft(data) { (acc, c) =>
      acc.withColumn(c, acc(c).cast("double"))
    }

    newData.printSchema()
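The foldLeft pattern can be illustrated without Spark. In this plain-Scala sketch (the Map stands in for a DataFrame's schema; all names are illustrative), the accumulator threads the updated value through every step, so no transformation is lost, which is exactly what the loop in the question failed to do.

    ```scala
    // A toy "schema": column name -> type, like a 2000-column DataFrame.
    val schema = Map("id" -> "string", "f1" -> "string", "f2" -> "string")

    // Skip the first (label) column, like data.columns.drop(1).
    val featureCols = Seq("f1", "f2")

    // foldLeft threads the accumulator through each update,
    // like acc.withColumn(c, acc(c).cast("double")) in the answer.
    val newSchema = featureCols.foldLeft(schema) { (acc, c) =>
      acc.updated(c, "double")
    }
    ```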

Hope this helps!



Source: https://stackoverflow.com/questions/46419530/cast-multiples-columns-in-a-dataframe
