How to split column of vectors into two columns?

Question

I use PySpark.

Spark ML's Random Forest output DataFrame has a column "probability" which is a vector with two values. I just want to add two columns to the output DataFrame, "prob1" and "prob2", which correspond to the first and second values in the vector.

I've tried the following:

output2 = output.withColumn('prob1', output.map(lambda r: r['probability'][0]))

but I get the error that 'col should be Column'.

Any suggestions on how to transform a column of vectors into columns of its values?

Answer 1:

I figured out the problem with the UDF suggestion that returns value[0] directly. In PySpark, "dense vectors are simply represented as NumPy array objects", so the issue is a mismatch between Python and NumPy types. You need to call .item() to convert a numpy.float64 to a Python float.

The following code works:

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

split1_udf = udf(lambda value: value[0].item(), FloatType())
split2_udf = udf(lambda value: value[1].item(), FloatType())

output2 = randomforestoutput.select(split1_udf('probability').alias('c1'), split2_udf('probability').alias('c2'))

Or, to append these columns to the original DataFrame (withColumn returns a new DataFrame, so assign the result):

output2 = randomforestoutput.withColumn('c1', split1_udf('probability')).withColumn('c2', split2_udf('probability'))
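
The type difference behind this fix can be seen without Spark. The snippet below is only a sketch: the array `probability` is a hypothetical stand-in for the NumPy array behind a dense vector, not something taken from a real model output.

```python
import numpy as np

# Hypothetical stand-in for the NumPy array behind a two-element dense vector
probability = np.array([0.25, 0.75])

raw = probability[0]               # element as NumPy returns it
converted = probability[0].item()  # same value as a built-in Python float

print(type(raw).__name__)        # float64
print(type(converted).__name__)  # float
```

Calling float(probability[0]) gives the same built-in float and may be the more defensive choice if some elements could already be plain Python floats.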

Answer 2:

I had the same problem. Below is the code adjusted for a vector of length n.

# i=i binds the current index to each lambda; without it, every lambda may see the last i
splits = [udf(lambda value, i=i: value[i].item(), FloatType()) for i in range(n)]
out = tstDF.select(*[s('features').alias("Column" + str(i)) for i, s in enumerate(splits)])
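
One thing to watch for with lambdas created in a list comprehension: Python looks up the loop variable i when the lambda is called, not when it is defined. Whether this affects a UDF depends on when Spark serializes the function, so it seems safer to bind i explicitly. A minimal Spark-free sketch of the effect, using a plain list as a hypothetical stand-in for the vector:

```python
n = 3
vector = [0.1, 0.2, 0.7]  # plain list standing in for a vector

# Late binding: each lambda reads i when called, after the loop has finished
late = [lambda value: value[i] for i in range(n)]

# Default argument: each lambda keeps the value i had when it was created
bound = [lambda value, i=i: value[i] for i in range(n)]

print([f(vector) for f in late])   # [0.7, 0.7, 0.7]
print([f(vector) for f in bound])  # [0.1, 0.2, 0.7]
```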

Answer 3:

You may want to use one UDF to extract the first value and another to extract the second. You can then use the UDFs in a select call on the random forest output DataFrame. Example:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import FloatType

# float() turns the numpy.float64 element into a plain Python float for FloatType
split1_udf = udf(lambda value: float(value[0]), FloatType())
split2_udf = udf(lambda value: float(value[1]), FloatType())
output2 = randomForrestOutput.select(split1_udf(col("probability")).alias("c1"),
                                     split2_udf(col("probability")).alias("c2"))

This should give you a DataFrame output2 with columns c1 and c2, corresponding to the first and second values of the vector stored in the probability column.
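
On Spark 3.0 or later, a UDF may not be needed at all: pyspark.ml.functions.vector_to_array converts a vector column into an array column that can be indexed directly. The helper below is a sketch under that version assumption; the name split_vector_column and its parameters are made up for illustration and are not part of Spark.

```python
def split_vector_column(df, vector_col, names):
    """Return df plus one double column per entry in names,
    taken from the vector in vector_col by position."""
    # Imported inside the function so the definition loads without Spark installed
    from pyspark.ml.functions import vector_to_array  # added in Spark 3.0
    from pyspark.sql.functions import col

    # Convert the ML vector into an array<double> column expression
    arr = vector_to_array(col(vector_col))

    # arr[i] selects the i-th array element; alias gives it the requested name
    return df.select("*", *[arr[i].alias(name) for i, name in enumerate(names)])
```

With the DataFrame from the question, this would be called as output2 = split_vector_column(output, "probability", ["prob1", "prob2"]).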

Source: https://stackoverflow.com/questions/37311688/how-to-split-column-of-vectors-into-two-columns

Tags

apache-spark

pyspark

apache-spark-ml