I have a DataFrame loaded from a Parquet file and I have to add a new column with some random data, but I need each of those random values to be different from the others. This is my actual code and the
You can make use of monotonically_increasing_id to generate values that are guaranteed to be distinct (strictly speaking they are unique rather than random, which satisfies the requirement that no two values are equal). Since monotonically_increasing_id returns a Long by default, cast it to String first and then define a UDF to append any string to it.
scala> var df = Seq(("Ron"), ("John"), ("Steve"), ("Brawn"), ("Rock"), ("Rick")).toDF("names")
scala> df.show
+-----+
|names|
+-----+
| Ron|
| John|
|Steve|
|Brawn|
| Rock|
| Rick|
+-----+
scala> val appendD = spark.udf.register("appendD", (s: String) => s.concat("D"))
scala> df = df.withColumn("ID", monotonically_increasing_id).selectExpr("names", "cast(ID as String) ID").withColumn("ID", appendD($"ID"))
scala> df.show
+-----+---+
|names| ID|
+-----+---+
| Ron| 0D|
| John| 1D|
|Steve| 2D|
|Brawn| 3D|
| Rock| 4D|
| Rick| 5D|
+-----+---+
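Note that monotonically_increasing_id guarantees uniqueness but not consecutiveness: the IDs are increasing within each partition, and there can be gaps between partitions. As a side note, the same result can be obtained without a UDF by using the built-in concat and lit functions (a sketch against the same df as above):

scala> import org.apache.spark.sql.functions.{concat, lit, monotonically_increasing_id}
scala> df.withColumn("ID", concat(monotonically_increasing_id.cast("string"), lit("D"))).show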
Spark >= 2.3
It is possible to disable some optimizations by marking a UDF as non-deterministic with the asNondeterministic method:
import org.apache.spark.sql.expressions.UserDefinedFunction
val f: UserDefinedFunction = ???
val fNonDeterministic: UserDefinedFunction = f.asNondeterministic
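For instance, a nullary UDF that draws a random value can be marked non-deterministic so the optimizer does not replace its calls with a single constant (a minimal sketch; randomId and the use of scala.util.Random are illustrative choices, not part of the original code):

import org.apache.spark.sql.functions.udf
import scala.util.Random

// Nullary UDF returning a random Int; asNondeterministic prevents the
// optimizer from collapsing its calls into one constant per query.
val randomId = udf(() => Random.nextInt()).asNondeterministic()
val dfWithRandom = df.withColumn("ID", randomId())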
Please make sure you understand the guarantees before using this option.
Spark < 2.3
The function passed to udf should be deterministic (with the possible exception of SPARK-20586), and calls to nullary functions can be replaced by constants. If you want to generate random numbers, use one of the built-in functions:

rand - generates i.i.d. samples from U[0.0, 1.0]
randn - generates i.i.d. samples from the standard normal distribution

and transform the output to obtain the required distribution, for example:
(rand * Integer.MAX_VALUE).cast("bigint").cast("string")
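For example, applied to a DataFrame (a sketch assuming a DataFrame named df; unlike monotonically_increasing_id this gives no uniqueness guarantee, only a low collision probability):

import org.apache.spark.sql.functions.rand

// Pseudo-random bigint per row, rendered as a string.
val dfRand = df.withColumn("ID", (rand() * Integer.MAX_VALUE).cast("bigint").cast("string"))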