How to add a new column with random values to an existing DataFrame in Scala

再見小時候 2020-11-27 08:51

I have a DataFrame loaded from a Parquet file and I need to add a new column with random data, where each value must be different from the others. This is my actual code and the

2 Answers
  • 2020-11-27 09:05

    You can use monotonically_increasing_id to generate a value that is unique for every row. Note that these values are increasing rather than truly random.

    Because monotonically_increasing_id returns a Long, cast the result to String first; you can then define a UDF to append any string to it.

    scala> var df = Seq("Ron", "John", "Steve", "Brawn", "Rock", "Rick").toDF("names")
    scala> df.show()
    +-----+
    |names|
    +-----+
    |  Ron|
    | John|
    |Steve|
    |Brawn|
    | Rock|
    | Rick|
    +-----+
    
    scala> val appendD = spark.sqlContext.udf.register("appendD", (s: String) => s.concat("D"))
    
    scala> df = df.withColumn("ID", monotonically_increasing_id).selectExpr("names", "cast(ID as String) ID").withColumn("ID", appendD($"ID"))
    scala> df.show()
    +-----+---+
    |names| ID|
    +-----+---+
    |  Ron| 0D|
    | John| 1D|
    |Steve| 2D|
    |Brawn| 3D|
    | Rock| 4D|
    | Rick| 5D|
    +-----+---+
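As a hedged variation on the session above, the suffix can also be appended with the built-in concat and lit functions instead of a registered UDF. This is only a sketch: the object name UniqueIdSketch and the method withSuffixedId are hypothetical, and the local SparkSession exists purely for illustration. The generated values are unique and increasing, not random.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{concat, lit, monotonically_increasing_id}

// Hypothetical helper object, not part of the original answer.
object UniqueIdSketch {
  // Adds an "ID" column: a unique Long rendered as a string with a "D" suffix.
  // IDs are unique and increasing, but consecutive only within a partition.
  def withSuffixedId(df: DataFrame): DataFrame =
    df.withColumn("ID", concat(monotonically_increasing_id().cast("string"), lit("D")))

  def main(args: Array[String]): Unit = {
    // Local session for illustration; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder().master("local[*]").appName("unique-id-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq("Ron", "John", "Steve", "Brawn", "Rock", "Rick").toDF("names")
    withSuffixedId(df).show()
    spark.stop()
  }
}
```

Using built-in functions keeps the whole expression visible to the optimizer, which a UDF does not.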
    
  • 2020-11-27 09:14

    Spark >= 2.3

    It is possible to disable some optimizations using the asNondeterministic method:

    import org.apache.spark.sql.expressions.UserDefinedFunction
    
    val f: UserDefinedFunction = ???
    val fNonDeterministic: UserDefinedFunction = f.asNondeterministic
    

    Please make sure you understand the guarantees before using this option.
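A minimal sketch of how this might look in practice, assuming the goal is one random identifier per row. The object name NondeterministicUdfSketch, the method withRandomId, and the column name random_id are hypothetical; using java.util.UUID is an assumption, not something the answer prescribes. UUID collisions are extremely unlikely but not strictly impossible, and the values can change if the DataFrame is recomputed without caching.

```scala
import java.util.UUID
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

// Hypothetical helper object, not part of the original answer.
object NondeterministicUdfSketch {
  // Nullary UDF returning a fresh UUID string on each call. Marking it
  // nondeterministic tells the optimizer not to assume repeated calls agree.
  val randomId: UserDefinedFunction =
    udf(() => UUID.randomUUID().toString).asNondeterministic()

  // Adds a "random_id" column with one generated UUID string per row.
  def withRandomId(df: DataFrame): DataFrame =
    df.withColumn("random_id", randomId())

  def main(args: Array[String]): Unit = {
    // Local session for illustration; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder().master("local[*]").appName("nondeterministic-udf-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq("Ron", "John", "Steve").toDF("names")
    // cache() so later actions see the same generated values.
    withRandomId(df).cache().show(truncate = false)
    spark.stop()
  }
}
```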

    Spark < 2.3

    A function passed to udf should be deterministic (with the possible exception of SPARK-20586), and calls to nullary functions can be replaced by constants. If you want to generate random numbers, use one of the built-in functions:

    • rand - Generate a random column with independent and identically distributed (i.i.d.) samples from U[0.0, 1.0].
    • randn - Generate a column with independent and identically distributed (i.i.d.) samples from the standard normal distribution.

    and transform the output to obtain the required distribution, for example:

    (rand * Integer.MAX_VALUE).cast("bigint").cast("string")
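The expression above can be attached to a DataFrame with withColumn. The following is a sketch under stated assumptions: the object name RandColumnSketch, the method withRandomString, and the column name random are hypothetical. Note that rand draws independent samples, so two rows can receive the same value; this alone does not guarantee distinct values.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.rand

// Hypothetical helper object, not part of the original answer.
object RandColumnSketch {
  // Scales a uniform sample to the Int range, truncates it to a
  // bigint, and renders it as a string. Collisions between rows are possible.
  def withRandomString(df: DataFrame): DataFrame =
    df.withColumn("random", (rand() * Int.MaxValue).cast("bigint").cast("string"))

  def main(args: Array[String]): Unit = {
    // Local session for illustration; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder().master("local[*]").appName("rand-column-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq("Ron", "John", "Steve").toDF("names")
    withRandomString(df).show()
    spark.stop()
  }
}
```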
    