Generate data using an existing dataset as the base dataset

栀梦 2021-01-16 12:16

I have a dataset of 100k unique records. To benchmark my code, I need to test on 5 million unique records, and I don't want to generate random data.

2 answers
  •  Happy的楠姐
    2021-01-16 12:50

    If you only want to generate the data in plain Scala, try this:

    val r = new scala.util.Random   // Scala random number generator
    val new_val = r.nextFloat()     // each call returns a random Float in [0.0, 1.0)
    

    Then add new_val to the maximum latitude in your data. A latitude that does not occur anywhere else makes the whole (latitude, longitude) pair unique.

    Alternatively, you can do the same thing with Spark in Scala.

    // ss is the SparkSession; load the sample data from your question as a DataFrame
    val latLongDF = ss.read.option("header", true).option("delimiter", ",").format("csv").load(mypath)
    +---------+----------+----+-----+
    | latitude| longitude|step|count|
    +---------+----------+----+-----+
    |25.696395|-80.297496|   1|    1|
    |25.699544|-80.297055|   1|    1|
    |25.698612|-80.292015|   1|    1|
    
    
    import org.apache.spark.sql.functions.{col, max, udf}

    // maximum latitude; the CSV was read without a schema, so cast the string column to double before taking the max
    val max_lat = latLongDF.select(max(col("latitude").cast("double"))).first.getDouble(0)

    val r = new scala.util.Random // Scala random object to get random numbers

    // udf that returns a random latitude greater than the existing maximum latitude
    val nextLat = udf(() => (max_lat + 0.000001 + r.nextFloat()).toFloat)
    

    In the line above, toFloat rounds the result to a Float, which can produce duplicate values. Remove it to keep full Double precision if you are fine with more than 6 decimal places in your latitudes, or apply the same method to longitude as well for better uniqueness.
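
    The rounding effect mentioned here can be seen in isolation. The following is a minimal sketch in plain Scala (no Spark required); the object name FloatCollapseDemo and the two sample values are illustrative assumptions, not values from the dataset.

```scala
object FloatCollapseDemo {
  // Two distinct Double values that differ only in the 7th decimal place.
  val a: Double = 28.5
  val b: Double = 28.5 + 1e-7

  // Near 28, adjacent Float values are roughly 0.000002 apart,
  // so both Doubles round to the same Float.
  val sameAsDouble: Boolean = a == b
  val sameAsFloat: Boolean = a.toFloat == b.toFloat

  def main(args: Array[String]): Unit = {
    println(s"equal as Double: $sameAsDouble")
    println(s"equal as Float:  $sameAsFloat")
  }
}
```

    With 100k or more generated rows, collisions of this kind become likely, which is why dropping toFloat (or also varying longitude) helps.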

    // build the new rows and union them with the existing DataFrame
    val new_df = latLongDF.withColumn("new_lat", nextLat())
      .select(col("new_lat").alias("latitude"), col("longitude"), col("step"), col("count"))
      .union(latLongDF)
    

    Sample of the newly generated data:

    +----------+----------+----+-----+
    |  latitude| longitude|step|count|
    +----------+----------+----+-----+
    | 28.446129|-80.297496|   1|    1|
    | 28.494934|-80.297055|   1|    1|
    | 28.605234|-80.292015|   1|    1|
    | 28.866316|-80.341607|   1|    1|
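
    Since the question asks for non-random data, a deterministic variant of the same idea (push latitudes past the existing range) may be worth considering. The sketch below is a different technique from the random udf above: it makes k shifted copies of the base data, with a shift wider than the base latitude range so the copies cannot overlap. It uses plain Scala collections to stay self-contained; the names ExpandDataset, Point, and expand are hypothetical, and it assumes the (latitude, longitude) pairs in the base data are already unique.

```scala
object ExpandDataset {
  // Hypothetical record type mirroring the columns shown in the answer.
  final case class Point(latitude: Double, longitude: Double, step: Int, count: Int)

  // Returns `copies` versions of `base`: copy 0 is the original, and copy k
  // has every latitude shifted by k * shift. The shift is slightly wider than
  // the latitude range of the base data, so the copies never overlap.
  def expand(base: Seq[Point], copies: Int): Seq[Point] = {
    require(base.nonEmpty, "base dataset must not be empty")
    val lats = base.map(_.latitude)
    val shift = (lats.max - lats.min) + 0.000001
    for {
      k <- 0 until copies
      p <- base
    } yield p.copy(latitude = p.latitude + k * shift)
  }

  def main(args: Array[String]): Unit = {
    // Sample rows taken from the question's data.
    val base = Seq(
      Point(25.696395, -80.297496, 1, 1),
      Point(25.699544, -80.297055, 1, 1),
      Point(25.698612, -80.292015, 1, 1)
    )
    // 50 copies of a 100k-row base would give 5 million rows.
    val expanded = expand(base, 50)
    val distinctPairs = expanded.map(p => (p.latitude, p.longitude)).distinct.size
    println(s"rows = ${expanded.size}, distinct pairs = $distinctPairs")
  }
}
```

    In Spark, the same arithmetic could presumably be expressed by cross-joining the DataFrame with a range of copy indices and adding index * shift to the latitude column. Note that for data spanning several degrees, 50 shifted copies can leave the valid latitude range, which may or may not matter for the benchmark.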
    
