DataFrame sample in Apache Spark | Scala

北海茫月 2020-12-05 07:20

I'm trying to draw samples from two DataFrames while keeping the ratio of their row counts. For example:

df1.count() = 10
df2.count() = 1000

noOfSamples = 10

7 answers
  •  粉色の甜心
    2020-12-05 07:46

    I too find the lack of sample-by-count functionality frustrating. If you don't mind creating a temporary view, I find the code below useful (df is your DataFrame, count is the sample size):

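    // Register df under a timestamp-based temp view name.
    val tableName = s"table_to_sample_${System.currentTimeMillis}"
    df.createOrReplaceTempView(tableName)
    // Give every row a random value, sort by it, and keep the first `count` rows.
    val sampled = sqlContext.sql(s"select *, rand() as random from ${tableName} order by random limit ${count}")
    // Clean up the temp view.
    sqlContext.dropTempTable(tableName)
    // Drop the helper column from the result.
    sampled.drop("random")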
    

    It returns exactly count rows as long as the DataFrame has at least as many rows as the sample size; otherwise it returns every row.
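
    The same idea can probably be written without a temp view by using the DataFrame API directly. The following is only a minimal sketch, assuming Spark 2.x or later with a local SparkSession; the object name ExactSample and the helper sampleExact are hypothetical names chosen for illustration, not part of Spark:

    ```scala
    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.rand

    object ExactSample {
      // Hypothetical helper: order rows by a random value and keep the first n.
      // Yields n rows when df has at least n rows, otherwise all of df's rows.
      def sampleExact(df: DataFrame, n: Int): DataFrame =
        df.orderBy(rand()).limit(n)

      def main(args: Array[String]): Unit = {
        // Local session for demonstration only; reuse your existing session in practice.
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("exact-sample")
          .getOrCreate()

        // Toy DataFrame with 1000 rows and a single "id" column.
        val df = spark.range(1000).toDF("id")

        val sampled = sampleExact(df, 10)
        sampled.show()

        spark.stop()
      }
    }
    ```

    Like the SQL version, this still has to assign a random value to every row and rank them, so it can be costly on large DataFrames.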
