DataFrame sample in Apache Spark | Scala

北海茫月 2020-12-05 07:20

I'm trying to draw samples from two DataFrames while keeping the ratio of their row counts, e.g.

df1.count() = 10
df2.count() = 1000

noOfSamples = 10
         


        
7 answers
  •  春和景丽
    2020-12-05 07:56

    I use this function for random sampling when an exact number of records is desired:

    def row_count_sample(df, row_count, with_replacement=False, random_seed=113170):
        """Return a random sample of at most row_count rows from a PySpark DataFrame."""
        total = df.count()
        if total == 0:
            return df  # nothing to sample from; also avoids dividing by zero

        # df.sample() does not guarantee an exact record count: it may return more
        # or fewer rows than fraction * total. Oversample by 8% and trim below.
        ratio = min(1.0, 1.08 * float(row_count) / total)

        result_df = (df
                     .sample(with_replacement, ratio, random_seed)
                     .limit(row_count)  # trim the oversample down to the requested row count
                     )

        return result_df
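
To apply this to the two DataFrames in the question, the requested total first has to be split between them. The question is cut off, so the following is only a sketch under the assumption that "ratio of count maintained" means each DataFrame contributes rows in proportion to its size. The helper name `proportional_counts` is hypothetical and not part of Spark; it is plain Python and uses the largest-remainder method so the parts always add up to the requested total.

```python
def proportional_counts(counts, total_samples):
    """Split total_samples across groups in proportion to their sizes.

    counts        -- row counts of each DataFrame, e.g. [df1.count(), df2.count()]
    total_samples -- total number of rows wanted across all DataFrames
    Hypothetical helper: it does not cap a part at the group's own size.
    """
    grand_total = sum(counts)
    if grand_total == 0:
        return [0] * len(counts)

    # Integer arithmetic: quotient is the guaranteed share, remainder ranks the leftovers.
    shares = [divmod(total_samples * c, grand_total) for c in counts]
    parts = [q for q, _ in shares]
    leftover = total_samples - sum(parts)

    # Hand the remaining rows to the groups with the largest remainders.
    order = sorted(range(len(counts)), key=lambda i: shares[i][1], reverse=True)
    for i in order[:leftover]:
        parts[i] += 1
    return parts


# Numbers from the question: df1.count() = 10, df2.count() = 1000, noOfSamples = 10
print(proportional_counts([10, 1000], 10))
```

Each resulting count could then be passed as `row_count` to `row_count_sample` for the corresponding DataFrame. With the numbers from the question the split is `[0, 10]`, so the small DataFrame receives no rows at all; if every DataFrame must be represented, a minimum per DataFrame would have to be added on top of this.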
    
