Q1. I am trying to draw a simple random sample from a Spark DataFrame (13 rows) using the sample function with parameters withReplacement: false, fraction: 0.6, but it gives
The RDD API includes takeSample, which will return a "sample of specified size in an array". It works by calling sample until it gets a sample at least as large as the requested size, then randomly taking the specified number of elements from that. The code comments say that it shouldn't have to iterate often, because the sampling fraction is biased toward large sample sizes.
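The oversample-then-trim idea described above can be sketched in plain Python, without Spark. This is only an illustration of the approach, not Spark's actual implementation: the helper names bernoulli_sample and take_sample and the 1.5 oversampling factor are assumptions made for this example.

```python
import random

def bernoulli_sample(rows, fraction, rng):
    """Keep each row independently with probability `fraction`.

    The result size varies from run to run, which is why a
    fraction-based sample does not return an exact row count.
    """
    return [r for r in rows if rng.random() < fraction]

def take_sample(rows, num, seed=None, oversample=1.5):
    """Return exactly min(num, len(rows)) rows, without replacement."""
    rng = random.Random(seed)
    num = min(num, len(rows))
    if num == 0:
        return []
    # Bias the fraction upward so that a single pass usually yields
    # enough rows. The 1.5 factor is an arbitrary assumption here.
    fraction = min(1.0, oversample * num / len(rows))
    candidates = bernoulli_sample(rows, fraction, rng)
    # Resample until the candidate set is large enough.
    while len(candidates) < num:
        candidates = bernoulli_sample(rows, fraction, rng)
    # Shuffle, then trim to the exact size requested.
    rng.shuffle(candidates)
    return candidates[:num]

rows = list(range(13))
picked = take_sample(rows, 8, seed=42)
print(len(picked))  # always 8, unlike a fraction-based sample
```

In Spark itself the equivalent call would be takeSample on the DataFrame's underlying RDD. Note that it collects the result to the driver as an array, so it suits small sample sizes.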