How do simple random sampling and the DataFrame sample function work in Apache Spark (Scala)?

说谎 2020-12-06 21:28

Q1. I am trying to get a simple random sample out of a Spark dataframe (13 rows) using the sample function with parameters withReplacement: false, fraction: 0.6 but it gives

2 Answers
  •  一生所求
    2020-12-06 22:03

    The RDD API includes takeSample, which returns a "sample of specified size in an array". It works by calling sample repeatedly until the result holds at least the requested number of elements, then randomly taking exactly that many from it. The comments in the source note that it should rarely need to iterate, because the sampling fraction is deliberately biased toward producing a sample that is too large.
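To make the difference concrete, here is a minimal sketch contrasting the two calls. It assumes a local SparkSession and a hypothetical 13-row DataFrame with a single `id` column standing in for the one in the question; the seed and the sample size of 8 are arbitrary choices for illustration.

```scala
import org.apache.spark.sql.SparkSession

object SampleSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("sample-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical 13-row DataFrame standing in for the one in the question.
    val df = (1 to 13).toDF("id")

    // sample() without replacement keeps each row independently with
    // probability `fraction`, so the number of rows returned is not fixed
    // and can differ from 0.6 * 13.
    val approx = df.sample(withReplacement = false, fraction = 0.6, seed = 42L)
    println(s"sample() returned ${approx.count()} rows")

    // takeSample() on the underlying RDD returns exactly `num` elements
    // (when num <= row count and withReplacement = false), collected
    // into a local Array on the driver.
    val exact = df.rdd.takeSample(withReplacement = false, num = 8, seed = 42L)
    println(s"takeSample() returned ${exact.length} rows")

    spark.stop()
  }
}
```

Note that takeSample brings the result to the driver as an array, so it suits small samples; sample stays distributed but only approximates the requested fraction.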
