How to select an exact number of random rows from a DataFrame

狂风中的少年 submitted on 2019-12-25 09:18:07

Question


How can I select an exact number of random rows from a DataFrame efficiently? The data contains an index column that can be used. If I have to use maximum size, what is more efficient, count() or max() on the index column?


Answer 1:


One possible approach is to count the rows with .count(), then use sample() from Python's random library to draw the required number of distinct values from that range. Finally, use the resulting list of numbers, vals, to filter the DataFrame on the index column. This assumes the index column holds the consecutive integers from 1 through the row count.
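The sampling step on its own can be tried without a Spark session. This is only a minimal sketch, assuming the index values run from 1 to the row count; `draw_index_values` and its `seed` parameter are hypothetical names introduced for illustration and are not part of the answer.

```python
import random

def draw_index_values(n_rows, records, seed=None):
    """Draw `records` distinct index values from 1..n_rows (inclusive)."""
    # Hypothetical helper: a seeded generator makes the draw repeatable
    rng = random.Random(seed)
    # range() excludes its end point, so n_rows + 1 keeps the last row eligible
    return rng.sample(range(1, n_rows + 1), records)

vals = draw_index_values(n_rows=10, records=3, seed=42)
print(vals)  # three distinct integers between 1 and 10
```

Because random.sample() draws without replacement, the list never contains duplicates, and asking for more values than there are rows raises a ValueError.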

import random

def sampler(df, col, records):
    # Count the rows; assumes `col` holds the consecutive integers 1..N
    colmax = df.count()

    # Draw `records` distinct values from 1..colmax
    # (range() excludes its end point, hence colmax + 1)
    vals = random.sample(range(1, colmax + 1), records)

    # Keep only the rows whose `col` value was drawn
    return df.filter(df[col].isin(vals))

Example (here sc is the SparkContext; the sampled rows vary between runs):

df = sc.parallelize([(1,1),(2,1),
                     (3,1),(4,0),
                     (5,0),(6,1),
                     (7,1),(8,0),
                     (9,0),(10,1)]).toDF(["a","b"])

sampler(df,"a",3).show()
+---+---+
|  a|  b|
+---+---+
|  3|  1|
|  4|  0|
|  6|  1|
+---+---+
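The approach relies on the index column holding every integer from 1 to the row count. If some values are missing, for instance after rows have been filtered out, a drawn value may match no row and fewer rows than requested come back. The pure-Python sketch below models that case, with a list of tuples standing in for the DataFrame; `rows`, `sample_by_range`, and `sample_by_values` are hypothetical names used only for illustration. Drawing from the index values that actually exist keeps the count exact, at the cost of collecting those values first.

```python
import random

# Stand-in for a DataFrame whose index column "a" has gaps (3, 6 and 7 are missing)
rows = [(1, 1), (2, 1), (4, 0), (5, 0), (8, 0), (9, 0), (10, 1)]

def sample_by_range(rows, records, rng):
    # Mirrors the count-based approach: draw from 1..len(rows)
    vals = set(rng.sample(range(1, len(rows) + 1), records))
    return [r for r in rows if r[0] in vals]

def sample_by_values(rows, records, rng):
    # Draw from the index values that are actually present
    present = [r[0] for r in rows]
    vals = set(rng.sample(present, records))
    return [r for r in rows if r[0] in vals]

rng = random.Random(0)
exact = sample_by_values(rows, 3, rng)
print(len(exact))  # always 3, because every drawn value matches exactly one row
```

With `sample_by_range`, any drawn value of 3, 6 or 7 matches nothing, so the result can be shorter than requested.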


Source: https://stackoverflow.com/questions/40454334/how-to-select-an-exact-number-of-random-rows-from-dataframe
