Is sample_n really a random sample when used with sparklyr?


I have 500 million rows in a Spark DataFrame. I'm interested in using sample_n from dplyr because it will allow me to explicitly specify the sample size I want. Is the result actually a random sample?

1 Answer

    It is not. If you check the execution plan (optimizedPlan function as defined here) you'll see it is just a limit:

    spark_data %>% sample_n(300) %>% optimizedPlan()
    
    <jobj[168]>
      org.apache.spark.sql.catalyst.plans.logical.GlobalLimit
      GlobalLimit 300
    +- LocalLimit 300
       +- InMemoryRelation [country#151, continent#152, year#153, lifeExp#154, pop#155, gdpPercap#156], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), `gapminder`
             +- Scan ExistingRDD[country#151,continent#152,year#153,lifeExp#154,pop#155,gdpPercap#156] 
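
    The optimizedPlan helper isn't part of sparklyr itself; a minimal sketch of such a helper, assuming a sparklyr connection and using invoke() on the underlying JVM Dataset, could look like this:

    library(sparklyr)
    library(dplyr)

    # Hypothetical helper: walk from the tbl_spark to the underlying
    # org.apache.spark.sql.Dataset, then to its QueryExecution, and ask
    # for the optimized logical plan.
    optimizedPlan <- function(x) {
      x %>%
        spark_dataframe() %>%          # underlying Dataset jobj
        invoke("queryExecution") %>%   # QueryExecution for this Dataset
        invoke("optimizedPlan")        # optimized logical plan
    }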
    

    This is further confirmed by show_query:

    spark_data %>% sample_n(300) %>% show_query()
    
    <SQL>
    SELECT *
    FROM (SELECT *
    FROM `gapminder` TABLESAMPLE (300 rows) ) `hntcybtgns`
    

    The visualized execution plan (image not reproduced here) tells the same story.
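
    A quick empirical check (a sketch, reusing the spark_data tbl from above): because the plan is just a LIMIT over a cached relation, two consecutive "samples" will typically return exactly the same rows rather than independent random draws.

    s1 <- spark_data %>% sample_n(300) %>% collect()
    s2 <- spark_data %>% sample_n(300) %>% collect()

    # Usually 0: both "samples" grab the same leading rows.
    # (Exact behaviour can depend on partition ordering.)
    nrow(dplyr::setdiff(s1, s2))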

    Finally, if you check the Spark source, you'll see that this case is implemented with a simple LIMIT:

    case ctx: SampleByRowsContext =>
      Limit(expression(ctx.expression), query)
    

    I believe these semantics have been inherited from Hive, where the equivalent query takes the first n rows from each input split.

    In practice, getting a random sample of an exact size is very expensive, and you should avoid it unless strictly necessary (the same goes for large LIMITs).
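
    If a random sample of roughly that size is what you actually need, a sketch using sparklyr's sdf_sample() (fraction-based sampling via Dataset.sample) is usually the cheaper route; the exact-size trimming step below is just one possible workaround, not an official API:

    n    <- 300
    frac <- n / sdf_nrow(spark_data)   # target sampling fraction

    # Approximate-size random sample, computed distributed.
    approx_sample <- spark_data %>%
      sdf_sample(fraction = frac, replacement = FALSE, seed = 42)

    # If an exact size is truly required (and the cost is acceptable),
    # one option is to oversample slightly and trim after collecting.
    exact_sample <- spark_data %>%
      sdf_sample(fraction = 1.2 * frac, replacement = FALSE, seed = 42) %>%
      collect() %>%
      dplyr::slice_sample(n = n)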
