I have 500 million rows in a Spark dataframe. I'm interested in using sample_n from dplyr because it will allow me to explicitly specify the sample size I want.
It is not a random sample. If you check the execution plan (using the optimizedPlan function as defined here), you'll see it is just a limit:
spark_data %>% sample_n(300) %>% optimizedPlan()
<jobj[168]>
  org.apache.spark.sql.catalyst.plans.logical.GlobalLimit
  GlobalLimit 300
  +- LocalLimit 300
     +- InMemoryRelation [country#151, continent#152, year#153, lifeExp#154, pop#155, gdpPercap#156], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), `gapminder`
           +- Scan ExistingRDD[country#151,continent#152,year#153,lifeExp#154,pop#155,gdpPercap#156]
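For reference, the optimizedPlan helper used above is little more than a thin wrapper over sparklyr's invoke API that reaches the Java-side QueryExecution object; a minimal sketch, assuming sparklyr and dplyr are loaded and df is a tbl_spark:

optimizedPlan <- function(df) {
  df %>%
    spark_dataframe() %>%        # underlying Spark Dataset (Java object reference)
    invoke("queryExecution") %>% # its QueryExecution
    invoke("optimizedPlan")      # the optimized logical plan
}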
This is further confirmed by show_query:
spark_data %>% sample_n(300) %>% show_query()
<SQL>
SELECT *
FROM (SELECT *
      FROM `gapminder` TABLESAMPLE (300 rows) ) `hntcybtgns`
and by the visualized execution plan.
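The TABLESAMPLE (n ROWS) clause behaves the same way if you issue it by hand; a sketch using sparklyr's DBI interface, where `sc` is assumed to be your spark_connect() connection:

library(DBI)
# Returns the first 300 rows, just like sample_n(300) above
dbGetQuery(sc, "SELECT * FROM gapminder TABLESAMPLE (300 ROWS)")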
Finally, if you check the Spark source, you'll see that this case is implemented with a simple LIMIT:
case ctx: SampleByRowsContext =>
  Limit(expression(ctx.expression), query)
I believe these semantics have been inherited from Hive, where the equivalent query takes the first n rows from each input split.
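You can see the consequence directly: because the "sample" is just a limit, repeated draws tend to return exactly the same rows. A quick sanity check (a sketch; only sensible for small n):

s1 <- spark_data %>% sample_n(300) %>% collect()
s2 <- spark_data %>% sample_n(300) %>% collect()
identical(s1, s2)  # typically TRUE - the same leading rows, not a fresh random draw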
In practice, getting a sample of an exact size is just very expensive, and you should avoid it unless strictly necessary (the same applies to large LIMITs).
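If you really need a random sample of roughly fixed size, one common workaround is to draw a fraction-based random sample slightly larger than needed and then trim it with a limit; a sketch, assuming spark_data from above and that a full row count is acceptable:

n <- 300
n_total <- sdf_nrow(spark_data)  # full count - itself expensive on 500 million rows

sampled <- spark_data %>%
  sdf_sample(fraction = min(1, 1.2 * n / n_total),  # oversample so that, with high probability, at least n rows survive
             replacement = FALSE, seed = 42) %>%
  head(n)                                           # trim to exactly n rows

The trailing head() is again just a LIMIT, but since it only trims a few surplus rows from an already random subset, the result is much closer to what sample_n suggests.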