Question
A useful feature of dplyr is the ability to create calculated columns on the fly with mutate(). One such calculation is a quantile, which I used to be able to compute through sparklyr with the percentile() function, but for some reason it no longer works. Here is a detailed example.
First, create a sample data set:
require(dplyr)
require(sparklyr)
# sc is a connection to spark
my_df <- data.frame(col1 = sample(1:100,30)) %>% as_tibble()
my_df
# # A tibble: 30 x 1
# col1
# <int>
# 1 91
# 2 1
# 3 15
# 4 42
# 5 36
# 6 18
# 7 35
# 8 98
# 9 60
# 10 24
# # ... with 20 more rows
Now calculate the 90th percentile:
my_df %>% mutate(pct_90 = quantile(col1, .9))
# # A tibble: 30 x 2
# col1 pct_90
# <int> <dbl>
# 1 91 84.7
# 2 1 84.7
# 3 15 84.7
# 4 42 84.7
# 5 36 84.7
# 6 18 84.7
# 7 35 84.7
# 8 98 84.7
# 9 60 84.7
# 10 24 84.7
# # ... with 20 more rows
With Spark:
my_spark_df <- copy_to(sc, my_df, 'my_spark_df')
my_spark_df
# # Source: spark<my_spark_df> [?? x 1]
# col1
# * <int>
# 1 91
# 2 1
# 3 15
# 4 42
# 5 36
# 6 18
# 7 35
# 8 98
# 9 60
# 10 24
# # ... with more rows
Now calculate the 90th percentile on the Spark DataFrame:
my_spark_df %>% mutate(pct_90 = percentile(col1, .9))
Error: org.apache.spark.sql.AnalysisException: grouping expressions sequence is empty, and 'my_spark_df.`col1`' is not an aggregate function. Wrap '(percentile(my_spark_df.`col1`, CAST(0.9BD AS DOUBLE), 1L) AS `pct_90`)' in windowing function(s) or wrap 'my_spark_df.`col1`' in first() (or first_value) if you don't care which value you get.;;
Aggregate [col1#6964, percentile(col1#6964, cast(0.9 as double), 1, 0, 0) AS pct_90#7030]
+- SubqueryAlias `my_spark_df`
+- LogicalRDD [col1#6964], false
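For what it's worth, the error message itself suggests wrapping the percentile() call in a windowing function. The sketch below is my reading of that hint, written as raw Spark SQL through DBI (already attached in the session info below) rather than through mutate(); I have not confirmed this is the intended sparklyr idiom, and the OVER () window here is an assumption based purely on the error text:
```r
# Hypothetical workaround based on the error's hint: run the aggregate
# percentile() as a window function over the whole table via Spark SQL.
# Assumes `sc` is the existing Spark connection and `my_spark_df` is the
# table registered by copy_to() above.
library(DBI)

dbGetQuery(sc, "
  SELECT col1,
         percentile(col1, 0.9) OVER () AS pct_90
  FROM my_spark_df
")
```
Even if that works, I would still like to know why the plain mutate(pct_90 = percentile(col1, .9)) call stopped working.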
Session info:
> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Amazon Linux AMI 2018.03
Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2.2 lubridate_1.7.4 DBI_1.0.0 scales_1.0.0 ggplot2_3.1.0 sparklyr_1.0.0 tibble_1.4.2
[8] tidyr_0.8.1 dplyr_0.7.5
loaded via a namespace (and not attached):
[1] Rcpp_1.0.1 pillar_1.2.3 compiler_3.4.1 dbplyr_1.2.1 plyr_1.8.4 bindr_0.1.1
[7] r2d3_0.2.3 base64enc_0.1-3 tools_3.4.1 digest_0.6.15 jsonlite_1.5 gtable_0.3.0
[13] pkgconfig_2.0.1 rlang_0.3.2 cli_1.0.0 rstudioapi_0.7 parallel_3.4.1 yaml_2.1.19
[19] stringr_1.3.1 withr_2.1.2 httr_1.3.1 generics_0.0.2 htmlwidgets_1.3 rprojroot_1.3-2
[25] grid_3.4.1 tidyselect_0.2.4 glue_1.2.0 forge_0.2.0 R6_2.2.2 purrr_0.2.5
[31] magrittr_1.5 backports_1.1.2 htmltools_0.3.6 ellipsis_0.1.0 rsconnect_0.8.13 assertthat_0.2.0
[37] colorspace_1.4-1 labeling_0.3 config_0.3 utf8_1.1.4 stringi_1.2.3 openssl_1.0.1
[43] lazyeval_0.2.1 munsell_0.5.0 crayon_1.3.4
Source: https://stackoverflow.com/questions/57890412/calculating-order-statistics-percentile-using-sparklyr