Question
A useful feature of dplyr is the ability to create calculated columns on the fly with mutate(). One such calculation is a quantile, which I used to be able to compute through sparklyr with the percentile() function, but for some reason it no longer works. Here is a detailed example.
First, create a sample data set:
require(dplyr)
require(sparklyr)
# sc is a connection to spark
my_df <- data.frame(col1 = sample(1:100,30)) %>% as_tibble()
my_df
# # A tibble: 30 x 1
# col1
# <int>
# 1 91
# 2 1
# 3 15
# 4 42
# 5 36
# 6 18
# 7 35
# 8 98
# 9 60
# 10 24
# # ... with 20 more rows
Now calculate the 90th percentile:
my_df %>% mutate(pct_90 = quantile(col1, .9))
# # A tibble: 30 x 2
# col1 pct_90
# <int> <dbl>
# 1 91 84.7
# 2 1 84.7
# 3 15 84.7
# 4 42 84.7
# 5 36 84.7
# 6 18 84.7
# 7 35 84.7
# 8 98 84.7
# 9 60 84.7
# 10 24 84.7
# # ... with 20 more rows
With Spark:
my_spark_df <- copy_to(sc, my_df, 'my_spark_df')
my_spark_df
# # Source: spark<my_spark_df> [?? x 1]
# col1
# * <int>
# 1 91
# 2 1
# 3 15
# 4 42
# 5 36
# 6 18
# 7 35
# 8 98
# 9 60
# 10 24
# # ... with more rows
Now calculate the 90th percentile on the Spark DataFrame:
my_spark_df %>% mutate(pct_90 = percentile(col1, .9))
Error: org.apache.spark.sql.AnalysisException: grouping expressions sequence is empty, and 'my_spark_df.`col1`' is not an aggregate function. Wrap '(percentile(my_spark_df.`col1`, CAST(0.9BD AS DOUBLE), 1L) AS `pct_90`)' in windowing function(s) or wrap 'my_spark_df.`col1`' in first() (or first_value) if you don't care which value you get.;;
Aggregate [col1#6964, percentile(col1#6964, cast(0.9 as double), 1, 0, 0) AS pct_90#7030]
+- SubqueryAlias `my_spark_df`
+- LogicalRDD [col1#6964], false
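For what it's worth, the error message itself suggests wrapping the percentile() call in a windowing function. The sketch below is my reading of that hint, written as raw Spark SQL through DBI (already attached in the session info below) rather than through mutate(); I have not confirmed this is the intended sparklyr idiom, and the OVER () window here is an assumption based purely on the error text:
```r
# Hypothetical workaround based on the error's hint: run the aggregate
# percentile() as a window function over the whole table via Spark SQL.
# Assumes `sc` is the existing Spark connection and `my_spark_df` is the
# table registered by copy_to() above.
library(DBI)

dbGetQuery(sc, "
  SELECT col1,
         percentile(col1, 0.9) OVER () AS pct_90
  FROM my_spark_df
")
```
Even if that works, I would still like to know why the plain mutate(pct_90 = percentile(col1, .9)) call stopped working.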
Session info:
> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Amazon Linux AMI 2018.03
Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2.2 lubridate_1.7.4 DBI_1.0.0 scales_1.0.0 ggplot2_3.1.0 sparklyr_1.0.0 tibble_1.4.2
[8] tidyr_0.8.1 dplyr_0.7.5
loaded via a namespace (and not attached):
[1] Rcpp_1.0.1 pillar_1.2.3 compiler_3.4.1 dbplyr_1.2.1 plyr_1.8.4 bindr_0.1.1
[7] r2d3_0.2.3 base64enc_0.1-3 tools_3.4.1 digest_0.6.15 jsonlite_1.5 gtable_0.3.0
[13] pkgconfig_2.0.1 rlang_0.3.2 cli_1.0.0 rstudioapi_0.7 parallel_3.4.1 yaml_2.1.19
[19] stringr_1.3.1 withr_2.1.2 httr_1.3.1 generics_0.0.2 htmlwidgets_1.3 rprojroot_1.3-2
[25] grid_3.4.1 tidyselect_0.2.4 glue_1.2.0 forge_0.2.0 R6_2.2.2 purrr_0.2.5
[31] magrittr_1.5 backports_1.1.2 htmltools_0.3.6 ellipsis_0.1.0 rsconnect_0.8.13 assertthat_0.2.0
[37] colorspace_1.4-1 labeling_0.3 config_0.3 utf8_1.1.4 stringi_1.2.3 openssl_1.0.1
[43] lazyeval_0.2.1 munsell_0.5.0 crayon_1.3.4
Source: https://stackoverflow.com/questions/57890412/calculating-order-statistics-percentile-using-sparklyr