The following example describes how you can't calculate the number of distinct values without aggregating the rows when using dplyr with sparklyr.
Is there a workaround?
I want to link this thread, which answers the question for sparklyr.
I think using approx_count_distinct is the best solution. In my experience, dbplyr doesn't translate this function when it is used as a window function, so it is better to write the SQL yourself.
mtcars_spk <- copy_to(sc, mtcars, "mtcars_spk", overwrite = TRUE)
mtcars_spk2 <- mtcars_spk %>%
  dplyr::mutate(test = paste0(gear, " ", carb)) %>%
  dplyr::mutate(discnt = sql("approx_count_distinct(test) OVER (PARTITION BY cyl)"))
This thread approaches the problem more generally and discusses CountDistinct vs. approxCountDistinct.
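Since approx_count_distinct returns an estimate, here is a minimal sketch of one way to get an exact per-group distinct count while keeping every row: aggregate once per group, then join the result back. It is not taken from either linked thread. It assumes a local Spark installation reachable via `spark_connect(master = "local")`, and the name `distinct_per_cyl` is purely illustrative.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark install is available; swap in your own connection.
sc <- spark_connect(master = "local")

# Same setup as above: build the combined gear/carb key on the Spark table.
mtcars_spk <- copy_to(sc, mtcars, "mtcars_spk", overwrite = TRUE) %>%
  mutate(test = paste0(gear, " ", carb))

# Exact distinct count per cyl; n_distinct() is translated to COUNT(DISTINCT ...).
distinct_per_cyl <- mtcars_spk %>%
  group_by(cyl) %>%
  summarise(discnt = n_distinct(test))

# Join the per-group counts back so no rows are lost.
mtcars_spk2 <- mtcars_spk %>%
  left_join(distinct_per_cyl, by = "cyl")
```

The join adds an extra shuffle compared with the window version, so on large tables the approximate window function may still be the more practical choice.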