I am trying to use the group_by() and mutate() functions in sparklyr to concatenate rows in a group.
Here is a simple example that I think should work but doesn't:
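(A minimal reconstruction of that attempt; the data is pieced together from the outputs shown further down, and the connection object sc and table name d are assumptions for illustration.)

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

d <- data.frame(id = c("1", "1", "2", "2", "1", "2"),
                x  = c("200", "200", "200", "201", "201", "201"),
                y  = c("This", "That", "The", "Other", "End", "End"))
d_sdf <- copy_to(sc, d, "d")

# Works on the local data frame d, but errors on the Spark table:
d_sdf %>%
  group_by(id, x) %>%
  mutate(y = paste(y, collapse = " "))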
Spark SQL doesn't like it if you use aggregate functions without aggregating, which is why this works in dplyr on an ordinary data frame but not on a Spark DataFrame: sparklyr translates your commands into a SQL statement. You can see this going wrong in the second part of the error message:
== SQL ==
SELECT `id`, `x`, CONCAT_WS(' ', `y`, ' ' AS "collapse") AS `y`
paste gets translated to CONCAT_WS; concat, by contrast, would paste columns together.
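You can inspect this translation yourself without sending anything to Spark: show_query (from dplyr/dbplyr) renders the generated SQL for a remote table.

# Render the SQL sparklyr generates for the failing pipeline:
d_sdf %>%
  group_by(id, x) %>%
  mutate(y = paste(y, collapse = " ")) %>%
  show_query()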
Better equivalents would be collect_list and collect_set, but they produce list outputs.
But you can build on that:
If you do not want the same row replicated in your result, you can use summarise, collect_list, and paste (paste translates to CONCAT_WS, and Spark's concat_ws joins the elements of an array argument with the separator, so the collected list collapses to a single string per group):
res <- d_sdf %>%
  group_by(id, x) %>%
  summarise(yconcat = paste(collect_list(y)))
result:
Source:     lazy query [?? x 3]
Database:   spark connection master=local[8] app=sparklyr local=TRUE
Grouped by: id
     id     x yconcat
  <chr> <chr> <chr>
1     1   201 End
2     2   201 Other End
3     1   200 This That
4     2   200 The
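If you also want duplicates within a group collapsed to a single occurrence, collect_set (mentioned above) can stand in for collect_list; a sketch on the same data (note that element order is not guaranteed):

res_unique <- d_sdf %>%
  group_by(id, x) %>%
  summarise(yconcat = paste(collect_set(y)))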
You can join res back onto your original data if you do want your rows replicated:
d_sdf %>% left_join(res)
result:
Source:   lazy query [?? x 4]
Database: spark connection master=local[8] app=sparklyr local=TRUE
     id     x     y yconcat
  <chr> <chr> <chr> <chr>
1     1   200 This  This That
2     1   200 That  This That
3     2   200 The   The
4     2   201 Other Other End
5     1   201 End   End
6     2   201 End   Other End
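One aside: left_join(res) relies on dplyr matching the common columns (here id and x) automatically; if the schemas might drift, spelling the keys out is safer:

d_sdf %>% left_join(res, by = c("id", "x"))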