Sparklyr: Use group_by and then concatenate strings from rows in a group

一个人的身影 2020-12-11 08:11

I am trying to use the group_by() and mutate() functions in sparklyr to concatenate rows in a group.

Here is a simple example that I think should work but doesn't:
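
The example code did not survive in this copy of the question. The sketch below is a reconstruction, not the asker's original text: the data values are taken from the joined output shown in the answer, the failing call is inferred from the `collapse` fragment in the quoted SQL, and the connection `sc`, the `master = "local"` setting, and the table name "d" are assumptions.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation is available.
sc <- spark_connect(master = "local")

# Data values taken from the joined output shown in the answer.
d <- data.frame(
  id = c("1", "1", "2", "2", "1", "2"),
  x  = c("200", "200", "200", "201", "201", "201"),
  y  = c("This", "That", "The", "Other", "End", "End"),
  stringsAsFactors = FALSE
)

# Copy the local data frame to Spark; "d" is a hypothetical table name.
d_sdf <- copy_to(sc, d, "d", overwrite = TRUE)

# On a local data frame this collapses y within each (id, x) group.
d %>%
  group_by(id, x) %>%
  mutate(y = paste(y, collapse = " "))

# The same pipeline on the Spark table is the call that fails:
# d_sdf %>%
#   group_by(id, x) %>%
#   mutate(y = paste(y, collapse = " "))
```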

1 Answer
  • 2020-12-11 08:54

    Spark SQL does not accept aggregate functions used outside an aggregation. That is why this works in dplyr on an ordinary data frame but not on a SparkDataFrame: sparklyr translates your commands into a SQL statement. You can see where this goes wrong in the second part of the error message:

    == SQL ==
    SELECT `id`, `x`, CONCAT_WS(' ', `y`, ' ' AS "collapse") AS `y`
    

    paste gets translated to CONCAT_WS. However, CONCAT_WS (like concat) pastes columns together within a row; it does not collapse values across the rows of a group.
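
    To check what sparklyr generates before running anything, one option is dplyr's show_query(), which prints the translated SQL for a lazy query. This is a minimal sketch rather than part of the original answer: the connection sc, the master = "local" setting, and the table name "d" are assumptions, and the exact SQL text depends on the installed sparklyr and dbplyr versions.

    ```r
    library(sparklyr)
    library(dplyr)

    # Assumption: a local Spark installation is available.
    sc <- spark_connect(master = "local")

    d <- data.frame(
      id = c("1", "1", "2", "2", "1", "2"),
      x  = c("200", "200", "200", "201", "201", "201"),
      y  = c("This", "That", "The", "Other", "End", "End"),
      stringsAsFactors = FALSE
    )
    d_sdf <- copy_to(sc, d, "d", overwrite = TRUE)  # "d" is a hypothetical table name

    # Build the query lazily; nothing is collected into R here.
    q <- d_sdf %>%
      group_by(id, x) %>%
      summarise(yconcat = paste(collect_list(y)))

    # Print the SQL statement that would be sent to Spark.
    show_query(q)
    ```

    Looking at the rendered statement shows directly which Spark SQL function each R call was mapped to.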

    A closer equivalent is collect_list or collect_set (the latter drops duplicates). Both aggregate across the rows of a group, but they return arrays (list columns) rather than strings.

    But you can build on that:

    If you do not want to have the same row replicated in your result you can use summarise, collect_list, and paste:

    # Collapse y within each (id, x) group; paste() on the collected array becomes CONCAT_WS(' ', ...)
    res <- d_sdf %>%
      group_by(id, x) %>%
      summarise(yconcat = paste(collect_list(y)))
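
    The paste() call relies on sparklyr translating it to CONCAT_WS with a space separator. If a different separator is needed, one possible variant is to call Spark SQL's concat_ws directly, since dbplyr leaves functions it does not recognise untranslated and passes them to Spark as-is. This is a sketch under assumptions, not the answer's own code: sc, master = "local", the table name "d", and the name res_sep are hypothetical. Spark documents collect_list as non-deterministic in element order after a shuffle, so the word order inside yconcat may vary.

    ```r
    library(sparklyr)
    library(dplyr)

    # Assumption: a local Spark installation is available.
    sc <- spark_connect(master = "local")

    d <- data.frame(
      id = c("1", "1", "2", "2", "1", "2"),
      x  = c("200", "200", "200", "201", "201", "201"),
      y  = c("This", "That", "The", "Other", "End", "End"),
      stringsAsFactors = FALSE
    )
    d_sdf <- copy_to(sc, d, "d", overwrite = TRUE)  # "d" is a hypothetical table name

    # concat_ws and collect_list are Spark SQL functions, passed through untranslated.
    # The first argument of concat_ws is the separator.
    res_sep <- d_sdf %>%
      group_by(id, x) %>%
      summarise(yconcat = concat_ws(", ", collect_list(y)))

    # Bring the four group rows back into R as a local tibble.
    out <- collect(res_sep)
    ```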
    

    result:

    Source:     lazy query [?? x 3]
    Database:   spark connection master=local[8] app=sparklyr local=TRUE
    Grouped by: id
    
         id     x   yconcat
      <chr> <chr>     <chr>
    1     1   201       End
    2     2   201 Other End
    3     1   200 This That
    4     2   200       The
    

    You can join this back onto your original data if you do want your rows replicated:

    d_sdf %>% left_join(res, by = c("id", "x"))
    

    result:

    Source:     lazy query [?? x 4]
    Database:   spark connection master=local[8] app=sparklyr local=TRUE
    
         id     x     y   yconcat
      <chr> <chr> <chr>     <chr>
    1     1   200  This This That
    2     1   200  That This That
    3     2   200   The       The
    4     2   201 Other Other End
    5     1   201   End       End
    6     2   201   End Other End
    