How to do custom operations on GroupedData in Spark?

情深已故 2020-12-17 01:41

I want to rewrite some of my code written with RDDs to use DataFrames. It was working quite smoothly until I found this:

    events
     // (truncated in the original post; completed using the column names from the answer below)
     .keyBy(row => (row.getServiceId, row.getClientCreateTimestamp, row.getClientId))
     .reduceByKey((e1, e2) => if (e1.getClientSendTimestamp <= e2.getClientSendTimestamp) e1 else e2)
     .values
1 Answer
  • 2020-12-17 02:20

    GroupedData cannot be used directly. The data is not physically grouped; groupBy is only a logical operation. You have to apply some variant of the agg method, for example:

    events
     .groupBy($"service_id", $"client_create_timestamp", $"client_id")
     .min("client_send_timestamp")
    

    or

    import org.apache.spark.sql.functions.min

    events
     .groupBy($"service_id", $"client_create_timestamp", $"client_id")
     .agg(min($"client_send_timestamp"))
    

    where client_send_timestamp is the column you want to aggregate.

    If you want to keep information beyond the aggregate, join the aggregated result back to the original DataFrame or use window functions, as sketched below - see Find maximum row per group in Spark DataFrame.
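    For example, a minimal window-based sketch (assuming the same column names as above and that spark.implicits._ is in scope, as in spark-shell):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.min

    // Compute the per-group minimum without collapsing rows,
    // then keep only the rows that attain it.
    val w = Window.partitionBy($"service_id", $"client_create_timestamp", $"client_id")

    events
     .withColumn("min_send", min($"client_send_timestamp").over(w))
     .where($"client_send_timestamp" === $"min_send")
     .drop("min_send")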

    Spark also supports user-defined aggregate functions, sketched below - see How to define and use a User-Defined Aggregate Function in Spark SQL?
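    As an illustration only, a minimal UserDefinedAggregateFunction computing a minimum over a LongType column could look like this (LongMin is a hypothetical name; the members are the standard UDAF contract in Spark 1.5-2.x):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
    import org.apache.spark.sql.types._

    object LongMin extends UserDefinedAggregateFunction {
      // One long input column and one long buffer slot.
      def inputSchema: StructType = StructType(StructField("value", LongType) :: Nil)
      def bufferSchema: StructType = StructType(StructField("min", LongType) :: Nil)
      def dataType: DataType = LongType
      def deterministic: Boolean = true

      // Start from the largest possible value so any input lowers it.
      def initialize(buffer: MutableAggregationBuffer): Unit =
        buffer(0) = Long.MaxValue

      def update(buffer: MutableAggregationBuffer, input: Row): Unit =
        if (!input.isNullAt(0)) buffer(0) = math.min(buffer.getLong(0), input.getLong(0))

      def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
        buffer1(0) = math.min(buffer1.getLong(0), buffer2.getLong(0))

      def evaluate(buffer: Row): Any = buffer.getLong(0)
    }

    Usage, e.g. events.groupBy($"service_id").agg(LongMin($"client_send_timestamp")).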

    Spark 2.0+

    You can use Dataset.groupByKey, which, together with mapGroups or flatMapGroups, exposes each group as an iterator, for example:
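    A minimal sketch (the Event case class and its fields are assumptions matching the column names above):

    import spark.implicits._

    case class Event(
      service_id: String,
      client_create_timestamp: Long,
      client_id: String,
      client_send_timestamp: Long)

    // Keep the whole row with the earliest client_send_timestamp per group.
    events.as[Event]
     .groupByKey(e => (e.service_id, e.client_create_timestamp, e.client_id))
     .mapGroups((_, group) => group.minBy(_.client_send_timestamp))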
