How to group by multiple keys in spark?

Asked by 既然无缘 on 2021-01-02 19:00

I have a bunch of tuples in the form of composite keys and values. For example,

tfile.collect() = [(('id1','pd1','t1'), 5.0),
     (('id2','p ...


        
2 Answers

    谎友^ (OP) · 2021-01-02 19:45

    In my map step I re-key each record as ((id1, t1), [(p1, 5.0), (p2, 6.0), ...]) and so on, then reduce by key to concatenate those lists. Finally, map_group creates an array for [p1, p2, ...] and fills in each value at its respective position.

    import numpy as np

    def map_group(pgroup):
        # pgroup is a (key, value_list) pair produced by reduceByKey below.
        x = np.zeros(19)
        x[0] = 1
        value_list = pgroup[1]
        for val in value_list:
            # val is a (p, value) pair; the numeric prefix of p picks the slot.
            fno = val[0].split('.')[0]
            x[int(fno) - 5] = val[1]
        return x

    tgbr = tfile.map(lambda d: ((d[0][0], d[0][2]), [(d[0][1], d[1])])) \
                .reduceByKey(lambda p, q: p + q) \
                .map(lambda d: (d[0], map_group(d)))
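
    To try this end to end, here is a minimal sketch (not from the original answer). It reuses map_group as defined above; the local SparkContext, the sample records, and the assumption that the middle key field looks like "<field-no>.<name>" (which is what map_group's split('.') and -5 offset seem to expect) are all illustrative:

    from pyspark import SparkContext

    sc = SparkContext("local", "map-group-demo")

    # Hypothetical records in the ((id, p, t), value) layout described above.
    tfile = sc.parallelize([
        (('id1', '5.a', 't1'), 5.0),
        (('id1', '7.b', 't1'), 6.0),
        (('id2', '6.a', 't2'), 3.0),
    ])

    # Re-key by (id, t), concatenate the (p, value) pairs, then fill the array.
    tgbr = tfile.map(lambda d: ((d[0][0], d[0][2]), [(d[0][1], d[1])])) \
                .reduceByKey(lambda p, q: p + q) \
                .map(lambda d: (d[0], map_group(d)))

    print(tgbr.collect())
    # e.g. ('id1', 't1') maps to a length-19 array with 5.0 at index 0 and 6.0 at index 2.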
    

    This does feel like a computationally expensive solution, but it works for now.
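
    Most of that cost comes from reduceByKey(lambda p, q: p + q), which builds a new Python list on every merge. If the goal is just to collect all (p, value) pairs under each composite (id, t) key, a common alternative (not from the original answer, sketched here against the same tfile RDD) is groupByKey:

    # Group every (p, value) pair under its composite (id, t) key in one step.
    grouped = tfile.map(lambda d: ((d[0][0], d[0][2]), (d[0][1], d[1]))) \
                   .groupByKey() \
                   .mapValues(list)

    # grouped.collect() -> e.g. [(('id1', 't1'), [('5.a', 5.0), ('7.b', 6.0)]), ...]

    groupByKey shuffles every value once instead of repeatedly concatenating lists, which is usually cheaper when the per-key result has to contain all of the values anyway.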
