pyspark - Grouping and calculating data

星月不相逢 2020-12-22 06:23

I have the following csv file.

Index,Arrival_Time,Creation_Time,x,y,z,User,Model,Device,gt
0,1424696633908,1424696631913248572,-5.958191,0.6880646,8.135345,a
2 Answers
  •  抹茶落季
    2020-12-22 07:11

    You'll have to use groupByKey to get the median. While generally not preferred for performance reasons, finding the median of a list of numbers cannot easily be parallelized: computing it requires the entire list of values at once. groupByKey is the aggregation method to use when you need to process all the values for a key at the same time.
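
    A minimal sketch of that approach. The `median` helper is plain Python; the Spark pipeline is shown in comments and assumes a `SparkContext` named `sc`, plus an assumed example of grouping by the `gt` label and taking the median of the `x` column from the CSV header above — the question does not say which columns to use.

    ```python
    def median(values):
        """Exact median of an iterable of numbers (materializes the whole list,
        which is exactly why groupByKey is needed on the Spark side)."""
        vals = sorted(values)
        n = len(vals)
        mid = n // 2
        if n % 2:
            return vals[mid]
        return (vals[mid - 1] + vals[mid]) / 2.0

    # Hypothetical RDD pipeline (assumes `sc` is a live SparkContext):
    #   lines  = sc.textFile("data.csv")
    #   header = lines.first()
    #   rows   = lines.filter(lambda l: l != header).map(lambda l: l.split(","))
    #   pairs  = rows.map(lambda r: (r[9], float(r[3])))   # (gt, x) -- assumed columns
    #   result = pairs.groupByKey().mapValues(lambda xs: median(list(xs)))
    #   result.collect()
    ```
    
    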

    Also, as mentioned in the comments, this task would be easier using Spark DataFrames.
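
    The DataFrame route might look like the sketch below. `percentile_approx` (available as a DataFrame function since Spark 3.1) gives an approximate median; the `spark` session, the file path, and the choice of grouping by `gt` over `x` are all assumptions for illustration, and the pyspark import is deferred so the file parses without Spark installed.

    ```python
    def median_per_label(spark, csv_path):
        """Approximate median of `x` per `gt` label using the DataFrame API.

        `spark` is assumed to be an active SparkSession; `csv_path` points at
        the CSV shown in the question.
        """
        from pyspark.sql import functions as F  # deferred: only needed when Spark runs

        df = spark.read.csv(csv_path, header=True, inferSchema=True)
        return df.groupBy("gt").agg(
            F.percentile_approx("x", 0.5).alias("median_x")
        )
    ```
    
    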
