I have the following csv file.
Index,Arrival_Time,Creation_Time,x,y,z,User,Model,Device,gt
0,1424696633908,1424696631913248572,-5.958191,0.6880646,8.135345,a
You'll have to use groupByKey to get the median. Although it is generally not preferred for performance reasons, finding the exact median of a list of numbers cannot be parallelized easily: the computation needs the entire list of values. groupByKey is the aggregation method to use when you need to process all the values for a key at the same time.
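As a rough sketch of that approach in PySpark: the question does not say which column is the key or which one needs the median, so this assumes the key is User (column 6) and the value is x (column 3). The helper names parse_line and median_by_key are made up for illustration.

```python
from statistics import median


def parse_line(line):
    """Turn one CSV row into a (key, value) pair.

    Assumption: column order matches the header in the question, the key is
    User (index 6) and the value is x (index 3). Adjust to your needs.
    """
    fields = line.split(",")
    return fields[6], float(fields[3])


def median_by_key(pairs_rdd):
    """Return an RDD of (key, median of all values for that key)."""
    # groupByKey collects every value of a key into one iterable, so the
    # median function sees the complete list for that key.
    return pairs_rdd.groupByKey().mapValues(median)


# Usage, assuming an existing SparkContext `sc` and a hypothetical path:
#   lines = sc.textFile("data.csv")
#   rows = lines.filter(lambda l: not l.startswith("Index,"))  # drop header
#   result = median_by_key(rows.map(parse_line)).collect()
```

Keep in mind that all values for one key end up in memory on a single executor, so this works best when no key has an extremely large number of rows.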
Also, as mentioned in the comments, this task would be easier using Spark DataFrames.
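For comparison, here is a minimal DataFrame sketch. It uses Spark SQL's percentile_approx, which returns an approximate median rather than the exact one, so it is not a drop-in replacement for the groupByKey version. The view name readings, the helper names, and the choice of User and x as key and value columns are assumptions for illustration.

```python
def median_query(view, key_col, value_col):
    """Build a Spark SQL query for the approximate median per key."""
    return (
        f"SELECT {key_col}, percentile_approx({value_col}, 0.5) "
        f"AS median_{value_col} FROM {view} GROUP BY {key_col}"
    )


def median_by_key_df(spark, df, key_col="User", value_col="x"):
    """Run the query against a DataFrame using an existing SparkSession."""
    # Register the DataFrame under a (hypothetical) view name so SQL can see it.
    df.createOrReplaceTempView("readings")
    return spark.sql(median_query("readings", key_col, value_col))


# Usage, assuming an existing SparkSession `spark` and a hypothetical path:
#   df = spark.read.csv("data.csv", header=True, inferSchema=True)
#   median_by_key_df(spark, df).show()
```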