I have the following csv file.
Index,Arrival_Time,Creation_Time,x,y,z,User,Model,Device,gt
0,1424696633908,1424696631913248572,-5.958191,0.6880646,8.135345,a
If your csv data is in a file as shown in the question, you can use sqlContext to read it as a dataframe and cast the columns to the appropriate types:
df = sqlContext.read.format("com.databricks.spark.csv").option("header", True).load("path to csv file")
import pyspark.sql.functions as F
import pyspark.sql.types as T
df = df.select(F.col('User'), F.col('Model'), F.col('gt'), F.col('x').cast('float'), F.col('y').cast('float'), F.col('z').cast('float'))
I have selected only the primary keys and the necessary columns, which should give you
+----+------+-----+----------+---------+--------+
|User|Model |gt   |x         |y        |z       |
+----+------+-----+----------+---------+--------+
|a   |nexus4|stand|-5.958191 |0.6880646|8.135345|
|a   |nexus4|stand|-5.95224  |0.6702118|8.136536|
|a   |nexus4|stand|-5.9950867|0.6535492|8.204376|
|a   |nexus4|stand|-5.9427185|0.6761627|8.128204|
+----+------+-----+----------+---------+--------+
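The cast matters because, as far as I know, reading the csv with only the header option set (no schema inference) loads every column as a string. Here is a small plain Python sketch, using the x values from the sample above as text, of why string values would mislead the later max and min; the variable names are only for illustration.

```python
# x values exactly as they appear as text in the csv sample above
raw_x = ["-5.958191", "-5.95224", "-5.9950867", "-5.9427185"]

# Strings compare character by character, so the "largest" string
# is not the numerically largest value
max_as_text = max(raw_x)

# After converting to float, the comparison is numeric
max_as_number = max(float(v) for v in raw_x)

print(max_as_text, max_as_number)
```

On these values the text comparison picks -5.9950867, which is actually the numeric minimum, while the float comparison picks -5.9427185.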
All of your requirements (median, deviation, max and min) depend on the lists of x, y and z values obtained when grouping by the primary keys User, Model and gt.
So you need groupBy with the built-in collect_list function, plus a udf that calculates all of your requirements from each list. Note that the udf below returns the arithmetic mean in its first position, which the final select labels median_*. The last step is to split the results into separate columns, as shown below.
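from math import sqrt
def calculation(array):
    # array holds the values collected for one (User, Model, gt) group
    num_items = len(array)
    mean = sum(array) / num_items
    differences = [x - mean for x in array]
    sq_differences = [d ** 2 for d in differences]
    ssd = sum(sq_differences)
    # sample variance (divides by n - 1), so each group needs at least two values
    variance = ssd / (num_items - 1)
    sd = sqrt(variance)
    # order matters: the select below picks these out by index 0..3
    return [mean, sd, max(array), min(array)]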
calcUdf = F.udf(calculation, T.ArrayType(T.FloatType()))
df.groupBy('User', 'Model', 'gt')\
    .agg(calcUdf(F.collect_list(F.col('x'))).alias('x'),
         calcUdf(F.collect_list(F.col('y'))).alias('y'),
         calcUdf(F.collect_list(F.col('z'))).alias('z'))\
    .select(F.col('User'), F.col('Model'), F.col('gt'),
            F.col('x')[0].alias('median_x'), F.col('y')[0].alias('median_y'), F.col('z')[0].alias('median_z'),
            F.col('x')[1].alias('deviation_x'), F.col('y')[1].alias('deviation_y'), F.col('z')[1].alias('deviation_z'),
            F.col('x')[2].alias('max_x'), F.col('y')[2].alias('max_y'), F.col('z')[2].alias('max_z'),
            F.col('x')[3].alias('min_x'), F.col('y')[3].alias('min_y'), F.col('z')[3].alias('min_z'))\
    .show(truncate=False)
So finally you should have
+----+------+-----+---------+---------+--------+-----------+-----------+-----------+----------+---------+--------+----------+---------+--------+
|User|Model |gt   |median_x |median_y |median_z|deviation_x|deviation_y|deviation_z|max_x     |max_y    |max_z   |min_x     |min_y    |min_z   |
+----+------+-----+---------+---------+--------+-----------+-----------+-----------+----------+---------+--------+----------+---------+--------+
|a   |nexus4|stand|-5.962059|0.6719971|8.151115|0.022922019|0.01436464 |0.0356973  |-5.9427185|0.6880646|8.204376|-5.9950867|0.6535492|8.128204|
+----+------+-----+---------+---------+--------+-----------+-----------+-----------+----------+---------+--------+----------+---------+--------+
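If you want to check these numbers without Spark, or you need a true median rather than the mean, here is a minimal plain Python sketch. It assumes the four sample x values from the first table; the helper name summarize and the dictionary keys are made up for illustration and are not part of the Spark code above.

```python
from collections import defaultdict
from statistics import mean, median, stdev

# Sample rows taken from the first table: (User, Model, gt, x)
rows = [
    ("a", "nexus4", "stand", -5.958191),
    ("a", "nexus4", "stand", -5.95224),
    ("a", "nexus4", "stand", -5.9950867),
    ("a", "nexus4", "stand", -5.9427185),
]

# Rough equivalent of groupBy(...).agg(collect_list('x')):
# one list of x values per (User, Model, gt) key
groups = defaultdict(list)
for user, model, gt, x in rows:
    groups[(user, model, gt)].append(x)

def summarize(values):
    """Hypothetical helper: mean, true median, sample deviation, max, min."""
    return {
        "mean": mean(values),
        "median": median(values),
        "deviation": stdev(values),  # sample standard deviation (n - 1)
        "max": max(values),
        "min": min(values),
    }

stats = {key: summarize(values) for key, values in groups.items()}
for key, result in stats.items():
    print(key, result)
```

For this sample the mean of x comes out at about -5.96206, matching the median_x column above, while the true median is about -5.95522, so the two are not interchangeable here.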
I hope the answer is helpful.