PySpark - TypeError: 'float' object is not subscriptable when calculating mean using reduceByKey

Asked by 误落风尘 on 2020-12-04 00:52

my \"asdasd.csv\" file has the following structure.

 Index,Arrival_Time,Creation_Time,x,y,z,User,Model,Device,gt
 0,1424696633908,1424696631913248572,-5.95819

1 Answer

Answered 2020-12-04 01:46

    Here's how reduceByKey works. I'll use your example for illustration, i.e. the following data that you pass to reduceByKey:

    #                                 x           y        z
    [(('a', 'nexus4', 'stand'), ((-5.958191, 0.6880646, 8.135345), 1))]
    #           part A (key)               part B (value)       counter
    

    Let me go through it step by step.

    After performing the following mapValues function:

    rdd_ori.mapValues(lambda x: (x,1))
    

    the RDD data will look like this:

    ((u'a', u'nexus4', u'stand'), ((-5.9427185, 0.6761626999999999, 8.128204), 1))
    ((u'a', u'nexus4', u'stand'), ((-5.958191, 0.6880646, 8.135345), 1))
    ((u'a', u'nexus4', u'stand'), ((-5.95224, 0.6702118, 8.136536), 1))
    ((u'a', u'nexus4', u'stand'), ((-5.9950867, 0.6535491999999999, 8.204376), 1))
    

    So when reduceByKey is invoked as

    .reduceByKey(lambda a, b: (a[0][0] + b[0][0], a[0][1] + b[0][1], a[0][2] + b[0][2], a[1] + b[1]))
    

    all the rows with the same key are grouped and their values are passed to the lambda function of reduceByKey.

    Since in your case all the keys are the same, the values are passed to the a and b variables over successive iterations.

    In the first iteration, a is ((-5.9427185, 0.6761626999999999, 8.128204), 1) and b is ((-5.958191, 0.6880646, 8.135345), 1), so the calculation (a[0][0] + b[0][0], a[0][1] + b[0][1], a[0][2] + b[0][2], a[1] + b[1]) is valid and passes.

    In the second iteration, a is the output of that calculation, (-11.9009095, 1.3642273, 16.263549, 2), which is a flat tuple of four floats, and b is the next value, ((-5.95224, 0.6702118, 8.136536), 1).

    If you look at the format of a now, there is no a[0][0] in it anymore; a[0] is a plain float, so you can only index a[0], a[1], and so on. That's the issue, and that's exactly what the error message is telling you:

    TypeError: 'float' object is not subscriptable
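
    You can reproduce this outside Spark with a minimal plain-Python sketch (hypothetical, using the four sample values above); functools.reduce applies the lambda pairwise, which is essentially what reduceByKey does within a single key:

    from functools import reduce

    values = [
        ((-5.9427185, 0.6761626999999999, 8.128204), 1),
        ((-5.958191, 0.6880646, 8.135345), 1),
        ((-5.95224, 0.6702118, 8.136536), 1),
        ((-5.9950867, 0.6535491999999999, 8.204376), 1),
    ]

    # The first call returns a flat 4-tuple of floats, so on the second
    # call a[0] is a float and a[0][0] raises the TypeError above.
    reduce(lambda a, b: (a[0][0] + b[0][0], a[0][1] + b[0][1],
                         a[0][2] + b[0][2], a[1] + b[1]), values)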

    The solution is to keep the data in a shape where a[0][0] stays valid, which you can do by making reduceByKey return the same nested format it receives:

    .reduceByKey(lambda a, b: ((a[0][0] + b[0][0], a[0][1] + b[0][1], a[0][2] + b[0][2]), a[1] + b[1]))
    
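    Run against the same sample values, the fixed lambda keeps every intermediate result in the ((x, y, z), count) shape, so the nested indexing stays valid on every iteration:

    # Continuing the plain-Python sketch from above with the fixed lambda:
    # the three sums stay nested in their own tuple, the counter alongside.
    reduce(lambda a, b: ((a[0][0] + b[0][0], a[0][1] + b[0][1],
                          a[0][2] + b[0][2]), a[1] + b[1]), values)
    # -> ((-23.848236199999995, 2.6879882999999998, 32.604461), 4)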

    But that change would break your last mapValues function:

    .mapValues(lambda a : (a[0]/a[3], a[1]/a[3],a[2]/a[3]))
    

    as your values, i.e. a in the lambda function, now look like ((-23.848236199999995, 2.6879882999999998, 32.604461), 4): a[0] is the tuple of sums, a[1] is 4, and there is no a[2] or a[3], so you will encounter

    IndexError: tuple index out of range

    So your last mapValues should be

    .mapValues(lambda a : (a[0][0]/a[1], a[0][1]/a[1],a[0][2]/a[1]))
    
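    You can verify that by hand against the reduced value (plain Python again, using the totals computed above):

    # The reduced value has the shape ((sum_x, sum_y, sum_z), count)
    a = ((-23.848236199999995, 2.6879882999999998, 32.604461), 4)
    (a[0][0] / a[1], a[0][1] / a[1], a[0][2] / a[1])
    # -> approximately (-5.96205905, 0.671997075, 8.15111525)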

    So overall, the following code should work for you. Note that the header row is filtered out first; without that, float() would fail on the column names:

    # key: (User, Model, gt), value: (x, y, z); the filter skips the header row
    rdd_ori = sc.textFile("asdasd.csv") \
        .filter(lambda line: not line.startswith("Index")) \
        .map(lambda line: line.split(",")) \
        .map(lambda c: ((c[6], c[7], c[9]),
                        (float(c[3]), float(c[4]), float(c[5]))))

    # sum the three axes and the counter per key, then divide by the counter
    meanRDD = rdd_ori.mapValues(lambda x: (x, 1)) \
        .reduceByKey(lambda a, b: ((a[0][0] + b[0][0], a[0][1] + b[0][1],
                                    a[0][2] + b[0][2]), a[1] + b[1])) \
        .mapValues(lambda a: (a[0][0] / a[1], a[0][1] / a[1], a[0][2] / a[1]))
    
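    As a quick sanity check (hypothetical session; it assumes your existing SparkContext sc and the file above), you can collect the result on the driver:

    # With data like the four sample values above, this prints one key
    # with its per-axis means, matching the hand calculation earlier.
    for key, mean in meanRDD.collect():
        print(key, mean)
    # e.g. ('a', 'nexus4', 'stand') (-5.96205905, 0.671997075, 8.15111525)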

    I hope I have explained it well enough.
