Store aggregate value of a PySpark dataframe column into a variable

故里飘歌  2021-01-13 09:37

I am working with PySpark dataframes here. "test1" is my PySpark dataframe and event_date is a TimestampType column. So when I try to get a distinct count of event_date, the result comes back as a dataframe rather than a plain value that I can store in a variable.
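
A minimal sketch of that situation (only test1 and event_date come from the question; the countDistinct call is my guess at the setup):

from pyspark.sql import functions as F

# aggregating yields a one-row dataframe, not a plain Python number
distinct_dates = test1.select(F.countDistinct('event_date'))
distinct_dates.show()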

6 Answers
  •  时光取名叫无心
    2021-01-13 10:11

    You cannot directly access the values in a dataframe; collecting a dataframe gives you Row objects. A Row, however, gives you the option to convert it into a Python dictionary. Go through the following example, where I calculate the average word count:

    wordsDF = sqlContext.createDataFrame([('cat',), ('elephant',), ('rat',), ('rat',), ('cat', )], ['word'])
    wordCountsDF = wordsDF.groupBy(wordsDF['word']).count()
    wordCountsDF.show()
    

    Here are the word count results:

    +--------+-----+
    |    word|count|
    +--------+-----+
    |     cat|    2|
    |     rat|    2|
    |elephant|    1|
    +--------+-----+
    

    Now I calculate the average of the count column and apply the collect() operation to it. Remember that collect() returns a list; here the list contains only one element.

    averageCount = wordCountsDF.groupBy().avg('count').collect()
    

    The result looks something like this:

    [Row(avg(count)=1.6666666666666667)]
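
    As a side note (not part of the original answer), a Row can also be indexed directly, by position or by the generated column name, so the same value is reachable without building a dictionary:

    averageCount[0][0]                # by position
    averageCount[0]['avg(count)']     # by the generated column name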
    

    You cannot access the average value directly through a Python variable yet. You have to convert the Row into a dictionary to access it:

    results = {}
    for i in averageCount:
        results.update(i.asDict())   # Row.asDict() converts the Row into a plain dict
    print(results)
    

    Our final result looks like this:

    {'avg(count)': 1.6666666666666667}
    

    Finally, you can access the average value with:

    print(results['avg(count)'])
    
    1.6666666666666667
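
    For completeness, a more compact route (a sketch, not from the original answer): first() returns the single Row directly, which you can index by position:

    averageValue = wordCountsDF.groupBy().avg('count').first()[0]
    print(averageValue)
    
    1.6666666666666667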
    
