I am working with PySpark dataframes here. "test1" is my PySpark dataframe and event_date is a TimestampType. So when I try to get a distinct count of event_date, the resu
You cannot directly access the values in a DataFrame; when you collect it, you get back Row objects. Each Row, however, gives you the option to convert it into a Python dictionary. Go through the following example, where I calculate the average word count:
wordsDF = sqlContext.createDataFrame([('cat',), ('elephant',), ('rat',), ('rat',), ('cat', )], ['word'])
wordCountsDF = wordsDF.groupBy(wordsDF['word']).count()
wordCountsDF.show()
Here are the word count results:
+--------+-----+
| word|count|
+--------+-----+
| cat| 2|
| rat| 2|
|elephant| 1|
+--------+-----+
Now I calculate the average of the count column and apply the collect() operation on it. Remember that collect() returns a list; here the list contains only one element.
averageCount = wordCountsDF.groupBy().avg('count').collect()
The result looks something like this:
[Row(avg(count)=1.6666666666666667)]
You cannot directly access the average value through a plain Python variable; you have to convert the Row into a dictionary first.
results = {}
for i in averageCount:
    results.update(i.asDict())  # Row.asDict() turns the Row into a plain dict
print(results)
The final result looks like this:
{'avg(count)': 1.6666666666666667}
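As an aside, since each Row already has an asDict() method, the same dictionary can be built without the explicit loop; a minimal sketch using the same wordCountsDF:
# first() returns the single aggregated Row directly
results = wordCountsDF.groupBy().avg('count').first().asDict()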
Finally, you can access the average value using:
print(results['avg(count)'])
1.66666666667
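Coming back to the question, the same pattern applies to the test1 DataFrame; a minimal sketch, assuming the distinct count of event_date is computed with countDistinct (the alias name here is arbitrary):
from pyspark.sql.functions import countDistinct

# aggregate to a one-row DataFrame, collect it, and convert the Row to a dict
distinctCountDF = test1.agg(countDistinct('event_date').alias('distinct_event_dates'))
row = distinctCountDF.collect()[0]  # collect() returns a list with a single Row
print(row.asDict()['distinct_event_dates'])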