PySpark: How to count the number of elements in each equal-width interval of an RDD

↘锁芯ラ submitted on 2021-01-29 12:33:11

Question


I have an RDD[Double]. I want to divide its value range into k equal-width intervals and then count how many elements of the RDD fall into each interval.

For example, suppose the RDD is [0,1,2,3,4,5,6,6,7,7,10] and I want to divide it into 10 equal intervals. The intervals are then [0,1), [1,2), [2,3), [3,4), [4,5), [5,6), [6,7), [7,8), [8,9), [9,10].

As you can see, each element of the RDD falls into exactly one of the intervals. I then want to count the elements in each interval. Here, there is one element in each of [0,1), [1,2), [2,3), [3,4), [4,5) and [5,6); [6,7) and [7,8) each have two elements; [8,9) has none; and [9,10] has one element.

Finally, I expect an array like array([1,1,1,1,1,1,2,2,0,1]).
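
To make the target concrete, the boundaries and counts in the example can be reproduced in plain Python. This is only a sketch of the specification on a local list, not a Spark solution; the helper name `equal_width_edges` is made up for illustration.

```python
def equal_width_edges(lo, hi, k):
    """Return the k + 1 boundaries of k equal-width intervals covering [lo, hi]."""
    width = (hi - lo) / k
    return [lo + i * width for i in range(k + 1)]

data = [0, 1, 2, 3, 4, 5, 6, 6, 7, 7, 10]
edges = equal_width_edges(min(data), max(data), 10)

counts = []
for i in range(len(edges) - 1):
    left, right = edges[i], edges[i + 1]
    is_last = i == len(edges) - 2
    # Every interval is [left, right) except the last one, which is [left, right].
    counts.append(sum(1 for x in data if left <= x < right or (is_last and x == right)))

print(edges)
print(counts)
```

The only special case is the last interval, which has to be closed on the right so that the maximum value is counted.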


Answer 1:


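Try this. I have assumed that the lower bound of each range is inclusive and the upper bound is exclusive; please confirm. For example, for the range [0, 1) the condition is element >= 0 and element < 1, so 0 is counted and 1 is not. Under this assumption the value 10 falls outside [9, 10), which is why the last count below is 0 rather than the 1 expected in the question.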

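# Assumed here: (lower, upper) bounds of the 10 intervals from the question.
array_range = [(i, i + 1) for i in range(10)]
# collect() brings the whole RDD to the driver, so this only suits small data.
values = rdd.collect()
countElementsWithinRange = []
for lower, upper in array_range:
  counter = 0
  for element in values:
    # Lower bound inclusive, upper bound exclusive.
    if lower <= element < upper:
      counter += 1
  countElementsWithinRange.append(counter)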

print(rdd.collect())
# [0, 1, 2, 3, 4, 5, 6, 6, 7, 7, 10]
print(countElementsWithinRange)
# [1, 1, 1, 1, 1, 1, 2, 2, 0, 0]
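
Because the loop above works on `rdd.collect()`, all counting happens on the driver. For a larger RDD, one alternative is to map each value to the index of its interval and count per index on the cluster. The following is a minimal sketch, not part of the original answer. It assumes the last interval should be closed, as in the question, and that `rdd` is a non-empty numeric RDD whose minimum is smaller than its maximum. The names `bucket_index` and `count_intervals` are hypothetical helpers.

```python
def bucket_index(x, lo, hi, k):
    """Index of the equal-width interval of [lo, hi] that contains x."""
    width = (hi - lo) / k
    # Clamp so that x == hi lands in the last interval, making it closed.
    return min(int((x - lo) / width), k - 1)

def count_intervals(rdd, k):
    """Count the elements of a numeric RDD in each of k equal-width intervals."""
    lo, hi = rdd.min(), rdd.max()
    # countByValue() returns {interval index: count} to the driver; it has
    # at most k entries, so it stays small however large the RDD is.
    per_bucket = rdd.map(lambda x: bucket_index(x, lo, hi, k)).countByValue()
    return [per_bucket.get(i, 0) for i in range(k)]
```

With the RDD from the question, `count_intervals(rdd, 10)` should give one count per interval, including the element 10 in the last one. Values that sit exactly on a boundary can be affected by floating-point rounding when the interval width is not exactly representable. PySpark's RDD API also provides a built-in `histogram(buckets)` method that, given an integer, returns the evenly spaced boundaries together with the counts, so `rdd.histogram(10)` may already cover this use case.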


Source: https://stackoverflow.com/questions/62675710/pyspark-how-to-count-the-number-of-each-equal-distance-interval-in-rdd
