How to compute percentiles in Apache Spark

后端 未结 10 538
遥遥无期
遥遥无期 2020-12-04 22:08

I have an rdd of integers (i.e. RDD[Int]) and what I would like to do is to compute the following ten percentiles: [0th, 10th, 20th, ..., 90th, 100th]

10条回答
  •  萌比男神i
    2020-12-04 22:57

    If you don't mind converting your RDD to a DataFrame, and using a Hive UDAF, you can use percentile. Assuming you've loaded HiveContext hiveContext into scope:

    hiveContext.sql("SELECT percentile(x, array(0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9)) FROM yourDataFrame")

    I found out about this Hive UDAF in this answer.

提交回复
热议问题