How to get a histogram of all columns in a large CSV / RDD[Array[Double]] using Apache Spark with Scala?
Question

I am trying to calculate a histogram of every column of a CSV file using Spark with Scala. I found that DoubleRDDFunctions supports histogram, so I wrote the following code to get the histogram of all columns:

1. Get the column count.
2. Create an RDD[Double] for each column and calculate the histogram of each RDD using DoubleRDDFunctions.

    var columnIndexArray = Array.tabulate(rdd.first().length) (_ * 1)
    val histogramData = columnIndexArray.map(columns => {
      rdd.map(lines => lines(columns)).histogram(6)
    })

Is it a good