get min and max from a specific column scala spark dataframe

Submitted by 北城以北 on 2020-03-17 04:33:10

Question


I would like to access the min and max of a specific column from my dataframe, but I don't have the header of the column, just its number. How should I do this using Scala?

Maybe something like this:

val q = nextInt(ncol) //we pick a random value for a column number
col = df(q)
val minimum = col.min()

Sorry if this sounds like a silly question, but I couldn't find any info on SO about it :/


Answer 1:


How about getting the column name from the metadata:

import org.apache.spark.sql.functions.{min, max}

val selectedColumnName = df.columns(q) // pull the (q + 1)-th column name from the columns array
df.agg(min(selectedColumnName), max(selectedColumnName))



Answer 2:


You can use pattern matching when assigning the variables:

import org.apache.spark.sql.functions.{min, max}
import org.apache.spark.sql.Row

val Row(minValue: Double, maxValue: Double) = df.agg(min(q), max(q)).head

Here q is either a Column or the name of a column (String). This assumes your data type is Double.
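A minimal sketch of this approach, assuming a local session and a toy single-column DataFrame (the column name "A" and the data are made up for illustration):

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{min, max}

// Assumed setup for illustration; adapt to your own session and data.
val spark = SparkSession.builder().master("local[*]").appName("minmax").getOrCreate()
import spark.implicits._

val df = Seq(1.5, 3.0, 0.5).toDF("A")

// agg returns a one-row DataFrame; head gives that Row, which we
// destructure directly into two typed variables via pattern matching.
val Row(minValue: Double, maxValue: Double) = df.agg(min("A"), max("A")).head
// minValue = 0.5, maxValue = 3.0
```

Note that the type annotations in the pattern (`: Double`) must match the column's actual type, or the match will throw at runtime.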




Answer 3:


You can use the column number to extract the column name first (by indexing df.columns), then aggregate using that column name:

val df = Seq((2.0, 2.1), (1.2, 1.4)).toDF("A", "B")
// df: org.apache.spark.sql.DataFrame = [A: double, B: double]

df.agg(max(df(df.columns(1))), min(df(df.columns(1)))).show
/*
+------+------+
|max(B)|min(B)|
+------+------+
|   2.1|   1.4|
+------+------+
*/



Answer 4:


Here is a direct way to get the min and max from a dataframe with column names:

val df = Seq((1, 2), (3, 4), (5, 6)).toDF("A", "B")

df.show()
/*
+---+---+
|  A|  B|
+---+---+
|  1|  2|
|  3|  4|
|  5|  6|
+---+---+
*/

df.agg(min("A"), max("A")).show()
/*
+------+------+
|min(A)|max(A)|
+------+------+
|     1|     5|
+------+------+
*/

If you want to get the min and max values as separate variables, take the first Row of the agg() result above with head() and use Row.getInt(index) to read the Row's column values.

val min_max = df.agg(min("A"), max("A")).head()
// min_max: org.apache.spark.sql.Row = [1,5]

val col_min = min_max.getInt(0)
// col_min: Int = 1

val col_max = min_max.getInt(1)
// col_max: Int = 5



Answer 5:


Using the Spark functions min and max, you can find the min or max value of any column in a DataFrame.

import org.apache.spark.sql.functions.{min, max}

val df = Seq((5, 2), (10, 1)).toDF("A", "B")

df.agg(max($"A"), min($"B")).show()
/*
+------+------+
|max(A)|min(B)|
+------+------+
|    10|     1|
+------+------+
*/



Answer 6:


In Java, we have to explicitly reference org.apache.spark.sql.functions, which provides the implementations of min and max:

datasetFreq.agg(functions.min("Frequency"), functions.max("Frequency")).show();


Source: https://stackoverflow.com/questions/43232363/get-min-and-max-from-a-specific-column-scala-spark-dataframe
