Calculate average using Spark Scala

爱一瞬间的悲伤 2021-01-28 09:29

How do I calculate the average salary per location in Spark Scala using the two data sets below?

File1.csv (column 4 is the salary)

Ram, 30, Engineer, 40000  
B         


        
4 Answers
  •  野性不改
    2021-01-28 10:15

    You can read the CSV files as DataFrames, join them on the name column, and group by location to get the averages:

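    import org.apache.spark.sql.functions._

    // Read both header-less files and name the columns explicitly.
    // Without a schema, every column is read as a string.
    val df1 = spark.read.csv("/path/to/file1.csv").toDF(
      "name", "age", "title", "salary"
    )

    val df2 = spark.read.csv("/path/to/file2.csv").toDF(
      "name", "location"
    )

    // Join on name, then average the salary per location.
    // Spark implicitly casts the string salary to a double for avg.
    val dfAverage = df1.join(df2, Seq("name")).
      groupBy(df2("location")).agg(avg(df1("salary")).as("average")).
      select("location", "average")

    dfAverage.show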
    +-----------+-------+
    |   location|average|
    +-----------+-------+
    |Bangalore  |40000.0|
    |  Chennai  |50000.0|
    +-----------+-------+
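
    The uneven padding in the location column above suggests the CSV values carry stray spaces. The following is a minimal, self-contained sketch of the same join and average that trims the strings and casts the salary explicitly. It assumes in-memory stand-ins for the two files (the rows are the sample rows listed in this answer's update, with spaces added for illustration), and the names `AverageSalarySketch` and `averageSalaryByLocation` are hypothetical helpers, not part of Spark.

    ```scala
    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.{avg, col, trim}

    object AverageSalarySketch {
      // Hypothetical helper: expects `people` with string columns (name, salary)
      // and `places` with string columns (name, location).
      def averageSalaryByLocation(people: DataFrame, places: DataFrame): DataFrame = {
        // Trim stray spaces and make the numeric cast explicit.
        val cleanPeople = people.select(
          trim(col("name")).as("name"),
          trim(col("salary")).cast("double").as("salary")
        )
        val cleanPlaces = places.select(
          trim(col("name")).as("name"),
          trim(col("location")).as("location")
        )
        cleanPeople.join(cleanPlaces, Seq("name"))
          .groupBy("location")
          .agg(avg("salary").as("average"))
          .orderBy("location")
      }

      def main(args: Array[String]): Unit = {
        // Local session for experimentation; spark-shell already provides `spark`.
        val spark = SparkSession.builder()
          .appName("average-salary-sketch")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        // Assumed stand-ins for file1.csv and file2.csv (spaces added on purpose).
        val people = Seq(
          ("Ram", " 40000"), ("Bala", " 30000"),
          ("Hari", " 50000"), ("Siva", " 60000")
        ).toDF("name", "salary")
        val places = Seq(
          ("Hari", " Bangalore  "), ("Ram", " Chennai  "),
          ("Bala", " Bangalore  "), ("Siva", " Chennai  ")
        ).toDF("name", "location")

        averageSalaryByLocation(people, places).show()
        spark.stop()
      }
    }
    ```

    With these rows the averages are 40000.0 for Bangalore and 50000.0 for Chennai, the same values as in the table above, but the location strings no longer carry padding.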
    

    [UPDATE] To also average a dimensions column (formatted as length*width), split it into its two parts and average each part per location:

    // file1.csv:
    Ram,30,Engineer,40000,600*200
    Bala,27,Doctor,30000,800*400
    Hari,33,Engineer,50000,700*300
    Siva,35,Doctor,60000,600*200
    
    // file2.csv
    Hari,Bangalore
    Ram,Chennai
    Bala,Bangalore
    Siva,Chennai
    
    val df1 = spark.read.csv("/path/to/file1.csv").toDF(
      "name", "age", "title", "salary", "dimensions"
    )
    
    val df2 = spark.read.csv("/path/to/file2.csv").toDF(
      "name", "location"
    )
    
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types.IntegerType
    // Needed for the $"..." column syntax; already in scope in spark-shell.
    import spark.implicits._
    
    val dfAverage = df1.join(df2, Seq("name")).
      groupBy(df2("location")).
      agg(
        avg(split(df1("dimensions"), ("\\*")).getItem(0).cast(IntegerType)).as("avg_length"),
        avg(split(df1("dimensions"), ("\\*")).getItem(1).cast(IntegerType)).as("avg_width")
      ).
      select(
        $"location", $"avg_length", $"avg_width",
        concat($"avg_length", lit("*"), $"avg_width").as("avg_dimensions")
      )
    
    dfAverage.show
    +---------+----------+---------+--------------+
    | location|avg_length|avg_width|avg_dimensions|
    +---------+----------+---------+--------------+
    |Bangalore|     750.0|    350.0|   750.0*350.0|
    |  Chennai|     600.0|    200.0|   600.0*200.0|
    +---------+----------+---------+--------------+
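
    As a variation, the split can be computed once and kept as separate typed columns before aggregating. This is only a sketch under assumptions: it uses in-memory stand-ins for the two files (the rows listed above), and `AverageDimensionsSketch` and `averageDimensionsByLocation` are hypothetical names chosen for illustration.

    ```scala
    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.{avg, col, concat, lit, split}

    object AverageDimensionsSketch {
      // Hypothetical helper: expects `people` with string columns (name, dimensions)
      // and `places` with string columns (name, location).
      def averageDimensionsByLocation(people: DataFrame, places: DataFrame): DataFrame = {
        // The split pattern is a regex, so the literal * has to be escaped.
        val parts = split(col("dimensions"), "\\*")
        people
          .withColumn("length", parts.getItem(0).cast("int"))
          .withColumn("width", parts.getItem(1).cast("int"))
          .join(places, Seq("name"))
          .groupBy("location")
          .agg(avg("length").as("avg_length"), avg("width").as("avg_width"))
          // Rebuild a length*width string from the two averages.
          .withColumn(
            "avg_dimensions",
            concat(col("avg_length").cast("string"), lit("*"), col("avg_width").cast("string"))
          )
          .orderBy("location")
      }

      def main(args: Array[String]): Unit = {
        // Local session for experimentation; spark-shell already provides `spark`.
        val spark = SparkSession.builder()
          .appName("average-dimensions-sketch")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        // Assumed stand-ins for file1.csv (name and dimensions only) and file2.csv.
        val people = Seq(
          ("Ram", "600*200"), ("Bala", "800*400"),
          ("Hari", "700*300"), ("Siva", "600*200")
        ).toDF("name", "dimensions")
        val places = Seq(
          ("Hari", "Bangalore"), ("Ram", "Chennai"),
          ("Bala", "Bangalore"), ("Siva", "Chennai")
        ).toDF("name", "location")

        averageDimensionsByLocation(people, places).show()
        spark.stop()
      }
    }
    ```

    Keeping `length` and `width` as integer columns also makes it straightforward to add other aggregates later without repeating the split.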
    
