Calculate average using Spark Scala

爱一瞬间的悲伤 2021-01-28 09:29

How do I calculate the average salary per location in Spark Scala with the two data sets below?

File1.csv (column 4 is salary)

Ram, 30, Engineer, 40000  
B         


        
4 answers
  •  庸人自扰
    2021-01-28 10:11

    I would use the DataFrame API; something like this should work:

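    import org.apache.spark.sql.functions._
    // Needed for toDF and $"col"; spark-shell imports it already.
    // In a standalone app, spark is your SparkSession.
    import spark.implicits._

    // split returns an Array, so match on Array(...) rather than Seq(...).
    // Fields are trimmed because the sample rows have a space after each comma.
    val salary = sc.textFile("File1.csv")
                   .map(e => e.split(",").map(_.trim))
                   .map{case Array(name,_,_,salary) => (name,salary.toDouble)}
                   .toDF("name","salary")

    // Assumes File2.csv has two columns: name, location
    val location = sc.textFile("File2.csv")
                     .map(e => e.split(",").map(_.trim))
                     .map{case Array(name,location) => (name,location)}
                     .toDF("name","location")

    salary
      .join(location,Seq("name"))
      .groupBy($"location")
      .agg(
        avg($"salary").as("avg_salary")
      )
      .repartition(1)
      .write.csv("output.csv") // creates a directory named output.csv holding one part file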
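
To check what the join and aggregation compute without starting Spark, here is a minimal plain-Scala sketch of the same logic on in-memory lines. Only the "Ram" salary row comes from the question; the other names, the locations, and the two-column layout of File2.csv (name, location) are assumptions made up for illustration, as are the object and method names.

```scala
// Plain-Scala sketch (no Spark) of "join on name, then average salary per location".
// AvgSalaryPerLocation and its method names are hypothetical, chosen for this example.
object AvgSalaryPerLocation {
  // Parse "name, age, title, salary" lines into (name, salary) pairs.
  // collect silently skips lines that do not have exactly four fields.
  def parseSalaries(lines: Seq[String]): Seq[(String, Double)] =
    lines.map(_.split(",").map(_.trim)).collect {
      case Array(name, _, _, salary) => (name, salary.toDouble)
    }

  // Parse "name, location" lines into a name -> location lookup.
  def parseLocations(lines: Seq[String]): Map[String, String] =
    lines.map(_.split(",").map(_.trim)).collect {
      case Array(name, location) => (name, location)
    }.toMap

  // Inner join on name, then average the salaries within each location.
  def averageByLocation(salaries: Seq[(String, Double)],
                        locations: Map[String, String]): Map[String, Double] =
    salaries
      .flatMap { case (name, salary) => locations.get(name).map(loc => (loc, salary)) }
      .groupBy(_._1)
      .map { case (loc, rows) => loc -> rows.map(_._2).sum / rows.size }

  def main(args: Array[String]): Unit = {
    // Made-up sample data, except the first salary row, which is from the question.
    val file1 = Seq("Ram, 30, Engineer, 40000", "Asha, 28, Analyst, 50000", "Lee, 41, Manager, 60000")
    val file2 = Seq("Ram, Sydney", "Asha, Sydney", "Lee, Perth")
    averageByLocation(parseSalaries(file1), parseLocations(file2)).foreach {
      case (loc, avg) => println(s"$loc,$avg")
    }
  }
}
```

Names missing from the location data drop out, which mirrors the default inner join in the DataFrame version. Unlike the `map{case ...}` in the Spark snippet, `collect` skips malformed lines instead of throwing a `MatchError`, which may or may not be what you want for dirty input.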
    
