Parse CSV as DataFrame/DataSet with Apache Spark and Java

灰色年华 · 2020-12-07 16:54

I am new to Spark, and I want to use group-by & reduce to find the following from a CSV (one line per employee):

  Department, Designation, costToCompany, State
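
For illustration, a few hypothetical rows in that format (the values are made up):

  Sales,Trainee,12000,UP
  Sales,Lead,32000,AP
  Marketing,Associate,18000,TN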


        
4 Answers
  小蘑菇 · 2020-12-07 17:27

    The following might not be entirely correct, but it should give you some idea of how to juggle the data. It's not pretty, the tuples should be replaced with case classes etc., but as a quick example of how to use the Spark API, I hope it's enough :)

    val rawLines = sc.textFile("hdfs://.../*.csv")

    case class Employee(dep: String, des: String, cost: Double, state: String)

    val employees = rawLines
      .map(_.split(","))   // or use a proper CSV parser
      .map(row => Employee(row(0), row(1), row(2).toDouble, row(3)))
    
    // the 1 is the number of employees (which is obviously 1 per line)
    val keyVals = employees.map(em => ((em.dep, em.des, em.state), (1, em.cost)))

    val results = keyVals.reduceByKey { (a, b) =>
      (a._1 + b._1, a._2 + b._2)   // (a.count + b.count, a.cost + b.cost)
    }
    
    // debug output
    results.take(100).foreach(println)
    
    results
      .map( keyval => someThingToFormatAsCsvStringOrWhatever )
      .saveAsTextFile("hdfs://.../results")
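
    A minimal sketch of what that formatting placeholder could look like (the output column order is an assumption):

    // hypothetical formatter: flatten ((dep, des, state), (count, cost)) into one CSV line
    results
      .map { case ((dep, des, state), (count, cost)) =>
        s"$dep,$des,$state,$cost,$count"
      }
      .saveAsTextFile("hdfs://.../results")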
    

    Or you can use SparkSQL:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD   // implicitly converts the case-class RDD

    // case classes can easily be registered as tables
    employees.registerAsTable("employees")

    val results = sqlContext.sql("""select dep, des, state, sum(cost), count(*)
      from employees
      group by dep, des, state""")
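
    For reference, on newer Spark releases (2.x and later) the built-in CSV reader and the DataFrame API do the same job. A minimal sketch, assuming a SparkSession named spark and a header-less file with the four columns in the order above:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, count, sum}

    val spark = SparkSession.builder().appName("csv-agg").getOrCreate()

    // read the raw CSV and name the columns explicitly (no header assumed)
    val df = spark.read.csv("hdfs://.../*.csv")
      .toDF("dep", "des", "cost", "state")
      .withColumn("cost", col("cost").cast("double"))

    // one row per (dep, des, state) with total cost and employee count
    val results = df.groupBy("dep", "des", "state")
      .agg(sum("cost"), count("*"))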
    
