I am new to spark, and I want to use group-by & reduce to find the following from CSV (one line by employed):
Department, Designation, costToCompany, S
The following might not be entirely correct, but it should give you some idea of how to juggle data. It's not pretty, should be replaced with case classes etc, but as a quick example of how to use the spark api, I hope it's enough :)
val rawlines = sc.textfile("hdfs://.../*.csv")
case class Employee(dep: String, des: String, cost: Double, state: String)
val employees = rawlines
.map(_.split(",") /*or use a proper CSV parser*/
.map( Employee(row(0), row(1), row(2), row(3) )
# the 1 is the amount of employees (which is obviously 1 per line)
val keyVals = employees.map( em => (em.dep, em.des, em.state), (1 , em.cost))
val results = keyVals.reduceByKey{ a,b =>
(a._1 + b._1, b._1, b._2) # (a.count + b.count , a.cost + b.cost )
}
#debug output
results.take(100).foreach(println)
results
.map( keyval => someThingToFormatAsCsvStringOrWhatever )
.saveAsTextFile("hdfs://.../results")
Or you can use SparkSQL:
val sqlContext = new SQLContext(sparkContext)
# case classes can easily be registered as tables
employees.registerAsTable("employees")
val results = sqlContext.sql("""select dep, des, state, sum(cost), count(*)
from employees
group by dep,des,state"""