I am new to Spark, and I want to use group-by & reduce to find the following from a CSV (one line per employee):
Department, Designation, costToCompany, State
The CSV file can be parsed with Spark's built-in CSV reader. It returns a DataFrame/Dataset on a successful read of the file, and you can easily apply SQL-like operations on top of that DataFrame/Dataset.
Create a SparkSession object, named spark below:

import org.apache.spark.sql.SparkSession;
SparkSession spark = SparkSession
.builder()
.appName("Java Spark SQL Example")
.getOrCreate();
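If you run the job from an IDE instead of spark-submit, you usually have to set a master explicitly. A minimal sketch for local testing (the local[*] master is an assumption for running outside a cluster):

// For local testing only: run Spark in-process, using all available cores.
SparkSession spark = SparkSession
    .builder()
    .appName("Java Spark SQL Example")
    .master("local[*]") // assumption: local run; omit when submitting to a cluster
    .getOrCreate();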
Create a schema for the rows with StructType:

import org.apache.spark.sql.types.StructType;
StructType schema = new StructType()
.add("department", "string")
.add("designation", "string")
.add("ctc", "long")
.add("state", "string");
Create a DataFrame from the CSV file and apply the schema to it:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> df = spark.read()
.option("mode", "DROPMALFORMED")
.schema(schema)
.csv("hdfs://path/input.csv");
There are more options available for reading data from a CSV file.
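For instance, a sketch of a few commonly used reader options (whether you need header or a custom delimiter depends on your file; the path is the same hypothetical HDFS path as above):

Dataset<Row> dfWithOptions = spark.read()
    .option("header", "true")        // treat the first line as column names
    .option("inferSchema", "true")   // let Spark guess column types (costs an extra pass)
    .option("delimiter", ",")        // field separator; change for TSV, etc.
    .option("mode", "DROPMALFORMED") // drop rows that do not match the schema
    .csv("hdfs://path/input.csv");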
1. SQL way
Register the DataFrame as a temporary view in the Spark SQL catalog to perform SQL operations:
df.createOrReplaceTempView("employee");

Run a SQL query on the registered view:
Dataset<Row> sqlResult = spark.sql(
    "SELECT department, designation, state, SUM(ctc), COUNT(department)"
        + " FROM employee GROUP BY department, designation, state");
sqlResult.show(); // for testing
We can even execute SQL directly on the CSV file, without creating a table, with Spark SQL, as sketched below.
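A sketch of such a direct-on-file query (assuming the same hypothetical HDFS path; note that without a schema Spark auto-names the CSV columns _c0, _c1, ..., so the query has to refer to them positionally):

// Query the file as a table via the csv.`path` syntax; columns are _c0.._c3.
Dataset<Row> direct = spark.sql(
    "SELECT _c0 AS department, _c1 AS designation, _c3 AS state,"
        + " SUM(CAST(_c2 AS LONG)) AS total_ctc, COUNT(_c0) AS emp_count"
        + " FROM csv.`hdfs://path/input.csv`"
        + " GROUP BY _c0, _c1, _c3");
direct.show(); // for testing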
2. Object chaining or Programming or Java-like way
Do the necessary imports for the SQL functions:

import static org.apache.spark.sql.functions.count;
import static org.apache.spark.sql.functions.sum;

Use groupBy and agg on the DataFrame/Dataset to perform count and sum on the data:

Dataset<Row> dfResult = df.groupBy("department", "designation", "state")
    .agg(sum("ctc"), count("department"));
// Since Spark 1.6, columns mentioned in groupBy are added to the result by default
dfResult.show(); // for testing
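To give the aggregate columns friendlier names, you can alias them with as; a small sketch (the names total_ctc and emp_count are made up for illustration):

Dataset<Row> named = df.groupBy("department", "designation", "state")
    .agg(sum("ctc").as("total_ctc"), count("department").as("emp_count"));
named.show(); // for testing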
"org.apache.spark" % "spark-core_2.11" % "2.0.0"
"org.apache.spark" % "spark-sql_2.11" % "2.0.0"