Parse CSV as DataFrame/DataSet with Apache Spark and Java

灰色年华 2020-12-07 16:54

I am new to Spark, and I want to use group-by & reduce to find the following from a CSV (one line per employee):

  Department, Designation, costToCompany, State
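
A few hypothetical rows in that format (values invented purely for illustration):

  Sales,Trainee,12000,UP
  Sales,Lead,32000,AP
  Marketing,Associate,18000,TN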


        
4 Answers
  •  天涯浪人 2020-12-07 17:28

    A CSV file can be parsed with Spark's built-in CSV reader, which returns a DataFrame/Dataset on a successful read. On top of the DataFrame/Dataset, you can easily apply SQL-like operations.

    Using Spark 2.x (and above) with Java

    Create a SparkSession object, aka spark

    import org.apache.spark.sql.SparkSession;
    
    SparkSession spark = SparkSession
        .builder()
        .appName("Java Spark SQL Example")
        .getOrCreate();
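
    If you run this from an IDE rather than through spark-submit, you may also need to set a master URL; a minimal variant assuming a local test run (local[*] here is just a common choice, not required on a cluster):

    SparkSession spark = SparkSession
        .builder()
        .appName("Java Spark SQL Example")
        .master("local[*]") // assumption: run locally on all cores; omit when submitting to a cluster
        .getOrCreate();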
    

    Create a schema for the rows with StructType

    import org.apache.spark.sql.types.StructType;
    
    StructType schema = new StructType()
        .add("department", "string")
        .add("designation", "string")
        .add("ctc", "long")
        .add("state", "string");
    

    Create a DataFrame from the CSV file and apply the schema to it

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    Dataset<Row> df = spark.read()
        .option("mode", "DROPMALFORMED") // drop rows that cannot be parsed against the schema
        .schema(schema)
        .csv("hdfs://path/input.csv");
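
    To sanity-check the load, you can print the schema and a few rows:

    df.printSchema();
    df.show(5); // display the first 5 rows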
    

    There are more options for reading data from a CSV file; a few common ones are sketched below.
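
    For example (these are standard Spark CSV reader options; the values are illustrative):

    Dataset<Row> withOptions = spark.read()
        .option("header", "true")    // first line holds column names
        .option("delimiter", ",")    // field separator (comma is the default)
        .option("nullValue", "NA")   // string to treat as null
        .schema(schema)
        .csv("hdfs://path/input.csv");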

    Now we can aggregate the data in two ways:

    1. SQL way

    Register the DataFrame as a temporary view so you can run SQL against it

    df.createOrReplaceTempView("employee");
    

    Run a SQL query on the registered view

    Dataset<Row> sqlResult = spark.sql(
        "SELECT department, designation, state, SUM(ctc), COUNT(department)"
            + " FROM employee GROUP BY department, designation, state");

    sqlResult.show(); // for testing
    

    We can even execute SQL directly on the CSV file, without creating a table, using Spark SQL, as sketched below.
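
    A minimal sketch using Spark SQL's "run SQL on files directly" syntax (csv.`path`); note that without a schema or header, the columns get default names such as _c0:

    Dataset<Row> direct = spark.sql(
        "SELECT * FROM csv.`hdfs://path/input.csv`");
    direct.show();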


    2. Object chaining / programmatic / Java-like way

    Do the necessary static imports for the SQL functions

    import static org.apache.spark.sql.functions.count;
    import static org.apache.spark.sql.functions.sum;
    

    Use groupBy and agg on the DataFrame/Dataset to perform count and sum on the data

    Dataset<Row> dfResult = df.groupBy("department", "designation", "state")
        .agg(sum("ctc"), count("department"));
    // Since Spark 1.6, columns mentioned in groupBy are included in the result by default

    dfResult.show(); // for testing
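
    If you want friendlier column names than sum(ctc) and count(department), you can alias the aggregates; the names total_ctc and employees below are illustrative:

    Dataset<Row> named = df.groupBy("department", "designation", "state")
        .agg(sum("ctc").as("total_ctc"),
             count("department").as("employees"));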
    

    Dependent libraries (sbt syntax):

    "org.apache.spark" % "spark-core_2.11" % "2.0.0" 
    "org.apache.spark" % "spark-sql_2.11" % "2.0.0"
    
