How to process this in parallel on a cluster using MapFunction and ReduceFunction of the Spark Java API?

Submitted by 喜夏-厌秋 on 2020-04-25 06:02:25

Question


I am using spark-sql 2.4.1 with Java 8.

I have to do a complex calculation with group-by under various conditions using the Java API, i.e. using MapFunction and ReduceFunction.
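
For reference, my understanding is that these two interfaces plug into the typed Dataset API through groupByKey and reduceGroups: the MapFunction extracts the grouping key and the ReduceFunction combines the records within each group in parallel. Below is only a minimal sketch for avg(revenue) per country, assuming a hypothetical IndustryRevenue POJO with getCountry()/getRevenue() accessors (names are illustrative, not my actual code):

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.api.java.function.ReduceFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import scala.Tuple2;

static Dataset<Tuple2<String, Double>> avgRevenueByCountry(Dataset<IndustryRevenue> ds) {
    return ds
        // MapFunction: extract the grouping key (country)
        .groupByKey((MapFunction<IndustryRevenue, String>) r -> r.getCountry(), Encoders.STRING())
        // MapFunction: turn each record into a (revenue, 1) pair so the average can be derived later
        .mapValues((MapFunction<IndustryRevenue, Tuple2<Double, Long>>) r ->
                new Tuple2<>(r.getRevenue(), 1L),
            Encoders.tuple(Encoders.DOUBLE(), Encoders.LONG()))
        // ReduceFunction: add up the (sum, count) pairs within each group, in parallel
        .reduceGroups((ReduceFunction<Tuple2<Double, Long>>) (a, b) ->
            new Tuple2<>(a._1() + b._1(), a._2() + b._2()))
        // sum / count = mean revenue per country
        .map((MapFunction<Tuple2<String, Tuple2<Double, Long>>, Tuple2<String, Double>>) t ->
                new Tuple2<>(t._1(), t._2()._1() / t._2()._2()),
            Encoders.tuple(Encoders.STRING(), Encoders.DOUBLE()));
}

This computes one average per country, but I do not see how to extend it cleanly to many dates and many different groupings at once, which is my actual scenario below.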

Scenario:

A sample of the source data is given below:

+--------+--------------+-----------+-------------+---------+------+
| country|generated_date|industry_id|industry_name|  revenue| state|
+--------+--------------+-----------+-------------+---------+------+
|Country1|    2020-03-01|    Indus_1| Indus_1_Name| 12789979|State1|
|Country1|    2019-06-01|    Indus_1| Indus_1_Name| 56189008|State1|
|Country1|    2019-03-01|    Indus_1| Indus_1_Name| 12789979|State1|
|Country1|    2020-03-01|    Indus_2| Indus_2_Name| 21789933|State2|
|Country1|    2018-03-01|    Indus_2| Indus_2_Name|300789933|State2|
|Country1|    2019-03-01|    Indus_3| Indus_3_Name| 27989978|State3|
|Country1|    2017-06-01|    Indus_3| Indus_3_Name| 56189008|State3|
|Country1|    2017-03-01|    Indus_3| Indus_3_Name| 30014633|State3|
|Country2|    2020-03-01|    Indus_4| Indus_4_Name| 41789978|State1|
|Country2|    2018-03-01|    Indus_4| Indus_4_Name| 56189008|State1|
|Country3|    2019-03-01|    Indus_5| Indus_5_Name| 37899790|State3|
|Country3|    2018-03-01|    Indus_5| Indus_5_Name| 56189008|State3|
|Country3|    2017-03-01|    Indus_5| Indus_5_Name| 67789978|State3|
|Country1|    2020-03-01|    Indus_6| Indus_6_Name| 12789979|State1|
|Country1|    2020-06-01|    Indus_6| Indus_6_Name| 37899790|State1|
|Country1|    2018-03-01|    Indus_6| Indus_6_Name| 56189008|State1|
|Country3|    2020-03-01|    Indus_7| Indus_7_Name| 26689900|State1|
|Country3|    2020-12-01|    Indus_7| Indus_7_Name|212359979|State1|
|Country3|    2019-03-01|    Indus_7| Indus_7_Name| 12789979|State1|
|Country1|    2018-03-01|    Indus_8| Indus_8_Name|212359979|State2|
+--------+--------------+-----------+-------------+---------+------+

I need to compute various aggregates, such as avg(revenue), for each given group over the given dates. I can do it, but it does not scale at all on the Spark cluster.

For that I am currently doing the following, but it does not scale at all. I understand that I need to use the Java MapFunction and ReduceFunction, but I am not sure how to do it.

        // Requires: import static org.apache.spark.sql.functions.*;
        //           import org.apache.spark.sql.types.DataTypes;

        //Dates for which I need to calculate; these are provided by an external source
        List<String> datesToCalculate = Arrays.asList("2019-03-01","2020-06-01","2018-09-01");

        //Groups to calculate; provided by an external source and they keep changing.
        //There are around 100 groups (each is a comma-separated list of grouping columns).
        List<String> groupsToCalculate = Arrays.asList("country","country,state");

        //For each given date, calculate avg(revenue) for each given group,
        //over the records whose generated_date is later than that date.

        //Currently I am doing something like this, but it is not scaling:

        datesToCalculate.stream().forEach( cal_date -> {

            Dataset<IndustryRevenue> calc_ds = ds.where(col("generated_date").gt(lit(cal_date)));

            //this changes for each cal_date
            Dataset<Row> final_ds = calc_ds
                                      .withColumn("calc_date", to_date(lit(cal_date)).cast(DataTypes.DateType));

            //for each group it calculates a separate result set
            groupsToCalculate.stream().forEach( group -> {

                String tempViewName = ("view_" + cal_date + "_" + group)
                        .replaceAll("[^A-Za-z0-9_]", "_");  //view names must be valid identifiers

                final_ds.createOrReplaceTempView(tempViewName);

                String query = "select " + group + ", avg(revenue) as mean"
                                  + " from " + tempViewName
                                  + " group by " + group;

                System.out.println("query : " + query);
                Dataset<Row> resultDs  = spark.sql(query);

                Dataset<Row> finalResultDs = resultDs
                                 .withColumn("calc_date", to_date(lit(cal_date)).cast(DataTypes.DateType))
                                 .withColumn("group", lit(group));


                //Writing each group for each date separately is taking a very long time:
                //every small result set is saved on its own.
                //I want to move this out, union all the finalResultDs and write in batches.
                finalResultDs
                   .write().format("parquet")
                   .mode("append")
                   .save("/tmp/"+ tempViewName);

                spark.catalog().dropTempView(tempViewName);

            });

        });

Because of the nested for-loops this takes more than 20 hours to process a few million records. How can I avoid the loops and make it run quickly?
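
For illustration, here is a rough, unverified sketch of one direction I am considering: cross-join the source data with a small DataFrame of calc dates so that a single job covers every date, run one aggregation per grouping, union the results and do a single partitioned write (instead of one tiny write per date and group). It assumes `ds` and `spark` as above; the column names come from the sample data and `/tmp/industry_revenue_means` is just a placeholder path. Is something like this reasonable, or does it really need MapFunction/ReduceFunction?

import static org.apache.spark.sql.functions.*;
import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

List<String> datesToCalculate = Arrays.asList("2019-03-01", "2020-06-01", "2018-09-01");

// A one-column DataFrame of calc dates; cross-joining it with the source
// makes one job cover every date instead of running one Spark job per date.
Dataset<Row> calcDates = spark
        .createDataset(datesToCalculate, Encoders.STRING())
        .toDF("calc_date_str");

Dataset<Row> exploded = ds
        .crossJoin(calcDates)
        .where(col("generated_date").gt(col("calc_date_str")))
        .withColumn("calc_date", to_date(col("calc_date_str")))
        .drop("calc_date_str");

// One aggregation per grouping, unioned into a single result and written once.
Dataset<Row> byCountry = exploded
        .groupBy(col("calc_date"), col("country"))
        .agg(avg("revenue").alias("mean"))
        .withColumn("state", lit(null).cast("string"))
        .withColumn("group", lit("country"));

Dataset<Row> byCountryState = exploded
        .groupBy(col("calc_date"), col("country"), col("state"))
        .agg(avg("revenue").alias("mean"))
        .withColumn("group", lit("country,state"));

byCountry.select("calc_date", "group", "country", "state", "mean")
        .unionByName(byCountryState.select("calc_date", "group", "country", "state", "mean"))
        .write().format("parquet")
        .mode("append")
        .partitionBy("calc_date", "group")
        .save("/tmp/industry_revenue_means");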

Here is the sample code

https://github.com/BdLearnerr/Java-mapReduce/blob/master/MapReduceScalingProblem.java

Source: https://stackoverflow.com/questions/61391531/how-to-process-this-in-parallel-on-cluster-using-mapfunction-and-reducefunction
