Question
I am using spark-sql-2.4.1v with Java 8.
I have to do a complex calculation with group by on various conditions using the Java API, i.e. using MapFunction and ReduceFunction.
Scenario:
The source data looks like the sample below:
+--------+--------------+-----------+-------------+---------+------+
| country|generated_date|industry_id|industry_name| revenue| state|
+--------+--------------+-----------+-------------+---------+------+
|Country1| 2020-03-01| Indus_1| Indus_1_Name| 12789979|State1|
|Country1| 2019-06-01| Indus_1| Indus_1_Name| 56189008|State1|
|Country1| 2019-03-01| Indus_1| Indus_1_Name| 12789979|State1|
|Country1| 2020-03-01| Indus_2| Indus_2_Name| 21789933|State2|
|Country1| 2018-03-01| Indus_2| Indus_2_Name|300789933|State2|
|Country1| 2019-03-01| Indus_3| Indus_3_Name| 27989978|State3|
|Country1| 2017-06-01| Indus_3| Indus_3_Name| 56189008|State3|
|Country1| 2017-03-01| Indus_3| Indus_3_Name| 30014633|State3|
|Country2| 2020-03-01| Indus_4| Indus_4_Name| 41789978|State1|
|Country2| 2018-03-01| Indus_4| Indus_4_Name| 56189008|State1|
|Country3| 2019-03-01| Indus_5| Indus_5_Name| 37899790|State3|
|Country3| 2018-03-01| Indus_5| Indus_5_Name| 56189008|State3|
|Country3| 2017-03-01| Indus_5| Indus_5_Name| 67789978|State3|
|Country1| 2020-03-01| Indus_6| Indus_6_Name| 12789979|State1|
|Country1| 2020-06-01| Indus_6| Indus_6_Name| 37899790|State1|
|Country1| 2018-03-01| Indus_6| Indus_6_Name| 56189008|State1|
|Country3| 2020-03-01| Indus_7| Indus_7_Name| 26689900|State1|
|Country3| 2020-12-01| Indus_7| Indus_7_Name|212359979|State1|
|Country3| 2019-03-01| Indus_7| Indus_7_Name| 12789979|State1|
|Country1| 2018-03-01| Indus_8| Indus_8_Name|212359979|State2|
+--------+--------------+-----------+-------------+---------+------+
I need to run various calculations, such as avg(revenue), for each given group for given dates. I am able to do it, but it does not scale at all on the Spark cluster.
For that I am doing the following, but it is not scaling at all... hence I understood I need to use Java's MapFunction and ReduceFunction, but I am not sure how to do it.
// The dates for which to calculate; provided by an external source.
List<String> datesToCalculate = Arrays.asList("2019-03-01","2020-06-01","2018-09-01");

// The groups to calculate; provided by an external source and keeps changing.
// There are around 100 groups.
List<String> groupsToCalculate = Arrays.asList("Country","Country-State");

// For each given date, need to calculate avg(revenue) for each given group,
// over the records whose generated_date is later than that date.
// Currently I am doing something like this, but it does not scale:
datesToCalculate.stream().forEach( cal_date -> {
    Dataset<IndustryRevenue> calc_ds = ds.where(col("generated_date").gt(lit(cal_date)));

    // This changes for each cal_date.
    Dataset<Row> final_ds = calc_ds
        .withColumn("calc_date", to_date(lit(cal_date)).cast(DataTypes.DateType));

    // For each group it calculates a separate result set.
    groupsToCalculate.stream().forEach( group -> {
        // Hyphens are not valid in view names, so replace them.
        String tempViewName = ("view_" + cal_date + "_" + group).replace("-", "_");
        final_ds.createOrReplaceTempView(tempViewName);

        // Each group entry must be a valid SQL grouping expression.
        String query = "select " + group + ", avg(revenue) as mean"
            + " from " + tempViewName
            + " group by " + group;
        System.out.println("query : " + query);

        Dataset<Row> resultDs = spark.sql(query);
        Dataset<Row> finalResultDs = resultDs
            .withColumn("calc_date", to_date(lit(cal_date)).cast(DataTypes.DateType))
            .withColumn("group", lit(group));

        // Writing each group for each date takes a huge amount of time,
        // because every result is saved separately. I want to move the
        // writes out, union all finalResultDs, and write in batches
        // (see the sketch after this snippet).
        finalResultDs
            .write().format("parquet")
            .mode("append")
            .save("/tmp/" + tempViewName);

        spark.catalog().dropTempView(tempViewName);
    });
});
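To illustrate the "union and write in batches" idea from the comment above, here is a minimal sketch. The assumptions are mine, not from the question: the same ds, datesToCalculate and groupsToCalculate are in scope, each group entry is rewritten as a comma-separated column list such as "country" or "country,state", and /tmp/revenue_means is a placeholder output path. It collects every per-date/per-group aggregate and issues a single write:

import java.util.Arrays;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.*;

Dataset<Row> unioned = null;
for (String calDate : datesToCalculate) {
    for (String group : groupsToCalculate) {
        // Assumption: each group entry is a comma-separated column list,
        // e.g. "country" or "country,state".
        Column[] groupCols = Arrays.stream(group.split(","))
                .map(c -> col(c.trim()))
                .toArray(Column[]::new);
        Dataset<Row> partial = ds
                .where(col("generated_date").gt(lit(calDate)))
                .groupBy(groupCols)
                .agg(avg("revenue").alias("mean"))
                // Collapse the grouping columns into one key so every
                // partial result has the same schema and can be unioned.
                .select(concat_ws("-", groupCols).alias("group_key"), col("mean"))
                .withColumn("calc_date", to_date(lit(calDate)))
                .withColumn("group", lit(group));
        unioned = (unioned == null) ? partial : unioned.unionByName(partial);
    }
}
if (unioned != null) {
    // One write for all dates and groups instead of one per iteration.
    unioned.write().format("parquet").mode("append").save("/tmp/revenue_means");
}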
Due to the for-loops it takes more than 20 hours to process a few million records. So how do I avoid the for-loops and make it run quickly?
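One way to drop the for-loops entirely is to cross-join the small list of calculation dates with the source, so the filter and each aggregation run in a single distributed pass. This is a sketch of one possible approach, not code from the question; dates, exploded, byCountry and byCountryState are hypothetical names:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.*;

// Build a one-column Dataset holding the calculation dates.
Dataset<Row> dates = spark.createDataset(datesToCalculate, Encoders.STRING())
        .toDF("calc_date");

// Pair every source row with every calc_date, then keep only the rows
// generated after that date. The date list is tiny, so the cross join
// multiplies the data only by the number of dates.
Dataset<Row> exploded = ds.crossJoin(dates)
        .where(col("generated_date").gt(col("calc_date")));

// One aggregation per grouping, each covering all dates at once.
Dataset<Row> byCountry = exploded
        .groupBy(col("calc_date"), col("country"))
        .agg(avg("revenue").alias("mean"));
Dataset<Row> byCountryState = exploded
        .groupBy(col("calc_date"), col("country"), col("state"))
        .agg(avg("revenue").alias("mean"));

Each result can then be written once, e.g. with partitionBy("calc_date") on the writer, instead of one small write per loop iteration.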
Here is the sample code:
https://github.com/BdLearnerr/Java-mapReduce/blob/master/MapReduceScalingProblem.java
Source: https://stackoverflow.com/questions/61391531/how-to-process-this-in-parallel-on-cluster-using-mapfunction-and-reducefunction