Apache Spark timing forEach operation on JavaRDD

Submitted by 心已入冬 on 2019-12-25 07:14:43

Question


Question: Is this a valid way to test the time taken to build a RDD?

I am doing two things here. The basic approach is that we have M instances of what we call a DropEvaluation, and N DropResults. We need to compare each of the N DropResults to each of the M DropEvaluations. Each of the N must be seen by each of the M, to give us M results in the end.

If I don't use the .count() once the RDD is built, the driver continues on to the next line of code and reports that it took almost no time to build an RDD that actually takes 30 minutes to build.

I just want to make sure I am not missing something, like maybe the .count() taking a long time? I guess to time the .count() I'd have to modify Spark's source?

M = 1000 or 2000. N = 10^7.

It's effectively a cartesian problem -- the accumulator was chosen because we need to write to each M in place. It would also be ugly to build the full cartesian RDD.

We build a List of M Accumulators (you can't have a List Accumulator in Java, right?). Then we loop through each of the N in an RDD with a foreach.

Clarifying the Question: The total time taken is measured correctly; I am asking whether the .count() on the RDD forces Spark to wait until the RDD is finished before it can run the count. Is the .count() time itself significant?

Here's our code:

// assume standin exists and does its thing correctly

// this controls the final size of RDD, as we are not parallelizing something with an existing length
List<Integer> rangeN = IntStream.rangeClosed(simsLeft - blockSize + 1, simsLeft).boxed().collect(Collectors.toList());

// setup bogus array of size N for parallelize dataSetN to lead to dropResultsN       
JavaRDD<Integer> dataSetN = context.parallelize(rangeN);

// setup timing to create N
long NCreationStartTime = System.nanoTime();

// this maps each integer element of RDD dataSetN to a "geneDropped" chromosome simulation, we need N of these:
JavaRDD<TholdDropResult> dropResultsN = dataSetN.map(s -> standin.call(s)).persist(StorageLevel.MEMORY_ONLY());

// **** this line makes the driver wait until the RDD is done, right?
long dummyLength = dropResultsN.count();


long NCreationNanoSeconds = System.nanoTime() - NCreationStartTime;
double NCreationSeconds = (double)NCreationNanoSeconds / 1000000000.0;
double NCreationMinutes = NCreationSeconds / 60.0;

logger.error("{} test sims remaining", simsLeft);

// now get the time for just the dropComparison (part of accumulable's add)
long startDropCompareTime = System.nanoTime();

// here we iterate through each accumulator in the list and compare all N elements of the dropResultsN RDD to each M in turn; our .add() is a custom AccumulableParam
for (Accumulable<TholdDropTuple, TholdDropResult> dropEvalAccum : accumList) {
    dropResultsN.foreach(new VoidFunction<TholdDropResult>() {
        @Override
        public void call(TholdDropResult dropResultFromN) throws Exception {
            dropEvalAccum.add(dropResultFromN);
        }
    });
} // end for that goes through accumList

// all the dropComparisons for all N to all M for this blocksize are done, check the time...
long dropCompareNanoSeconds = System.nanoTime() - startDropCompareTime;
double dropCompareSeconds = (double) dropCompareNanoSeconds / 1000000000.0;
double dropCompareMinutes = dropCompareSeconds / 60.0;

// write lines to indicate timing section
// log and write to file the time for the N-creation

...

Answer 1:


Spark programs are lazy: nothing runs until you call an action, such as count, on the RDD. You can find a list of common actions in Spark's documentation.

// **** this line makes the driver wait until the RDD is done, right?
long dummyLength = dropResultsN.count();

Yes, in this case count forces dropResultsN to be computed, so it will take a long time. If you call count a second time, it will return very quickly, since the RDD has already been computed and cached.
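The same define-then-force pattern can be illustrated with plain java.util.stream, which is also lazily evaluated (this is only an analogy standing in for Spark here, not the Spark API itself): the map function does no work when the pipeline is defined, and only runs when a terminal operation, playing the role of .count(), pulls the whole pipeline.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyDemo {
    // Counts how many times the map function actually executes.
    static final AtomicInteger mapCalls = new AtomicInteger();
    static int callsBeforeTerminal;

    static List<Integer> demo() {
        // Defining the pipeline is like building an RDD lineage: no work happens yet.
        Stream<Integer> mapped = Stream.of(1, 2, 3, 4, 5)
                .map(i -> { mapCalls.incrementAndGet(); return i * 2; });

        // Still 0 here: map has not run, just like an un-actioned RDD transformation.
        callsBeforeTerminal = mapCalls.get();

        // The terminal operation plays the role of .count(): it forces execution.
        return mapped.collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> out = demo();
        System.out.println("map calls before terminal op: " + callsBeforeTerminal); // 0
        System.out.println("map calls after terminal op: " + mapCalls.get());       // 5
        System.out.println(out); // [2, 4, 6, 8, 10]
    }
}
```

This is why your wall-clock measurement around the .count() is valid: the timer brackets both the deferred map work and the count itself, and the count alone is cheap compared to materializing the RDD.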



Source: https://stackoverflow.com/questions/38296950/apache-spark-timing-foreach-operation-on-javardd
