Spark java.lang.StackOverflowError

囚心锁ツ · 2020-12-16 19:07

I'm using Spark to calculate the PageRank of user reviews, but I keep getting Spark java.lang.StackOverflowError when I run my code on a big dataset (

3 Answers
旧时难觅i · 2020-12-16 19:24

    I have several suggestions that should greatly improve the performance of the code in your question.

    1. Caching: Cache only those datasets that you need to refer to again and again, whether for the same or for different operations (iterative algorithms).

    Take RDD.count as an example: to tell you the number of lines in the file, the file needs to be read. So when you call RDD.count, the file is read, the lines are counted, and the count is returned.

    What if you call RDD.count again? The same thing happens: the file is read and counted again. So what does RDD.cache do? With it, the first RDD.count loads the file, caches it, and counts it. A second RDD.count is served from the cache: it just takes the data from memory and counts the lines, with no recomputation.
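    Here is a minimal sketch of that behavior in Java; the file path, app name, and local master are placeholders, not taken from the question:

    ```java
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class CacheDemo {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("cache-demo").setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaRDD<String> lines = sc.textFile("reviews.txt"); // placeholder path

                // cache() marks the RDD for in-memory storage; it is lazy and
                // equivalent to persist(StorageLevel.MEMORY_ONLY()).
                lines.cache();

                long first = lines.count();  // reads the file, fills the cache, counts
                long second = lines.count(); // served from the cache, no second file read
                System.out.println(first + " / " + second);
            }
        }
    }
    ```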

    Read more about caching here.

    In your code sample you are not reusing anything that you've cached, so you can remove the .cache calls.

    2. Parallelization: In the code sample, you've parallelized every individual element of your RDD, which is already a distributed collection. I suggest merging the rddFileData, rddMovieData, and rddPairReviewData steps so that they happen in one pass (see the sketch below).

    Get rid of .collect, since it brings the results back to the driver and may be the actual reason for your error.
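    The original code isn't shown in the excerpt, so the following is only an assumed shape: rddFileData, rddMovieData, and rddPairReviewData are treated as hypothetical intermediate steps, and the record layout is invented. The sketch shows how a collect-then-parallelize sequence can be fused into one transformation chain that never leaves the cluster:

    ```java
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class MergeDemo {
        static JavaPairRDD<String, String> buildPairReviews(JavaSparkContext sc) {
            // Anti-pattern (assumed from the question): each step collects to
            // the driver and re-parallelizes, which is slow and error-prone:
            //   List<String> fileData = sc.textFile("reviews.txt").collect();
            //   JavaRDD<String> rddFileData = sc.parallelize(fileData);
            //   ... repeated again for rddMovieData and rddPairReviewData ...

            // Preferred: one lazy chain; nothing moves to the driver.
            JavaRDD<String> rddFileData = sc.textFile("reviews.txt"); // placeholder path
            return rddFileData
                    .map(line -> line.split("\t"))             // assumed tab-separated records
                    .filter(fields -> fields.length >= 2)      // drop malformed lines
                    .mapToPair(f -> new Tuple2<>(f[0], f[1])); // hypothetical (movieId, review)
            // No collect(): the pairs stay distributed, so further transformations
            // (e.g. reduceByKey) and actions (e.g. saveAsTextFile) run on the executors.
        }
    }
    ```

    Only small final aggregates (for example a top-N list) should ever come back to the driver through collect.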
