I'm using Spark to calculate the PageRank of user reviews, but I keep getting java.lang.StackOverflowError when I run my code on a big dataset.
I have several suggestions that will help you greatly improve the performance of the code in your question.
Spark evaluates RDDs lazily: transformations only describe a computation, and nothing actually runs until an action forces it. An example is RDD.count: to tell you the number of lines in the file, the file needs to be read. So if you write RDD.count, at this point the file will be read, the lines will be counted, and the count will be returned.

What if you call RDD.count again? The same thing: the file will be read and counted again. So what does RDD.cache do? Now, if you run RDD.count the first time, the file will be loaded, cached, and counted. If you call RDD.count a second time, the operation will use the cache. It will just take the data from the cache and count the lines, no recomputation.
Read more about caching in the RDD persistence section of the Spark programming guide.
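Here is a minimal sketch of that behavior, written in spark-shell style (sc is the SparkContext the shell provides; reviews.txt is an illustrative file name, not from your question):

```scala
// spark-shell style: sc is the SparkContext provided by the shell.
// reviews.txt is an illustrative file name, not from the original question.
val lines = sc.textFile("reviews.txt")

lines.count()   // file is read from disk, lines counted
lines.count()   // file is read again: nothing was cached

val cached = sc.textFile("reviews.txt").cache()
cached.count()  // file is read once; partitions are kept in memory by this action
cached.count()  // served from the cache, no recomputation
```

Note that cache() itself is lazy too: it only marks the RDD for persistence, and nothing is materialized until the first action runs on it.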
In your code sample you are not reusing anything that you've cached. So you may remove the .cache from there.
Combine your rddFileData, rddMovieData and rddPairReviewData steps so that they happen in one go, as in the sketch below. Also get rid of .collect, since that brings all the results back to the driver and may be the actual reason for your error.
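A rough sketch of that pattern follows. The variable names come from your question, but the file name and the parsing logic are invented, since your original transformations aren't shown; treat it as the shape of the fix, not the exact code:

```scala
// spark-shell style: sc is the SparkContext provided by the shell.
// The file name and parsing below are assumptions; the question doesn't
// show what rddFileData / rddMovieData actually did.
val rddPairReviewData = sc
  .textFile("reviews.txt")                  // was the separate rddFileData step
  .map(_.split("\t"))                       // was the separate rddMovieData step
  .map(fields => (fields(0), fields(1)))    // build the pairs directly, one lineage

// Instead of .collect(), which ships every element back to the driver and
// can exhaust it on a big dataset, keep the output distributed:
rddPairReviewData.saveAsTextFile("pairs-output")

// ...or bring back only a small sample when you just want to inspect it:
rddPairReviewData.take(10).foreach(println)
```

Unlike collect, take(n) pulls only n elements onto the driver, so it stays safe no matter how large the dataset grows.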