Spark Java Error: Size exceeds Integer.MAX_VALUE

Submitted by ☆樱花仙子☆ on 2019-11-30 06:40:59
Daniel Langdon

The Integer.MAX_VALUE restriction is on the size of a single file (block) being stored. 1.2M rows is not a big thing, so I'm not sure your problem is "the limits of Spark". More likely, some part of your work is creating something too big to be handled by any given executor.

I'm no Python coder, but when you "hashed the features of the records" you might be taking a very sparse set of records for a sample and creating a non-sparse (dense) array. That will mean a lot of memory for 16384 features, particularly when you do zip(line[1].indices, line[1].data). The only reason that doesn't run you out of memory right there is the large amount of it you seem to have configured (50G).
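As an illustration only (none of these names come from the original question), a hashing step that keeps each row as a SparseVector avoids materializing 16384 floats per record:

    # Hypothetical sketch: hash tokens into a SparseVector so that only the
    # non-zero entries are stored, instead of a dense 16384-element array.
    from pyspark.ml.linalg import Vectors

    NUM_FEATURES = 16384  # feature-space size taken from the question

    def hash_features(tokens):
        counts = {}
        for tok in tokens:
            idx = hash(tok) % NUM_FEATURES
            counts[idx] = counts.get(idx, 0.0) + 1.0
        # Vectors.sparse keeps only (index, value) pairs, so memory scales
        # with the number of non-zero features, not with 16384.
        return Vectors.sparse(NUM_FEATURES, sorted(counts.items()))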

Another thing that might help is to increase the partitioning. If you can't make your rows use less memory, you can at least try having fewer rows on any given task. Any temporary files being created are likely to depend on this, so you'll be less likely to hit the file-size limit. A quick sketch follows.
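A minimal, hedged example of that idea; the RDD name and partition count are placeholders, not values from the question:

    # Increase the number of partitions so each task (and each block written
    # to disk) holds fewer rows. 400 is an arbitrary example value.
    features = features.repartition(400)

    # Or set the partitioning when the input is first read:
    # rdd = sc.textFile("hdfs:///path/to/input", minPartitions=400)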


And, totally unrelated to the error but relevant for what you are trying to do:

16384 is indeed a big number of features. In the optimistic case where each one is just a boolean feature, you have a total of 2^16384 possible combinations to learn from, which is a huge number (try it here: https://defuse.ca/big-number-calculator.htm).
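Just to give a sense of scale (a quick check, not part of the original answer):

    # 2**16384 written out in decimal has roughly 4,933 digits.
    print(len(str(2**16384)))   # -> 4933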

It is VERY, VERY likely that no algorithm will be able to learn a decision boundary from just 1.2M samples; you would probably need at least a few trillion trillion examples to make a dent in such a feature space. Machine learning has its limitations, so don't be surprised if you don't get better-than-random accuracy.

I would definitely recommend trying some sort of dimensionality reduction first!!
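One possible (hypothetical) way to do that in PySpark, assuming the data is a DataFrame with a "features" vector column; the column names, DataFrame name, and k are made up for illustration:

    # Project the 16384-dimensional hashed features down to 100 components.
    from pyspark.ml.feature import PCA

    pca = PCA(k=100, inputCol="features", outputCol="pca_features")
    model = pca.fit(train_df)             # train_df is a placeholder DataFrame
    reduced = model.transform(train_df)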

At some point, it tries to store the features, and 1.2M * 16384 is greater than Integer.MAX_VALUE, so you are trying to store more than the maximum size supported by Spark.
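The arithmetic, spelled out (not part of the original answer):

    # Integer.MAX_VALUE is 2**31 - 1 = 2_147_483_647.
    rows, features = 1_200_000, 16384
    print(rows * features)   # 19_660_800_000 -- roughly 9x over the limit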

You're probably running into the limits of Apache Spark.

gsamaras

Increasing the number of partitions may cause "Active tasks" in the Spark UI to show a negative number, which probably means that the number of partitions is too high.
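A small, hypothetical check before raising the partition count further (variable names are placeholders):

    # Inspect the current partition count; coalesce back down if it was set too high.
    print(features.getNumPartitions())
    features = features.coalesce(200)   # reduces partitions without a full shuffle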
