Why does a Spark RDD partition have a 2GB limit for HDFS?

梦毁少年i  2020-12-01 08:09

I get an error when using MLlib RandomForest to train data. My dataset is huge and the default partitioning is relatively small, so an exception is thrown indicating that "
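The usual workaround is to raise the partition count before training so that no single partition's serialized block approaches the 2GB ceiling (Spark stores a partition block behind an Int-indexed ByteBuffer, hence the Integer.MAX_VALUE limit). Below is a minimal, hypothetical Scala/MLlib sketch of that pattern; the HDFS path, partition count, and RandomForest parameters are illustrative placeholders, not values from the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.util.MLUtils

object TrainWithMorePartitions {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rf-2gb-workaround"))

    // Load the training set, then repartition so each partition stays well
    // under 2GB before any shuffle or caching happens.
    val data = MLUtils
      .loadLibSVMFile(sc, "hdfs:///path/to/training-data") // placeholder path
      .repartition(2000)                                   // placeholder count
      .cache()

    val model = RandomForest.trainClassifier(
      data,
      2,                // numClasses (placeholder)
      Map[Int, Int](),  // categoricalFeaturesInfo
      100,              // numTrees (placeholder)
      "auto",           // featureSubsetStrategy
      "gini",           // impurity
      10,               // maxDepth
      32,               // maxBins
      42)               // seed

    println("Training finished: " + model.numTrees + " trees")
    sc.stop()
  }
}
```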

3 Answers
  •  悲哀的现实  2020-12-01 09:09

    The problem shows up when using datastores like Cassandra, HBase, or Accumulo: the block size is based on the datastore's splits, which can be over 10 GB. When loading data from these datastores you have to repartition immediately, into thousands of partitions, so you can operate on the data without blowing the 2GB limit (see the sketch below).
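    A hedged sketch of that repartition-right-after-load pattern, using the DataStax spark-cassandra-connector as an example; the connection host, keyspace, table, and partition count are illustrative placeholders, not part of the original answer.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object RepartitionAfterLoad {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf()
        .setAppName("cassandra-repartition")
        .set("spark.cassandra.connection.host", "127.0.0.1")) // placeholder host

    // The connector derives splits from the datastore's own layout, which can
    // yield very large partitions; repartition immediately so downstream
    // stages never materialize a single block near the 2GB ceiling.
    val rows = sc.cassandraTable("my_keyspace", "my_table") // placeholder names
      .repartition(2000)                                    // placeholder count

    println("Partitions after repartition: " + rows.partitions.length)
    sc.stop()
  }
}
```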

    Most people who use Spark are not really working with large data; to them, anything bigger than Excel or Tableau can hold counts as big data. They are mostly data scientists who work with curated data or use a sample size small enough to stay within the limit.

    When processing large volumes of data, I end up having to go back to MapReduce and only use Spark once the data has been cleaned up. This is unfortunate; however, the majority of the Spark community has shown little interest in addressing the issue.

    A simple solution would be to create an abstraction that uses a byte array by default but lets a Spark job switch to a 64-bit data pointer to handle the larger jobs.
