Spark job execution time exponentially increases with very wide dataset and number of columns [duplicate]


If you have a large number of columns, it is better to read each record as a single string and use the slice function to map it to individual columns. Calling substring to extract each column individually will not be as efficient.

EDIT 1:

I used an Array[String] as an example, mapping each element to a case class Record() in Scala. You can extend the same approach to HDFS text files, as sketched after the example below.

scala> case class Record(a1:String,a2:Int,a3:java.time.LocalDate)
defined class Record

scala>  val x = sc.parallelize(Array("abcd1232018-01-01","defg4562018-02-01"))
x: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[3] at parallelize at <console>:24

scala> val y = x.map( a => Record( a.slice(0,4), a.slice(4,4+3).toInt,java.time.LocalDate.parse(a.slice(7,7+10))))
y: org.apache.spark.rdd.RDD[Record] = MapPartitionsRDD[4] at map at <console>:27

scala> y.collect()
res3: Array[Record] = Array(Record(abcd,123,2018-01-01), Record(defg,456,2018-02-01))

scala>
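The same pattern carries over to files stored on HDFS: read the fixed-width lines with sc.textFile, then slice each line into the case class exactly as above. A minimal sketch, assuming a hypothetical path hdfs:///data/records.txt whose lines follow the same layout as Record:

// Hypothetical HDFS path; each line uses the same fixed-width layout as above.
val lines = sc.textFile("hdfs:///data/records.txt")

val records = lines.map { a =>
  Record(
    a.slice(0, 4),                                 // a1: first 4 characters
    a.slice(4, 4 + 3).toInt,                       // a2: next 3 digits
    java.time.LocalDate.parse(a.slice(7, 7 + 10))  // a3: ISO date, 10 characters
  )
}

records.take(2).foreach(println)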