What is the correct way to use a MemSQL Connection object inside the call method of Apache Spark code?

Posted by 一个人想着一个人 on 2019-12-11 10:49:58

Question


I have Spark code in which the call method queries a MemSQL database to read from a table. The code opens a new connection object every time call is invoked and closes it once the work is done. This works, but it makes the Spark job slow. What would be a better way to do this so that the job's execution time is reduced?

Thank you.


Answer 1:


You can use one connection per partition, like this:

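rdd.foreachPartition { records =>
  // One connection per partition, created on the executor that runs the task
  val connection = DB.createConnection()
  try {
    // records is an Iterator and can be traversed only once, so materialize it before reuse
    val batch = records.toList
    batch.foreach { r =>
      val externalData = connection.read(r.externaId)
      // do something with your data
    }
    DB.save(batch)
  } finally {
    connection.close() // always release the connection, even if a record fails
  }
}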
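
To make the saving concrete, here is a minimal standalone sketch that counts how many connections each approach opens. It does not use Spark or MemSQL: `FakeDb`, its `Connection` class, and the hard-coded `partitions` are hypothetical stand-ins, so treat it as an illustration of the idea rather than production code.

```scala
// Hypothetical stand-in for a MemSQL client; it only counts how often a connection is opened.
object FakeDb {
  var opened = 0
  final class Connection {
    opened += 1
    def read(id: Int): String = s"row-$id"
    def close(): Unit = ()
  }
  def createConnection(): Connection = new Connection
}

object ConnectionCountDemo {
  // Three "partitions" of record ids, standing in for the partitions of an RDD.
  val partitions: Seq[Seq[Int]] = Seq(Seq(1, 2, 3), Seq(4, 5), Seq(6, 7, 8, 9))

  // One connection per record, as described in the question.
  def perRecord(): Int = {
    FakeDb.opened = 0
    partitions.foreach { records =>
      records.foreach { id =>
        val connection = FakeDb.createConnection()
        try connection.read(id) finally connection.close()
      }
    }
    FakeDb.opened
  }

  // One connection per partition, as suggested in the answer.
  def perPartition(): Int = {
    FakeDb.opened = 0
    partitions.foreach { records =>
      val connection = FakeDb.createConnection()
      try records.foreach(id => connection.read(id)) finally connection.close()
    }
    FakeDb.opened
  }

  def main(args: Array[String]): Unit = {
    println(s"per record:    ${perRecord()} connections")
    println(s"per partition: ${perPartition()} connections")
  }
}
```

With nine records spread over three partitions, the per-record version opens nine connections and the per-partition version opens three. In a real job the gap grows with the number of records per partition.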

If you use Spark Streaming:

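dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // One connection per partition, created on the executor that runs the task
    val connection = DB.createConnection()
    try {
      // records is an Iterator and can be traversed only once, so materialize it before reuse
      val batch = records.toList
      batch.foreach { r =>
        val externalData = connection.read(r.externaId)
        // do something with your data
      }
      DB.save(batch)
    } finally {
      connection.close() // always release the connection, even if a record fails
    }
  }
}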

See http://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams
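
The linked guide goes one step further and suggests reusing connections across batches through a static, lazily initialized pool. Below is a minimal sketch of that idea. The `ConnectionPool` object and `PooledConnection` class are hypothetical and are not part of Spark or of any MemSQL client, and `processPartition` stands in for the function you would pass to `foreachPartition`. A real job would more likely rely on an established pooling library.

```scala
import java.util.concurrent.ConcurrentLinkedQueue
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical connection type; a real job would wrap an actual database connection here.
final class PooledConnection(val id: Int) {
  def read(key: Int): String = s"row-$key via connection $id"
}

// A Scala object is initialized on first use, once per JVM, so each executor gets its own pool.
object ConnectionPool {
  private val idle = new ConcurrentLinkedQueue[PooledConnection]()
  private val created = new AtomicInteger(0)

  // Reuse an idle connection if one exists, otherwise open a new one.
  def getConnection(): PooledConnection =
    Option(idle.poll()).getOrElse(new PooledConnection(created.incrementAndGet()))

  // Hand the connection back for later reuse instead of closing it.
  def returnConnection(connection: PooledConnection): Unit = {
    idle.offer(connection)
    ()
  }

  def createdCount: Int = created.get()
}

object PoolDemo {
  // Stand-in for the body passed to rdd.foreachPartition.
  def processPartition(records: Iterator[Int]): List[String] = {
    val connection = ConnectionPool.getConnection()
    // toList forces the lazy iterator before the connection goes back to the pool.
    try records.map(connection.read).toList
    finally ConnectionPool.returnConnection(connection)
  }

  def main(args: Array[String]): Unit = {
    processPartition(Iterator(1, 2, 3))
    processPartition(Iterator(4, 5))
    println(s"connections created: ${ConnectionPool.createdCount}")
  }
}
```

Because the connection is returned rather than closed, the second partition processed in the same JVM reuses the first connection. The sketch leaves out idle timeouts and health checks, which a production pool would need.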



Source: https://stackoverflow.com/questions/36837588/what-is-the-correct-way-of-using-memsql-connection-object-inside-call-method-of
