Spark connection pooling - Is this the right approach?

Submitted anonymously (unverified) on 2019-12-03 09:10:12

Question:

I have a Spark Structured Streaming job that consumes data from Kafka and saves it to InfluxDB. I have implemented the connection pooling mechanism as follows:

import java.util.concurrent.{LinkedBlockingQueue, TimeUnit}
import org.influxdb.{InfluxDB, InfluxDBFactory}

object InfluxConnectionPool {
  val queue = new LinkedBlockingQueue[InfluxDB]()

  def initialize(database: String): Unit = {
    while (!isConnectionPoolFull) {
      queue.put(createNewConnection(database))
    }
  }

  private def isConnectionPoolFull: Boolean = {
    val MAX_POOL_SIZE = 1000
    queue.size >= MAX_POOL_SIZE
  }

  def getConnectionFromPool: InfluxDB = {
    if (queue.size > 0) {
      queue.take()
    } else {
      System.err.println("InfluxDB connection limit reached.")
      null
    }
  }

  private def createNewConnection(database: String) = {
    val influxDBUrl = "..."
    val influxDB = InfluxDBFactory.connect(...)
    influxDB.enableBatch(10, 100, TimeUnit.MILLISECONDS)
    influxDB.setDatabase(database)
    influxDB.setRetentionPolicy(database + "_rp")
    influxDB
  }

  def returnConnectionToPool(connection: InfluxDB): Unit = {
    queue.put(connection)
  }
}

In my Spark job, I do the following:

def run(): Unit = {
  val spark = SparkSession
    .builder
    .appName("ETL JOB")
    .master("local[4]")
    .getOrCreate()

  ...
  // This is where I create the connection pool
  InfluxConnectionPool.initialize("dbname")

  val sdvWriter = new ForeachWriter[record] {
    var influxDB: InfluxDB = _

    def open(partitionId: Long, version: Long): Boolean = {
      influxDB = InfluxConnectionPool.getConnectionFromPool
      true
    }

    def process(record: record) = {
      // this is where I use the connection object and save the data
      MyService.saveData(influxDB, record.topic, record.value)
      InfluxConnectionPool.returnConnectionToPool(influxDB)
    }

    def close(errorOrNull: Throwable): Unit = {
    }
  }

  import spark.implicits._
  import org.apache.spark.sql.functions._

  // Read data from Kafka
  val kafkaStreamingDF = spark
    .readStream
    ...

  val sdvQuery = kafkaStreamingDF
    .writeStream
    .foreach(sdvWriter)
    .start()
}

But when I run the job, I get the following exception:

18/05/07 00:00:43 ERROR StreamExecution: Query [id = 6af3c096-7158-40d9-9523-13a6bffccbb8, runId = 3b620d11-9b93-462b-9929-ccd2b1ae9027] terminated with error
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 8, 192.168.222.5, executor 1): java.lang.NullPointerException
    at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:332)
    at com.abc.telemetry.app.influxdb.InfluxConnectionPool$.returnConnectionToPool(InfluxConnectionPool.scala:47)
    at com.abc.telemetry.app.ETLappSave$$anon$1.process(ETLappSave.scala:55)
    at com.abc.telemetry.app.ETLappSave$$anon$1.process(ETLappSave.scala:46)
    at org.apache.spark.sql.execution.streaming.ForeachSink$$anonfun$addBatch$1.apply(ForeachSink.scala:53)
    at org.apache.spark.sql.execution.streaming.ForeachSink$$anonfun$addBatch$1.apply(ForeachSink.scala:49)

The NPE occurs when the connection is returned to the pool in queue.put(connection). What am I missing here? Any help is appreciated.

P.S.: In the regular DStreams approach, I did this with the foreachPartition method, roughly as sketched below. I'm not sure how to do connection reuse/pooling with Structured Streaming.
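A rough sketch of that DStreams pattern (reusing the InfluxConnectionPool and MyService from above; dstream stands in for my actual DStream of record objects):

// Sketch of the DStreams pattern: check out one connection per partition,
// reuse it for every record in that partition, then return it to the pool.
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val influxDB = InfluxConnectionPool.getConnectionFromPool
    records.foreach(r => MyService.saveData(influxDB, r.topic, r.value))
    InfluxConnectionPool.returnConnectionToPool(influxDB)
  }
}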

Answer 1:

datasetOfString.writeStream.foreach(new ForeachWriter[String] {
  def open(partitionId: Long, version: Long): Boolean = {
    // open connection
  }
  def process(record: String) = {
    // write string to connection
  }
  def close(errorOrNull: Throwable): Unit = {
    // close the connection
  }
})

From the docs of ForeachWriter,

Each task will get a fresh serialized-deserialized copy of the provided object

So whatever you initialize outside the ForeachWriter runs only on the driver. On each executor, the InfluxConnectionPool object starts with an empty queue because initialize was never called there; getConnectionFromPool therefore returns null, and passing that null to queue.put in returnConnectionToPool is what throws the NullPointerException.

You need to initialize the connection pool and open the connection in the open method.
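A minimal sketch of the writer with that change (assuming the same InfluxConnectionPool object, extended with a hypothetical isInitialized flag so each executor JVM fills the pool only once):

// Sketch only: the pool is built lazily on the executor inside open(),
// not on the driver. `isInitialized` is a hypothetical flag assumed to be
// added to InfluxConnectionPool; the synchronized block keeps concurrent
// tasks in the same executor JVM from initializing the pool twice.
val sdvWriter = new ForeachWriter[record] {
  var influxDB: InfluxDB = _

  def open(partitionId: Long, version: Long): Boolean = {
    InfluxConnectionPool.synchronized {
      if (!InfluxConnectionPool.isInitialized) {
        InfluxConnectionPool.initialize("dbname")
      }
    }
    influxDB = InfluxConnectionPool.getConnectionFromPool
    influxDB != null // skip the partition if no connection is available
  }

  def process(r: record): Unit = {
    MyService.saveData(influxDB, r.topic, r.value)
  }

  def close(errorOrNull: Throwable): Unit = {
    // Return the connection once per partition, not once per record.
    if (influxDB != null) InfluxConnectionPool.returnConnectionToPool(influxDB)
  }
}

Returning the connection in close rather than in process also avoids handing the same connection to another task while the current partition is still writing with it.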


