Spark connection pooling - Is this the right approach?

Submitted by 吃可爱长大的小学妹 on 2019-12-08 04:05:02

Question


I have a Spark Structured Streaming job that consumes data from Kafka and saves it to InfluxDB. I have implemented the connection pooling mechanism as follows:

import java.util.concurrent.{LinkedBlockingQueue, TimeUnit}
import org.influxdb.{InfluxDB, InfluxDBFactory}

object InfluxConnectionPool {
      val queue = new LinkedBlockingQueue[InfluxDB]()

      def initialize(database: String): Unit = {
        while (!isConnectionPoolFull) {
          queue.put(createNewConnection(database))
        }
      }

      private def isConnectionPoolFull: Boolean = {
        val MAX_POOL_SIZE = 1000
        queue.size >= MAX_POOL_SIZE
      }

      def getConnectionFromPool: InfluxDB = {
        if (queue.size > 0) {
          queue.take()
        } else {
          System.err.println("InfluxDB connection limit reached.")
          null
        }
      }

      private def createNewConnection(database: String) = {
        val influxDBUrl = "..."
        val influxDB = InfluxDBFactory.connect(...)
        influxDB.enableBatch(10, 100, TimeUnit.MILLISECONDS)
        influxDB.setDatabase(database)
        influxDB.setRetentionPolicy(database + "_rp")
        influxDB
      }

      def returnConnectionToPool(connection: InfluxDB): Unit = {
        queue.put(connection)
      }
    }

In my Spark job, I do the following:

def run(): Unit = {

val spark = SparkSession
  .builder
  .appName("ETL JOB")
  .master("local[4]")
  .getOrCreate()


 ...

 // This is where I create connection pool
InfluxConnectionPool.initialize("dbname")

val sdvWriter = new ForeachWriter[record] {
  var influxDB:InfluxDB = _

  def open(partitionId: Long, version: Long): Boolean = {
    influxDB = InfluxConnectionPool.getConnectionFromPool
    true
  }
  def process(record: record) = {
    // this is where I use the connection object and save the data
    MyService.saveData(influxDB, record.topic, record.value)
    InfluxConnectionPool.returnConnectionToPool(influxDB)
  }
  def close(errorOrNull: Throwable): Unit = {
  }
}

import spark.implicits._
import org.apache.spark.sql.functions._

//Read data from kafka
val kafkaStreamingDF = spark
  .readStream
  ....

val sdvQuery = kafkaStreamingDF
  .writeStream
  .foreach(sdvWriter)
  .start()
  }

But when I run the job, I get the following exception:

18/05/07 00:00:43 ERROR StreamExecution: Query [id = 6af3c096-7158-40d9-9523-13a6bffccbb8, runId = 3b620d11-9b93-462b-9929-ccd2b1ae9027] terminated with error
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 8, 192.168.222.5, executor 1): java.lang.NullPointerException
        at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:332)
        at com.abc.telemetry.app.influxdb.InfluxConnectionPool$.returnConnectionToPool(InfluxConnectionPool.scala:47)
        at com.abc.telemetry.app.ETLappSave$$anon$1.process(ETLappSave.scala:55)
        at com.abc.telemetry.app.ETLappSave$$anon$1.process(ETLappSave.scala:46)
        at org.apache.spark.sql.execution.streaming.ForeachSink$$anonfun$addBatch$1.apply(ForeachSink.scala:53)
        at org.apache.spark.sql.execution.streaming.ForeachSink$$anonfun$addBatch$1.apply(ForeachSink.scala:49)

The NPE occurs when the connection is returned to the pool via queue.put(connection). What am I missing here? Any help is appreciated.

P.S.: In the regular DStreams approach, I did this with the foreachPartition method (a rough sketch of that pattern is included below for reference). I'm not sure how to do connection reuse/pooling with Structured Streaming.
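
For reference, this is roughly the pattern I mean. It is only a sketch: it assumes a DStream named kafkaDStream whose records are (topic, value) pairs, and it reuses the InfluxConnectionPool and MyService.saveData from above.

kafkaDStream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // Runs on the executor, once per partition: borrow one connection,
    // write every record of the partition, then return the connection.
    val influxDB = InfluxConnectionPool.getConnectionFromPool
    try {
      partition.foreach { case (topic, value) =>
        MyService.saveData(influxDB, topic, value)
      }
    } finally {
      InfluxConnectionPool.returnConnectionToPool(influxDB)
    }
  }
}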


Answer 1:


I am using ForeachWriter for Redis similarly, where the pool is referenced only inside process. Your code would look something like this:

def open(partitionId: Long, version: Long): Boolean = {
    true
  }

  def process(record: record) = {
    val influxDB = InfluxConnectionPool.getConnectionFromPool
    // this is where I use the connection object and save the data
    MyService.saveData(influxDB, record.topic, record.value)
    InfluxConnectionPool.returnConnectionToPool(influxDB)
  }



Answer 2:


datasetOfString.writeStream.foreach(new ForeachWriter[String] {
      def open(partitionId: Long, version: Long): Boolean = {
        // open connection
      }
      def process(record: String) = {
        // write string to connection
      }
      def close(errorOrNull: Throwable): Unit = {
        // close the connection
      }
    })

From the docs of ForeachWriter,

Each task will get a fresh serialized-deserialized copy of the provided object

So whatever you initialize outside the ForeachWriter runs only on the driver; each executor task gets its own copy of the writer without that driver-side state.

You need to initialize the connection pool and open the connection in the open method.
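
For illustration, here is a minimal sketch of that idea, reusing the InfluxConnectionPool and record type from your question. The initialize call inside open is an assumption; the pool would need to guard it so it is safe and cheap to call from every task.

val writer = new ForeachWriter[record] {
  var influxDB: InfluxDB = _

  def open(partitionId: Long, version: Long): Boolean = {
    // Runs on the executor: make sure the pool exists here,
    // then borrow one connection for this partition/epoch.
    InfluxConnectionPool.initialize("dbname") // assumed to be guarded/idempotent
    influxDB = InfluxConnectionPool.getConnectionFromPool
    influxDB != null
  }

  def process(record: record): Unit = {
    MyService.saveData(influxDB, record.topic, record.value)
  }

  def close(errorOrNull: Throwable): Unit = {
    // Return the connection once per partition, not per record
    if (influxDB != null) InfluxConnectionPool.returnConnectionToPool(influxDB)
  }
}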



Source: https://stackoverflow.com/questions/50205650/spark-connection-pooling-is-this-the-right-approach
