问题:

I have a spark job in Structured Streaming that consumes data from Kafka and saves it to InfluxDB. I have implemented the connection pooling mechanism as follows:

object InfluxConnectionPool {       val queue = new LinkedBlockingQueue[InfluxDB]()        def initialize(database: String): Unit = {         while (!isConnectionPoolFull) {           queue.put(createNewConnection(database))         }       }        private def isConnectionPoolFull: Boolean = {         val MAX_POOL_SIZE = 1000         if (queue.size < MAX_POOL_SIZE)           false         else           true       }        def getConnectionFromPool: InfluxDB = {         if (queue.size > 0) {           val connection = queue.take()           connection         } else {           System.err.println("InfluxDB connection limit reached. ");           null         }        }        private def createNewConnection(database: String) = {         val influxDBUrl = "..."         val influxDB = InfluxDBFactory.connect(...)         influxDB.enableBatch(10, 100, TimeUnit.MILLISECONDS)         influxDB.setDatabase(database)         influxDB.setRetentionPolicy(database + "_rp")         influxDB       }        def returnConnectionToPool(connection: InfluxDB): Unit = {         queue.put(connection)       }     }

In my spark job, I do the following

def run(): Unit = {  val spark = SparkSession   .builder   .appName("ETL JOB")   .master("local[4]")   .getOrCreate()    ...   // This is where I create connection pool InfluxConnectionPool.initialize("dbname")  val sdvWriter = new ForeachWriter[record] {   var influxDB:InfluxDB = _    def open(partitionId: Long, version: Long): Boolean = {     influxDB = InfluxConnectionPool.getConnectionFromPool     true   }   def process(record: record) = {     // this is where I use the connection object and save the data     MyService.saveData(influxDB, record.topic, record.value)     InfluxConnectionPool.returnConnectionToPool(influxDB)   }   def close(errorOrNull: Throwable): Unit = {   } }  import spark.implicits._ import org.apache.spark.sql.functions._  //Read data from kafka val kafkaStreamingDF = spark   .readStream   ....  val sdvQuery = kafkaStreamingDF   .writeStream   .foreach(sdvWriter)   .start()   }

But, when I run the job, I get the following exception

18/05/07 00:00:43 ERROR StreamExecution: Query [id = 6af3c096-7158-40d9-9523-13a6bffccbb8, runId = 3b620d11-9b93-462b-9929-ccd2b1ae9027] terminated with error     org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 8, 192.168.222.5, executor 1): java.lang.NullPointerException         at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:332)         at com.abc.telemetry.app.influxdb.InfluxConnectionPool$.returnConnectionToPool(InfluxConnectionPool.scala:47)         at com.abc.telemetry.app.ETLappSave$$anon$1.process(ETLappSave.scala:55)         at com.abc.telemetry.app.ETLappSave$$anon$1.process(ETLappSave.scala:46)         at org.apache.spark.sql.execution.streaming.ForeachSink$$anonfun$addBatch$1.apply(ForeachSink.scala:53)         at org.apache.spark.sql.execution.streaming.ForeachSink$$anonfun$addBatch$1.apply(ForeachSink.scala:49)

The NPE is when the connection is returned to the connection pool in queue.put(connection). What am I missing here? Any help appreciated.

P.S: In the regular DStreams approach, I did it with foreachPartition method. Not sure how to do connection reuse/pooling with structured streaming.

回答1:

datasetOfString.writeStream.foreach(new ForeachWriter[String] {       def open(partitionId: Long, version: Long): Boolean = {         // open connection       }       def process(record: String) = {         // write string to connection       }       def close(errorOrNull: Throwable): Unit = {         // close the connection       }     })

From the docs of ForeachWriter,

Each task will get a fresh serialized-deserialized copy of the provided object

So whatever you initialize outside the ForeachWriter will run only at the driver.

You need to initialize the connection pool and open the connection in the open method.

转载请标明出处:Spark connection pooling - Is this the right approach

文章来源: Spark connection pooling - Is this the right approach

标签

spark

InfluxDB