spark-streaming and connection pool implementation

时光说笑 2020-12-02 11:49

The Spark Streaming programming guide at https://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams shows the following code:



        
2 answers
  •  慢半拍i (original poster)
    2020-12-02 12:13

    The answer below is wrong! I'm leaving it here for reference, but it is wrong for the following reason. socketPool is declared as a lazy val, so it is instantiated on first access. The SocketPool cannot be serialized (the GenericObjectPool it wraps is not Serializable), so it is never shipped from the driver. Instead, each deserialized copy of the publisher creates its own pool, which means a new pool is instantiated within each partition. That makes the connection pool useless, because we want to keep connections across partitions and RDDs. It makes no difference whether this is implemented as a companion object or as a case class. Bottom line: the connection pool must be Serializable, and Apache Commons Pool is not.
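
The lazy val behaviour described above can be reproduced without Spark. The following is a minimal sketch, not part of the original answer: `FakePool`, `Holder` and `LazyValDemo` are hypothetical names, and plain Java serialization stands in for the way a closure is shipped to a task. Each deserialized copy is expected to build its own pool on first access, so nothing is shared between copies.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for a pool that is not Serializable (like GenericObjectPool).
class FakePool

// Same shape as PooledSocketStreamPublisher: a Serializable holder with a lazy pool.
class Holder extends Serializable {
  lazy val pool: FakePool = new FakePool
}

object LazyValDemo {
  // Serialize and deserialize a value, roughly as happens when a closure is sent to a task.
  def roundTrip[A <: AnyRef](value: A): A = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(value) finally out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[A] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val holder = new Holder        // "driver" side; the pool is never touched here
    val task1 = roundTrip(holder)  // copy seen by one task
    val task2 = roundTrip(holder)  // copy seen by another task
    // Each copy initializes its own pool on first access.
    println(s"pools shared: ${task1.pool eq task2.pool}")
  }
}
```

The holder serializes without error only because the pool has not been initialized on the "driver" side; touching `holder.pool` before `roundTrip` would instead fail with a `NotSerializableException`. The original answer's code follows.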

    import java.io.PrintStream
    import java.net.Socket
    
    import org.apache.commons.pool2.{PooledObject, BasePooledObjectFactory}
    import org.apache.commons.pool2.impl.{DefaultPooledObject, GenericObjectPool}
    import org.apache.spark.streaming.dstream.DStream
    
    /**
     * Publish a Spark stream to a socket.
     */
    class PooledSocketStreamPublisher[T](host: String, port: Int)
      extends Serializable {
    
        lazy val socketPool = SocketPool(host, port)
    
        /**
         * Publish the stream to a socket.
         */
        def publishStream(stream: DStream[T], callback: (T) => String) = {
            stream.foreachRDD { rdd =>
    
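                rdd.foreachPartition { partition =>
    
                    val socket = socketPool.getSocket
                    try {
                        val out = new PrintStream(socket.getOutputStream)
    
                        partition.foreach { event =>
                            val text: String = callback(event)
                            out.println(text)
                            out.flush()
                        }
                        // Do not close `out`: closing the stream also closes the
                        // socket, and a closed socket must not go back to the pool.
                    } finally {
                        // Return the socket even if writing failed.
                        socketPool.returnSocket(socket)
                    }
    
                }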
            }
        }
    
    }
    
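    /**
     * Factory used by the pool to open new sockets to host:port.
     */
    class SocketFactory(host: String, port: Int) extends BasePooledObjectFactory[Socket] {
    
        // Called by the pool whenever it needs a new connection.
        override def create(): Socket = {
            new Socket(host, port)
        }
    
        // Wraps the raw socket so the pool can track its state.
        override def wrap(socket: Socket): PooledObject[Socket] = {
            new DefaultPooledObject[Socket](socket)
        }
    
    }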
    
    case class SocketPool(host: String, port: Int) {
    
        val socketPool = new GenericObjectPool[Socket](new SocketFactory(host, port))
    
        def getSocket: Socket = {
            socketPool.borrowObject
        }
    
        def returnSocket(socket: Socket) = {
            socketPool.returnObject(socket)
        }
    
    }
    

    which you can invoke as follows:

    val socketStreamPublisher = new PooledSocketStreamPublisher[MyEvent](host = "10.10.30.101", port = 29009)
    socketStreamPublisher.publishStream(myEventStream, (e: MyEvent) => Json.stringify(Json.toJson(e)))
    
