Why does Spark Streaming stop working when I send two input streams?

Submitted by 丶灬走出姿态 on 2019-12-08 07:16:38

Question


I am developing a Spark Streaming application in which I need to consume input streams from two servers written in Python, each sending one JSON message per second to the Spark context.

My problem is that if I perform operations on just one stream, everything works well. But if I have two streams from different servers, Spark freezes before it prints anything, and only starts working again once both servers have sent all the JSON messages they had to send (when it detects that socketTextStream is not receiving data).

Here is my code:

    JavaReceiverInputDStream<String> streamData1 = ssc.socketTextStream("localhost", 996,
            StorageLevels.MEMORY_AND_DISK_SER);

    JavaReceiverInputDStream<String> streamData2 = ssc.socketTextStream("localhost", 9995,
            StorageLevels.MEMORY_AND_DISK_SER);

    // Tag every line from the first stream with key 1 so its origin can be told apart later.
    JavaPairDStream<Integer, String> dataStream1 = streamData1.mapToPair(new PairFunction<String, Integer, String>() {
        @Override
        public Tuple2<Integer, String> call(String stream) throws Exception {
            return new Tuple2<Integer, String>(1, stream);
        }
    });

    // Tag every line from the second stream with key 2.
    JavaPairDStream<Integer, String> dataStream2 = streamData2.mapToPair(new PairFunction<String, Integer, String>() {
        @Override
        public Tuple2<Integer, String> call(String stream) throws Exception {
            return new Tuple2<Integer, String>(2, stream);
        }
    });

    dataStream2.print(); // for example

Notice that there are no ERROR messages; Spark simply freezes after starting the context, and even though JSON messages are arriving on the ports, it doesn't print anything.

Thank you very much.


Answer 1:


Take a look at these caveats from the Spark Streaming documentation and see if they apply (a minimal example of the local-mode fix follows the list):

Points to remember

  • When running a Spark Streaming program locally, do not use “local” or “local[1]” as the master URL. Either of these means that only one thread will be used for running tasks locally. If you are using an input DStream based on a receiver (e.g. sockets, Kafka, Flume, etc.), then the single thread will be used to run the receiver, leaving no thread for processing the received data. Hence, when running locally, always use “local[n]” as the master URL, where n > number of receivers to run (see Spark Properties for information on how to set the master).
  • Extending the logic to running on a cluster, the number of cores allocated to the Spark Streaming application must be more than the number of receivers. Otherwise the system will receive data, but not be able to process it.
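Here is a minimal sketch of a driver set up this way, assuming the Spark Streaming Java API and the same two localhost socket sources as in the question; the class name, app name, and batch interval are illustrative, not taken from the original code. With “local[3]”, each of the two receivers gets a thread and at least one thread remains for processing the received batches:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.StorageLevels;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class TwoSocketStreams {
        public static void main(String[] args) throws InterruptedException {
            // Two socket receivers each pin one thread, so the master must provide
            // at least three: two for the receivers, one or more for processing.
            SparkConf conf = new SparkConf()
                    .setAppName("two-socket-streams") // illustrative app name
                    .setMaster("local[3]");
            JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(1));

            // Ports taken from the question; adjust to wherever the servers listen.
            JavaReceiverInputDStream<String> streamData1 =
                    ssc.socketTextStream("localhost", 996, StorageLevels.MEMORY_AND_DISK_SER);
            JavaReceiverInputDStream<String> streamData2 =
                    ssc.socketTextStream("localhost", 9995, StorageLevels.MEMORY_AND_DISK_SER);

            streamData1.print();
            streamData2.print();

            ssc.start();
            ssc.awaitTermination();
        }
    }

With “local” or “local[1]” (or “local[2]” with two receivers), the receiver tasks occupy all available threads and no batch is ever processed until the sockets close and the receivers stop, which matches the freeze described in the question.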


Source: https://stackoverflow.com/questions/37116079/why-does-spark-streaming-stop-working-when-i-send-two-input-streams
