Question
I am developing a Spark Streaming application in which I need to use the input streams from two servers (written in Python), each sending a JSON message per second to the Spark context.
My problem is that if I perform operations on just one stream, everything works well. But if I have two streams from different servers, Spark freezes just before it can print anything, and only starts working again once both servers have sent all the JSON messages they had to send (that is, when socketTextStream detects that it is no longer receiving data).
Here is my code:
JavaReceiverInputDStream<String> streamData1 = ssc.socketTextStream(
        "localhost", 996, StorageLevels.MEMORY_AND_DISK_SER);

JavaReceiverInputDStream<String> streamData2 = ssc.socketTextStream(
        "localhost", 9995, StorageLevels.MEMORY_AND_DISK_SER);

// Tag every record from the first stream with key 1
JavaPairDStream<Integer, String> dataStream1 = streamData1.mapToPair(new PairFunction<String, Integer, String>() {
    public Tuple2<Integer, String> call(String stream) throws Exception {
        Tuple2<Integer, String> streamPair = new Tuple2<Integer, String>(1, stream);
        return streamPair;
    }
});

// Tag every record from the second stream with key 2
JavaPairDStream<Integer, String> dataStream2 = streamData2.mapToPair(new PairFunction<String, Integer, String>() {
    public Tuple2<Integer, String> call(String stream) throws Exception {
        Tuple2<Integer, String> streamPair = new Tuple2<Integer, String>(2, stream);
        return streamPair;
    }
});

dataStream2.print(); // for example
Notice that there are no ERROR messages; Spark simply freezes after starting the context, and even though JSON messages are arriving on the ports, it doesn't print anything.
Thank you very much.
Answer 1:
Take a look at these caveats from the Spark Streaming documentation and see if they apply (a sketch illustrating the first point follows the quoted list):
Points to remember
- When running a Spark Streaming program locally, do not use “local” or “local[1]” as the master URL. Either of these means that only one thread will be used for running tasks locally. If you are using an input DStream based on a receiver (e.g. sockets, Kafka, Flume, etc.), then the single thread will be used to run the receiver, leaving no thread for processing the received data. Hence, when running locally, always use “local[n]” as the master URL, where n > number of receivers to run (see Spark Properties for information on how to set the master).
- Extending the logic to running on a cluster, the number of cores allocated to the Spark Streaming application must be more than the number of receivers. Otherwise the system will receive data, but not be able to process it.
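In your case the code opens two socket receivers, so each receiver pins one thread and the local master must provide at least three. Below is a minimal, self-contained sketch of that setup; the batch interval, the port numbers, and the union-then-print step are assumptions for illustration, not your original configuration:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.StorageLevels;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class TwoReceiverExample {
    public static void main(String[] args) throws InterruptedException {
        // Two socket receivers each occupy one thread, so the local master
        // needs at least 3 threads: 2 for the receivers + 1 for processing.
        SparkConf conf = new SparkConf()
                .setAppName("TwoReceiverExample")
                .setMaster("local[3]");   // not "local" or "local[1]"

        // A 1-second batch interval is an assumption for this sketch.
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(1));

        // Ports are placeholders; use the ports your servers actually listen on.
        JavaReceiverInputDStream<String> streamData1 =
                ssc.socketTextStream("localhost", 9996, StorageLevels.MEMORY_AND_DISK_SER);
        JavaReceiverInputDStream<String> streamData2 =
                ssc.socketTextStream("localhost", 9995, StorageLevels.MEMORY_AND_DISK_SER);

        // With enough threads, both streams are received and printed every batch.
        streamData1.union(streamData2).print();

        ssc.start();
        ssc.awaitTermination();
    }
}

With "local[3]" (or more) each receiver gets its own thread and a spare thread is left to run the output operations, which is what unblocks print() when two streams are active at once.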
Source: https://stackoverflow.com/questions/37116079/why-does-spark-streaming-stop-working-when-i-send-two-input-streams