Flink Kafka - how to make App run in Parallel?

Question

I am creating an app in Flink to:

Read messages from a topic
Do some simple processing on them
Write the result to a different topic

My code works, but it does not run in parallel.
It seems to run on only one thread/block.
How do I make it run in parallel?

On the Flink Web Dashboard:

The app goes into running status
But there is only one block shown in the subtasks overview
And Bytes Received / Sent and Records Received / Sent are always zero (no update)

Here is my code. Please help me learn how to split my app so that it can run in parallel. Am I writing the app correctly?

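import java.util.Properties;

import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer09;
// Package used by older Flink releases; newer ones provide this class in
// org.apache.flink.api.common.serialization.
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;
// JSONObject and SimpleJsonSchema come from the asker's JSON library and
// project code; their imports are not shown in the original.

public class SimpleApp {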

    public static void main(String[] args) throws Exception {

        // Create the execution environment
        StreamExecutionEnvironment env_in =
                StreamExecutionEnvironment.getExecutionEnvironment();
        // Use event time as the time characteristic
        env_in.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // Parallelism is taken from the first command-line argument
        // (does NOT work if greater than 1)
        env_in.setParallelism(Integer.parseInt(args[0]));

        // configure kafka consumer
        Properties properties = new Properties();
        properties.setProperty("zookeeper.connect", "localhost:2181");
        properties.setProperty("bootstrap.servers", "localhost:9092");
        properties.setProperty("auto.offset.reset", "earliest");

        // Create a Kafka consumer for the "test" topic
        final DataStream<String> consumer = env_in
                .addSource(new FlinkKafkaConsumer09<>("test",
                        new SimpleStringSchema(), properties));

        // Keep only records that start with "PS"
        SingleOutputStreamOperator<String> result = consumer.filter(
                new FilterFunction<String>() {
            @Override
            public boolean filter(String s) throws Exception {
                // startsWith avoids an exception on records shorter than 2 characters
                return s.startsWith("PS");
            }
        });

        // Process data: transform String records into JSON objects
        SingleOutputStreamOperator<JSONObject> data = result.map(
                new MapFunction<String, JSONObject>() {
            @Override
            public JSONObject map(String value) throws Exception
            {
                JSONObject jsnobj = new JSONObject();

                if (value.startsWith("PS"))
                {
                    // 1. Raw Data
                    jsnobj.put("Raw_Data", value.substring(0, value.length()-6));

                    // 2. Comment
                    int first_index_comment = value.indexOf("$");
                    int last_index_comment  = value.lastIndexOf("$") + 1;
                    //   - set comment
                    String comment = value.substring(first_index_comment, last_index_comment);
                    comment = comment.substring(0, comment.length()-6);
                    jsnobj.put("Comment", comment);
                }
                else {
                    jsnobj.put("INVALID", value);
                }

                return jsnobj;
            }
        });

        // Write JSON to Kafka Topic
        data.addSink(new FlinkKafkaProducer09<JSONObject>("localhost:9092",
                "FilteredData",
                new SimpleJsonSchema()));

        env_in.execute();
    }
}

My code works, but in the web interface it seems to run on only a single thread (one block shown), and no data appears to be passed, since the bytes sent / received are never updated.

How do I make it run in parallel?

Answer 1:

To run your job in parallel, you can do two things:

Increase the parallelism of your job at the environment level, i.e. do something like:

StreamExecutionEnvironment env_in = StreamExecutionEnvironment.getExecutionEnvironment().setParallelism(4);

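But this only increases parallelism on the Flink side, after the data has been read. Each Kafka partition is consumed by only one source subtask, so if the topic has fewer partitions than the job's parallelism, the extra source subtasks sit idle and the job might not be fully utilized.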
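
Since the question's code takes the parallelism from the first command-line argument, the hard-coded 4 above could be replaced by a parsed value. A minimal sketch, assuming a hypothetical helper named parseParallelism (it is not part of Flink) that falls back to a default when the argument is missing or invalid:

```java
public class ParallelismArg {

    // Hypothetical helper (not a Flink API): returns the parallelism given as the
    // first command-line argument, or defaultValue when it is missing or invalid.
    static int parseParallelism(String[] args, int defaultValue) {
        if (args == null || args.length == 0) {
            return defaultValue;
        }
        try {
            int parsed = Integer.parseInt(args[0].trim());
            // A parallelism below 1 is not meaningful, so fall back to the default.
            return parsed >= 1 ? parsed : defaultValue;
        } catch (NumberFormatException e) {
            return defaultValue;
        }
    }

    public static void main(String[] args) {
        int parallelism = parseParallelism(args, 1);
        System.out.println("parallelism = " + parallelism);
        // In the job itself this value would be passed on, e.g.:
        // env_in.setParallelism(parallelism);
    }
}
```

This avoids an ArrayIndexOutOfBoundsException when the job is submitted without arguments, which the original Integer.parseInt(args[0]) call would throw.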

To fully parallelize your job, set up multiple partitions for your Kafka topic, ideally as many as the parallelism you want for your Flink job. So, you might want to do something like the following when creating your Kafka topic (the replication factor cannot exceed the number of brokers, so use 1 on a single-broker setup such as localhost):

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 4 --topic test
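
To see why the partition count matters, here is a small plain-Java sketch of how partitions spread over source subtasks. It assumes a simplified model in which partition p is read by subtask p % parallelism; Flink's actual assignment may differ in detail, and the class and method names are made up for illustration. What it does show reliably is that a partition is never split across subtasks:

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionAssignmentSketch {

    // Simplified model (an assumption, not Flink's exact algorithm):
    // partition p is read by source subtask p % parallelism.
    static List<List<Integer>> assign(int numPartitions, int parallelism) {
        List<List<Integer>> bySubtask = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            bySubtask.add(new ArrayList<>());
        }
        for (int p = 0; p < numPartitions; p++) {
            bySubtask.get(p % parallelism).add(p);
        }
        return bySubtask;
    }

    // Number of source subtasks that have at least one partition to read.
    static long busySubtasks(int numPartitions, int parallelism) {
        return assign(numPartitions, parallelism).stream()
                .filter(partitions -> !partitions.isEmpty())
                .count();
    }

    public static void main(String[] args) {
        // One partition, parallelism 4: only one subtask has work.
        System.out.println(assign(1, 4));  // [[0], [], [], []]
        // Four partitions, parallelism 4: every subtask has work.
        System.out.println(assign(4, 4));  // [[0], [1], [2], [3]]
    }
}
```

With a single-partition topic, three of the four source subtasks have nothing to read, no matter what value is passed to setParallelism.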

Source: https://stackoverflow.com/questions/46338574/flink-kafka-how-to-make-app-run-in-parallel

Tags

java

parallel-processing

apache-kafka

apache-flink