Flink Kafka - how to make App run in Parallel?


Question


I am creating an app in Flink to:

  1. Read Messages from a topic
  2. Do some simple process on it
  3. Write Result to a different topic

My code works, but it does not run in parallel; it seems to run on only one thread/block.
How do I make it run in parallel?

On the Flink Web Dashboard:

  • The app goes to RUNNING status
  • But only one block is shown in the subtasks overview
  • And Bytes Received/Sent and Records Received/Sent are always zero (no updates)

Here is my code. Please help me understand how to split my app so that it runs in parallel, and whether I am writing the app correctly.

public class SimpleApp {

    public static void main(String[] args) throws Exception {

        // create execution environment INPUT
        StreamExecutionEnvironment env_in = StreamExecutionEnvironment.getExecutionEnvironment();
        // event time characteristic
        env_in.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // production Ready (Does NOT Work if greater than 1)
        env_in.setParallelism(Integer.parseInt(args[0]));

        // configure kafka consumer
        Properties properties = new Properties();
        properties.setProperty("zookeeper.connect", "localhost:2181");
        properties.setProperty("bootstrap.servers", "localhost:9092");
        properties.setProperty("auto.offset.reset", "earliest");

        // create a kafka consumer
        final DataStream<String> consumer = env_in
                .addSource(new FlinkKafkaConsumer09<>("test", new SimpleStringSchema(), properties));

        // filter data
        SingleOutputStreamOperator<String> result = consumer.filter(new FilterFunction<String>() {
            @Override
            public boolean filter(String s) throws Exception {
                return s.substring(0, 2).contentEquals("PS");
            }
        });

        // Process Data
        // Transform String Records to JSON Objects
        SingleOutputStreamOperator<JSONObject> data = result.map(new MapFunction<String, JSONObject>() {
            @Override
            public JSONObject map(String value) throws Exception
            {
                JSONObject jsnobj = new JSONObject();

                if(value.substring(0, 2).contentEquals("PS"))
                {
                    // 1. Raw Data
                    jsnobj.put("Raw_Data", value.substring(0, value.length()-6));

                    // 2. Comment
                    int first_index_comment = value.indexOf("$");
                    int last_index_comment  = value.lastIndexOf("$") + 1;
                    //   - set comment
                    String comment = value.substring(first_index_comment, last_index_comment);
                    comment = comment.substring(0, comment.length()-6);
                    jsnobj.put("Comment", comment);
                }
                else {
                    jsnobj.put("INVALID", value);
                }

                return jsnobj;
            }
        });

        // Write JSON to Kafka Topic
        data.addSink(new FlinkKafkaProducer09<JSONObject>("localhost:9092",
                "FilteredData",
                new SimpleJsonSchema()));

        env_in.execute();
    }
}

My code does work, but it seems to run on only a single thread (one block shown in the web interface), and no data flows through it, so the bytes sent/received are never updated.

How do I make it run in parallel?


Answer 1:


To run your job in parallel, you can do two things:

  1. Increase the parallelism of your job at the environment level, i.e. do something like:

StreamExecutionEnvironment env_in = StreamExecutionEnvironment.getExecutionEnvironment().setParallelism(4);

But this only increases parallelism on the Flink side after the data has been read. Flink's Kafka source runs at most one consumer subtask per partition, so if the topic has only one partition the source itself stays single-threaded, and if the producer is fast the extra parallelism will not be fully utilized. A per-operator variant of this setting is sketched below.
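For illustration, here is a minimal, self-contained sketch of setting parallelism both at the environment level and per operator, assuming the same Flink 1.x / Kafka 0.9 connector setup as the question. The class name ParallelismSketch, the parallelism value 4, and the print() sink are illustrative choices, not part of the original code, and the import paths for SimpleStringSchema depend on the Flink version.

import java.util.Properties;

import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class ParallelismSketch {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(4);   // default parallelism for every operator in the job

        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "localhost:9092");

        DataStream<String> source = env
                .addSource(new FlinkKafkaConsumer09<>("test", new SimpleStringSchema(), properties))
                .setParallelism(4);   // at most one consumer subtask per Kafka partition is active

        source.filter(new FilterFunction<String>() {
                    @Override
                    public boolean filter(String s) throws Exception {
                        return s.startsWith("PS");
                    }
                })
                .setParallelism(4)    // operator-level override of the environment default
                .print();

        env.execute("parallelism sketch");
    }
}

Operator-level setParallelism() overrides the environment default for that one operator only, which lets you keep the source at the partition count while scaling the processing stages independently.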

  2. To fully parallelize your job, set up multiple partitions for your Kafka topic, ideally matching the parallelism you want for your Flink job. So you might want to do something like the following when creating your Kafka topic:

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 4 --topic test
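If repartitioning the topic is not immediately possible, a hedged alternative (not mentioned in the original answer) is to redistribute records after the source with rebalance(), so the downstream operators still run at the full job parallelism even though a single-partition topic is read by one subtask. A minimal sketch, reusing the question's topic name and broker address; RebalanceSketch and the trim() mapping are placeholders:

import java.util.Properties;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class RebalanceSketch {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(4);   // desired job parallelism

        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "localhost:9092");

        DataStream<String> source = env
                .addSource(new FlinkKafkaConsumer09<>("test", new SimpleStringSchema(), properties));

        // With a single-partition topic only one source subtask is active, but rebalance()
        // round-robins the records so the map below still runs on all 4 subtasks.
        source.rebalance()
              .map(new MapFunction<String, String>() {
                  @Override
                  public String map(String value) {
                      return value.trim();   // placeholder for the question's JSON mapping
                  }
              })
              .print();

        env.execute("rebalance sketch");
    }
}

Note that rebalance() adds a network shuffle between the source and the downstream operators, so creating enough Kafka partitions, as the answer suggests, is usually the cleaner fix.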



Source: https://stackoverflow.com/questions/46338574/flink-kafka-how-to-make-app-run-in-parallel
