apache-flink

Flink: how to store state and use in another stream?

Submitted by 不羁岁月 on 2019-12-10 07:06:14
Question: I have a use case for Flink where I need to read information from a file, store each line, and then use this state to filter another stream. I have all of this working right now with the connect operator and a RichCoFlatMapFunction, but it feels overly complicated. Also, I'm concerned that flatMap2 could begin executing before all of the state is loaded from the file:

fileStream
    .connect(partRecordStream.keyBy((KeySelector<PartRecord, String>) partRecord -> partRecord.getPartId()))
    .keyBy(
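
The ordering concern can be handled inside the RichCoFlatMapFunction itself. Below is a minimal sketch of that pattern, assuming hypothetical FileLine and PartRecord types that share a partId key: records arriving on the second input before the matching file line are parked in keyed ListState and flushed once the file side catches up.

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

public class EnrichmentFunction
        extends RichCoFlatMapFunction<FileLine, PartRecord, PartRecord> {

    private ValueState<FileLine> referenceState;   // file line for this partId, if seen yet
    private ListState<PartRecord> bufferedRecords; // records that arrived before the file line

    @Override
    public void open(Configuration parameters) {
        referenceState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("reference", FileLine.class));
        bufferedRecords = getRuntimeContext().getListState(
                new ListStateDescriptor<>("buffered", PartRecord.class));
    }

    @Override
    public void flatMap1(FileLine line, Collector<PartRecord> out) throws Exception {
        referenceState.update(line);
        // the file side has arrived: flush anything that was waiting for it
        for (PartRecord record : bufferedRecords.get()) {
            out.collect(record);
        }
        bufferedRecords.clear();
    }

    @Override
    public void flatMap2(PartRecord record, Collector<PartRecord> out) throws Exception {
        if (referenceState.value() != null) {
            out.collect(record);         // state already loaded: emit immediately
        } else {
            bufferedRecords.add(record); // state not loaded yet: buffer
        }
    }
}

An alternative worth noting is the broadcast state pattern discussed further down this page, which avoids the hand-rolled buffering.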

How to handle errors in custom MapFunction correctly?

Submitted by 陌路散爱 on 2019-12-10 03:05:23
Question: I have implemented a MapFunction for my Apache Flink flow. It parses incoming elements and converts them to another format, but sometimes an error can occur (i.e. the incoming data is not valid). I see two possible ways to handle it:
1. Ignore invalid elements, but it seems I can't ignore errors, because for every incoming element I must provide an outgoing element.
2. Split the incoming elements into valid and invalid, but it seems I would need another function for this.
So, I have two questions: How to handle
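
The standard way around MapFunction's one-in/one-out constraint is a ProcessFunction with a side output for the invalid elements. A minimal sketch, assuming a hypothetical Parsed result type and parse() helper:

import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class SafeParser extends ProcessFunction<String, Parsed> {

    // anonymous subclass so Flink can capture the element type at runtime
    public static final OutputTag<String> INVALID = new OutputTag<String>("invalid-records") {};

    @Override
    public void processElement(String raw, Context ctx, Collector<Parsed> out) {
        try {
            out.collect(parse(raw));   // valid element: emit downstream
        } catch (Exception e) {
            ctx.output(INVALID, raw);  // invalid element: route to the side output
        }
    }

    private Parsed parse(String raw) throws Exception {
        // hypothetical parsing/conversion logic that throws on invalid input
        return new Parsed(raw);
    }
}

The rejected raw elements are then available via getSideOutput(SafeParser.INVALID) on the resulting stream.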

How to support multiple KeyBy in Flink

Submitted by 筅森魡賤 on 2019-12-09 18:22:21
Question: In the code sample below, I am trying to take a stream of employee records {Country, Employer, Name, Salary, Age} and emit the highest-paid employee in every country. Unfortunately, keying by multiple fields doesn't work: only keyBy(Employer) takes effect, so I don't get the correct result. What am I missing?

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Employee> streamEmployee = env.addSource(
        new FlinkKafkaConsumer010<ObjectNode>("flink-demo", new
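
Chained keyBy calls do not compose: each keyBy re-partitions the stream and replaces the previous key, so only the last one is in effect. The usual fix is a single keyBy with a composite key. A sketch, assuming an Employee POJO with the getters used below; note that for the highest-paid employee per country specifically, keying by country alone and taking maxBy("salary") is sufficient.

import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;

streamEmployee
        // one keyBy carrying both fields as a composite key
        .keyBy(new KeySelector<Employee, Tuple2<String, String>>() {
            @Override
            public Tuple2<String, String> getKey(Employee e) {
                return Tuple2.of(e.getCountry(), e.getEmployer());
            }
        })
        .maxBy("salary")
        .print();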

Apache Flink: Where do State Backends keep the state?

Submitted by 馋奶兔 on 2019-12-09 14:16:20
Question: I came across the statement below: "Depending on your state backend, Flink can also manage the state for the application, meaning Flink deals with the memory management (possibly spilling to disk if necessary) to allow applications to hold very large state." https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/state/state_backends.html Does it mean that only when the state backend is configured as RocksDBStateBackend, the state is kept in memory and possibly spilled to disk if
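
For context, the three built-in backends differ mainly in where live state lives: MemoryStateBackend and FsStateBackend both keep working state as objects on the JVM heap (they differ only in where checkpoints are written), while RocksDBStateBackend keeps working state in local RocksDB files on disk and is therefore the one that can hold state larger than memory. A sketch of selecting it programmatically; the checkpoint URI is a placeholder:

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BackendExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // working state lives in local RocksDB files; checkpoints go to the URI below
        env.setStateBackend(new RocksDBStateBackend("hdfs://namenode:8020/flink/checkpoints"));
        // ... define sources/operators and call env.execute()
    }
}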

Tensorflow transform on beams with flink runner

Submitted by 戏子无情 on 2019-12-08 13:08:32
It may seem stupid, but this is my very first post here, so sorry if I'm doing anything wrong. I am currently building a simple ML pipeline with TFX 0.11 (i.e. tfdv-tft-tfserving) and tensorflow 1.11, using python2.7. I have an apache-flink cluster and I want to use it for TFX. I know the framework behind TFX is apache-beam 2.8, which currently supports flink from the python SDK through a portable runner layer. But the problem is how I can code in TFX (tfdv-tft) using apache-beam with the flink runner through this portable runner concept, as TFX currently seems to only support

Why “broadcast state” can store the dynamic rules however broadcast() operator cannot?

Submitted by 久未见 on 2019-12-08 12:28:38
Question: I was confused about the difference between "broadcast state" and the broadcast() operator, and finally got help from a Flink expert in the following thread: What does it mean that "broadcast state" unblocks the implementation of the "dynamic patterns" feature for Flink's CEP library? In the end the conclusion seems to be that "broadcast state" can store the dynamic rules in the keyed stream via RichCoFlatMap, whereas the broadcast() operator cannot, so may I know how "broadcast state" stores the
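
The difference in one line: broadcast() alone only replicates elements across all parallel instances, while the broadcast state pattern (Flink 1.5+) additionally gives the receiving operator a checkpointed, writable state in which those elements are kept. A sketch, assuming ruleStream and eventStream DataStreams and hypothetical Rule, Event, and Alert types:

import java.util.Map;

import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction;
import org.apache.flink.util.Collector;

MapStateDescriptor<String, Rule> ruleStateDescriptor =
        new MapStateDescriptor<>("rules", String.class, Rule.class);

BroadcastStream<Rule> ruleBroadcastStream = ruleStream.broadcast(ruleStateDescriptor);

eventStream
        .keyBy(event -> event.getKey())
        .connect(ruleBroadcastStream)
        .process(new KeyedBroadcastProcessFunction<String, Event, Rule, Alert>() {
            @Override
            public void processBroadcastElement(Rule rule, Context ctx, Collector<Alert> out)
                    throws Exception {
                // broadcast side: writable state, so dynamic rules can be stored/updated
                ctx.getBroadcastState(ruleStateDescriptor).put(rule.getId(), rule);
            }

            @Override
            public void processElement(Event event, ReadOnlyContext ctx, Collector<Alert> out)
                    throws Exception {
                // keyed side: read-only view of the same rules
                for (Map.Entry<String, Rule> entry :
                        ctx.getBroadcastState(ruleStateDescriptor).immutableEntries()) {
                    // ... match event against entry.getValue() and emit alerts
                }
            }
        });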

Reading from multiple broker kafka with flink

Submitted by 淺唱寂寞╮ on 2019-12-08 11:04:42
Question: I want to read from multiple Kafka brokers in Flink. I have a cluster of 3 machines for Kafka, with the following topic:

Topic:myTopic   PartitionCount:3   ReplicationFactor:1   Configs:
    Topic: myTopic   Partition: 0   Leader: 2   Replicas: 2   Isr: 2
    Topic: myTopic   Partition: 1   Leader: 0   Replicas: 0   Isr: 0
    Topic: myTopic   Partition: 2   Leader: 1   Replicas: 1   Isr: 1

From Flink I execute the following code:

Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "x.x.x.x:9092,x.x.x.x:9092,x
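
A consumer needs only a reachable subset of brokers in bootstrap.servers to discover the whole cluster, though listing all three is common practice. A sketch of a FlinkKafkaConsumer010 reading the 3-partition topic; host names are placeholders, and with source parallelism 3 each subtask is assigned one partition:

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;

Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");
properties.setProperty("group.id", "flink-consumer");

DataStream<String> stream = env
        .addSource(new FlinkKafkaConsumer010<>("myTopic", new SimpleStringSchema(), properties))
        .setParallelism(3);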

Calculate totals and emit periodically in flink

Submitted by 血红的双手。 on 2019-12-08 09:52:08
Question: I have a stream of events about resources that looks like this:

id, type, count
1, view, 1
1, download, 3
2, view, 1
3, view, 1
1, download, 2
3, view, 1

I am trying to produce stats (totals) per resource, so for a stream like the one above the result should be:

id, views, downloads
1, 1, 5
2, 1, 0
3, 2, 0

Now I wrote a ProcessFunction that calculates the totals like this:

public class CountTotals extends ProcessFunction<Event, ResourceTotals> {
    private ValueState<ResourceTotals> totalsState;
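
One way to complete this: keep the running totals in keyed state and use processing-time timers to emit them periodically. The sketch below reconstructs CountTotals as a KeyedProcessFunction under assumed Event and ResourceTotals types; their constructors and accessors are guesses, not the question's actual code.

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class CountTotals extends KeyedProcessFunction<String, Event, ResourceTotals> {

    private static final long EMIT_INTERVAL_MS = 60_000; // emit once a minute

    private ValueState<ResourceTotals> totalsState;

    @Override
    public void open(Configuration parameters) {
        totalsState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("totals", ResourceTotals.class));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<ResourceTotals> out)
            throws Exception {
        ResourceTotals totals = totalsState.value();
        if (totals == null) {
            totals = new ResourceTotals(ctx.getCurrentKey());
            // first event for this key: schedule the first periodic emission
            ctx.timerService().registerProcessingTimeTimer(
                    ctx.timerService().currentProcessingTime() + EMIT_INTERVAL_MS);
        }
        totals.add(event.getType(), event.getCount());
        totalsState.update(totals);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<ResourceTotals> out)
            throws Exception {
        out.collect(totalsState.value());
        // re-register so the totals keep being emitted periodically
        ctx.timerService().registerProcessingTimeTimer(timestamp + EMIT_INTERVAL_MS);
    }
}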

How to sort the union datastream of flink without watermark

Submitted by 霸气de小男生 on 2019-12-08 08:01:42
Question: My Flink flow has multiple data streams, which I merge with the org.apache.flink.streaming.api.datastream.DataStream#union method. The problem is that the resulting data stream is out of order, and I cannot set a window to sort the data in the stream. From "Sorting union of streams to identify user sessions in Apache Flink" I got the answer, but com.liam.learn.flink.example.union.UnionStreamDemo.SortFunction#onTimer is never invoked. Environment info: Flink version 1.7.0. In general, I
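
Event-time timers only fire when the watermark passes their timestamp, and a union of streams that never had timestamps and watermarks assigned never advances the watermark, so onTimer stays silent. A sketch of the usual fix, assuming a hypothetical MyEvent type with a getEventTime() accessor:

import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

// assign timestamps and watermarks after the union so the merged stream
// carries a watermark that can trigger event-time timers downstream
DataStream<MyEvent> merged = streamA.union(streamB, streamC)
        .assignTimestampsAndWatermarks(
                new BoundedOutOfOrdernessTimestampExtractor<MyEvent>(Time.seconds(10)) {
                    @Override
                    public long extractTimestamp(MyEvent event) {
                        return event.getEventTime();
                    }
                });

With the watermark advancing, the SortFunction's timers fire and the buffered elements can be emitted in order.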

Flink: join file with kafka stream

Submitted by 别等时光非礼了梦想. on 2019-12-08 07:54:47
Question: I have a problem I can't really figure out. I have a kafka stream that contains data like this:

{"adId":"9001", "eventAction":"start", "eventType":"track", "eventValue":"", "timestamp":"1498118549550"}

I want to replace 'adId' with another value, 'bookingId'. This value is located in a csv file, but I can't really figure out how to get it working. Here is my mapping csv file:

9001;8
9002;10

So my output would ideally be something like:

{"bookingId":"8", "eventAction":"start",
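
When the mapping file is small and static, the simplest approach is to load it into a map once in open() and rewrite every event in a RichMapFunction. A sketch using Jackson's ObjectNode for the events; the file path is a placeholder:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

import com.fasterxml.jackson.databind.node.ObjectNode;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

public class AdIdToBookingId extends RichMapFunction<ObjectNode, ObjectNode> {

    private transient Map<String, String> adToBooking;

    @Override
    public void open(Configuration parameters) throws Exception {
        // load the adId -> bookingId mapping once per task instance
        adToBooking = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get("/path/to/mapping.csv"))) {
            String[] parts = line.split(";"); // e.g. "9001;8"
            adToBooking.put(parts[0], parts[1]);
        }
    }

    @Override
    public ObjectNode map(ObjectNode event) {
        String bookingId = adToBooking.get(event.get("adId").asText());
        event.remove("adId");
        event.put("bookingId", bookingId);
        return event;
    }
}

If the mapping can change over time, the broadcast state pattern from the earlier question on this page is the more robust alternative.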