apache-flink

Example of raw vs managed state

烈酒焚心 submitted on 2021-02-20 03:39:32

Question: I am trying to understand the difference between raw and managed state. From the docs: Keyed State and Operator State exist in two forms: managed and raw. Managed State is represented in data structures controlled by the Flink runtime, such as internal hash tables, or RocksDB. Examples are “ValueState”, “ListState”, etc. Flink’s runtime encodes the states and writes them into the checkpoints. Raw State is state that operators keep in their own data structures. When checkpointed, they only
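To make the contrast concrete, here is a minimal, dependency-free Java sketch (the classes and method names are invented for illustration, they are not Flink APIs): a managed-style state is a structure the runtime owns and can snapshot itself, while a raw-style operator keeps its own structure and only hands the runtime opaque bytes at checkpoint time.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StateSketch {
    // Managed-style: the "runtime" owns a per-key map and can encode it itself.
    static class ManagedValueState {
        private final Map<String, Long> byKey = new HashMap<>();
        void update(String key, long v) { byKey.put(key, v); }
        Long value(String key) { return byKey.get(key); }
        Map<String, Long> snapshot() { return new HashMap<>(byKey); } // runtime-encoded
    }

    // Raw-style: the operator keeps its own structure and only hands back bytes.
    static class RawStateOperator {
        private final List<Long> buffer = new ArrayList<>();
        void process(long v) { buffer.add(v); }
        byte[] snapshot() {                               // operator-encoded
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            try {
                out.writeInt(buffer.size());
                for (long v : buffer) out.writeLong(v);
            } catch (IOException e) { throw new UncheckedIOException(e); }
            return bos.toByteArray();
        }
    }

    public static void main(String[] args) {
        ManagedValueState managed = new ManagedValueState();
        managed.update("user-1", 500L);
        System.out.println(managed.snapshot());  // {user-1=500}

        RawStateOperator raw = new RawStateOperator();
        raw.process(500L);
        System.out.println(raw.snapshot().length); // 12 (4-byte count + 8-byte long)
    }
}
```

The runtime can rescale, migrate, and inspect the managed map; the raw bytes it can only copy around, which is why the docs steer user code toward managed state.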

How does Flink scale for hot partitions?

ε祈祈猫儿з submitted on 2021-02-20 02:46:54

Question: If I have a use case where I need to join two streams or aggregate metrics from a single stream, and I use keyed streams to partition the events, how does Flink handle the operations for hot partitions where the data might not fit into memory and needs to be split across partitions? Source: https://stackoverflow.com/questions/66273158/how-does-flink-scale-for-hot-partitions
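Flink routes every record with the same key to a single subtask, so one hot key cannot be split automatically; the usual workaround is two-phase ("salted") aggregation, where a salt spreads a hot key over several sub-keys for pre-aggregation and a second stage merges the partials. A dependency-free Java sketch (names are invented for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class HotKeySketch {
    // Phase 1 (pre-aggregation): append a salt so one hot key is spread
    // over fanOut sub-keys, each of which can land on a different subtask.
    static String salted(String key, long eventId, int fanOut) {
        return key + "#" + Math.floorMod(Long.hashCode(eventId), fanOut);
    }

    // Phase 2 (final aggregation): strip the salt and merge partial sums.
    static Map<String, Long> merge(Map<String, Long> partials) {
        Map<String, Long> totals = new HashMap<>();
        partials.forEach((k, v) ->
                totals.merge(k.substring(0, k.indexOf('#')), v, Long::sum));
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Long> partials = new HashMap<>();
        for (long id = 0; id < 6; id++) {
            // count 6 events for one hot key, fanned out over 3 sub-keys
            partials.merge(salted("hot-device", id, 3), 1L, Long::sum);
        }
        System.out.println(partials.size() + " sub-keys"); // 3 sub-keys
        System.out.println(merge(partials));               // {hot-device=6}
    }
}
```

State that does not fit in memory is a separate concern: the RocksDB state backend spills keyed state to disk, but skew still concentrates load on one subtask unless the key space is reshaped as above.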

Flink program behaves differently at different parallelism

こ雲淡風輕ζ submitted on 2021-02-19 08:55:07

Question: I am using Flink 1.4.1 with CEP. I have to calculate the lifetime order amount for the same user on each order. So I send orders Order A -> amount: 500, Order B -> amount: 200, Order C -> amount: 300, keying by user and accumulating with keyed state. Sometimes Order B shows 700 and sometimes 200; that is, Order A's amount is sometimes added into Order B and sometimes not. I am running the job with parallelism 6. Is this a parallelism issue or a distributed-state issue? When I run the whole program with
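One way to reason about this: with a proper keyBy on the user, all of that user's orders reach the same subtask and the same keyed state, so the running total is deterministic at any parallelism; intermittent results usually mean the stateful operator is not keyed (or is keyed on the wrong field), so records are spread round-robin across subtasks. A dependency-free Java sketch of why keyed totals do not depend on parallelism (names are invented for illustration):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KeyedStateSketch {
    // Simulates keyed state: every order for the same user is routed by key
    // hash to one subtask, whose local map holds that user's running total.
    static Map<String, Long> runningTotals(List<String[]> orders, int parallelism) {
        List<Map<String, Long>> subtaskState = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) subtaskState.add(new HashMap<>());
        Map<String, Long> lastSeen = new HashMap<>();
        for (String[] order : orders) {               // order = {user, amount}
            int subtask = Math.floorMod(order[0].hashCode(), parallelism);
            Map<String, Long> state = subtaskState.get(subtask);
            long total = state.merge(order[0], Long.parseLong(order[1]), Long::sum);
            lastSeen.put(order[0], total);
        }
        return lastSeen;
    }

    public static void main(String[] args) {
        List<String[]> orders = Arrays.asList(
                new String[]{"alice", "500"},
                new String[]{"alice", "200"},
                new String[]{"alice", "300"});
        // Same lifetime total whatever the parallelism, because the key pins
        // all of alice's orders to one subtask's state:
        System.out.println(runningTotals(orders, 1)); // {alice=1000}
        System.out.println(runningTotals(orders, 6)); // {alice=1000}
    }
}
```

If the real job shows different totals at parallelism 6 than at 1, check that the state access happens inside a keyed context (after keyBy on the user id) rather than in operator-local fields.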

Flink Checkpoint Failure - Checkpoints time out after 10 mins

不问归期 submitted on 2021-02-19 04:25:07

Question: We get one or two checkpoint failures per day while processing data. The data volume is low (under 10k records) and our checkpoint interval is set to 2 minutes. (Processing is slow because at the end of the Flink job we sink the data to an external API endpoint, which takes time; the total time is the streaming work plus the sink to the external API endpoint.) The root issue: checkpoints time out after 10 minutes because the data processing takes longer than 10 minutes, so the checkpoint
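Assuming the job uses the standard StreamExecutionEnvironment, the checkpoint timeout and related knobs can be raised so that slow end-to-end processing does not fail checkpoints. A configuration sketch (values are examples only; adjust to the job):

```java
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(120_000L);            // 2-minute interval, as in the question
CheckpointConfig conf = env.getCheckpointConfig();
conf.setCheckpointTimeout(30 * 60 * 1000L);   // raise the 10-minute default
conf.setMaxConcurrentCheckpoints(1);          // never stack a new one on a slow one
conf.setMinPauseBetweenCheckpoints(60_000L);  // give the job room between checkpoints
```

Raising the timeout only treats the symptom; since the slowness comes from a blocking external API call, decoupling that call with Flink's async I/O (AsyncDataStream) usually addresses the cause by keeping records from backing up in front of the checkpoint barriers.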

Flink taskmanager out of memory and memory configuration

馋奶兔 submitted on 2021-02-18 17:37:09

Question: We are using Flink streaming to run a few jobs on a single cluster. Our jobs use RocksDB to hold state. The cluster is configured with a single JobManager and 3 TaskManagers on 3 separate VMs. Each TM is configured to run with 14GB of RAM and the JM with 1GB. We are experiencing two memory-related issues: (1) When running a TaskManager with an 8GB heap allocation, the TM ran out of heap memory and we got a heap out-of-memory exception. Our solution to this problem was
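With the unified memory model introduced in Flink 1.10, the whole TaskManager budget, including RocksDB's off-heap usage, can be capped explicitly in flink-conf.yaml instead of sizing only the heap. A sketch with example sizes only (tune the fraction to how much state the jobs hold):

```yaml
# flink-conf.yaml (Flink 1.10+ unified memory model; sizes are examples only)
taskmanager.memory.process.size: 14g        # total budget for the whole TM process
taskmanager.memory.managed.fraction: 0.4    # off-heap managed share, used by RocksDB
state.backend: rocksdb
state.backend.rocksdb.memory.managed: true  # cap RocksDB at the managed budget
```

Before 1.10, RocksDB allocated native memory outside any Flink-enforced limit, which is a common cause of TMs being killed for exceeding container or VM memory even though the JVM heap looks healthy.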

How to unit test BroadcastProcessFunction in flink when processElement depends on broadcasted data

坚强是说给别人听的谎言 submitted on 2021-02-17 02:29:12

Question: I implemented a Flink stream with a BroadcastProcessFunction. From processBroadcastElement I get my model, and I apply it to my events in processElement. I can't find a way to unit test my stream because I can't ensure the model is dispatched before the first event. I would say there are two ways to achieve this: 1. Find a way to have the model pushed into the stream first. 2. Have the broadcast state filled with the model prior to the execution of the stream, so that it
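Flink ships operator test harnesses in the flink-streaming-java test artifacts that let a unit test drive the two callbacks directly, so the broadcast element can be delivered before the first data element; the exact harness names vary across versions, so here is a dependency-free Java sketch of the idea (class names are invented for illustration), which is a harness in miniature: call the broadcast side first, then the data side.

```java
import java.util.Collections;
import java.util.Map;

public class BroadcastTestSketch {
    // Stand-in for a BroadcastProcessFunction: a model arrives on the
    // broadcast side and is applied to every data element that follows.
    static class ScoringFunction {
        private Map<String, Integer> model = Collections.emptyMap();
        void processBroadcastElement(Map<String, Integer> newModel) {
            model = newModel;                       // fill the "broadcast state"
        }
        int processElement(String event) {
            return model.getOrDefault(event, -1);   // -1 = model not seen yet
        }
    }

    public static void main(String[] args) {
        ScoringFunction fn = new ScoringFunction();
        // The "harness": invoke the broadcast callback before any data element,
        // which is exactly the ordering the unit test needs to guarantee.
        fn.processBroadcastElement(Map.of("click", 10));
        System.out.println(fn.processElement("click")); // 10
        System.out.println(fn.processElement("view"));  // -1
    }
}
```

In a full-pipeline integration test, where element order between the two inputs cannot be controlled, the usual alternative is option 2 from the question: buffer early events in state until the model has arrived.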

How to sort a dataset in Apache Flink?

你离开我真会死。 submitted on 2021-02-16 16:48:10

Question: I have a tuple DataSet of the form DataSet<Tuple2<String, Long>>. I wish to sort the entire DataSet on the String field and then write only the Long values to a file. Flink does provide sortPartition, but that alone does not help here because I need the DataSet sorted globally.

Answer 1: You can still use sortPartition() to sort the complete DataSet if you set the parallelism to 1:

DataSet<Tuple2<String, Long>> data = ...
DataSet<Tuple2<String, Long>> sorted = data
    .sortPartition(0, Order.ASCENDING)
    .setParallelism(1); // sort in

Using Broadcast State To Force Window Closure Using Fake Messages

China☆狼群 submitted on 2021-02-11 15:31:47

Question: Description: Currently I am working on using Flink with an IoT setup. Essentially, devices send data such as (device_id, device_type, event_timestamp, etc.) and I don't have any control over when the messages get sent. I then key the stream by device_id and device_type to perform aggregations. I would like to use event time, given that it ensures the timers that are set fire deterministically after a failure. However, given that this isn't always a high throughput stream a
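To illustrate why a fake message helps here: an event-time window only fires once the watermark passes its end, and on a sparse stream the watermark stalls, leaving the window open indefinitely. A periodic heartbeat record that only advances event time (carrying no data) closes the window. A dependency-free Java sketch (names are invented for illustration):

```java
import java.util.ArrayList;
import java.util.List;

public class HeartbeatSketch {
    long watermark = Long.MIN_VALUE;
    Long openWindowEnd = null;          // end of the single open window, if any
    List<Long> fired = new ArrayList<>();

    // Real device record: opens a window and advances event time.
    void onData(long ts, long windowSizeMs) {
        if (openWindowEnd == null) {
            openWindowEnd = (ts / windowSizeMs + 1) * windowSizeMs;
        }
        advance(ts);
    }

    // Fake/heartbeat message: advances event time but adds no data.
    void onHeartbeat(long ts) { advance(ts); }

    private void advance(long ts) {
        watermark = Math.max(watermark, ts);
        if (openWindowEnd != null && watermark >= openWindowEnd) {
            fired.add(openWindowEnd);   // window closes only when the
            openWindowEnd = null;       // watermark passes its end
        }
    }

    public static void main(String[] args) {
        HeartbeatSketch s = new HeartbeatSketch();
        s.onData(1_000, 5_000);       // opens a window ending at t=5000
        System.out.println(s.fired);  // [] -- stalls with no further traffic
        s.onHeartbeat(6_000);         // fake message pushes the watermark past 5000
        System.out.println(s.fired);  // [5000]
    }
}
```

In Flink itself, if the sources (rather than individual keys) are sparse, marking them idle via WatermarkStrategy's withIdleness (Flink 1.11+) can achieve the same effect without injecting records.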
