apache-flink

Example of raw vs managed state

烈酒焚心 submitted on 2021-02-20 03:39:32

Question: I am trying to understand the difference between raw and managed state. From the docs: Keyed State and Operator State exist in two forms: managed and raw. Managed State is represented in data structures controlled by the Flink runtime, such as internal hash tables, or RocksDB. Examples are “ValueState”, “ListState”, etc. Flink’s runtime encodes the states and writes them into the checkpoints. Raw State is state that operators keep in their own data structures. When checkpointed, they only
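To make the contrast concrete, here is a minimal, dependency-free Java sketch (the classes and method names are invented for illustration, they are not Flink APIs): a managed-style state is a structure the runtime owns and can snapshot itself, while a raw-style operator keeps its own structure and only hands the runtime opaque bytes at checkpoint time.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StateSketch {
    // Managed-style: the "runtime" owns a per-key map and can encode it itself.
    static class ManagedValueState {
        private final Map<String, Long> byKey = new HashMap<>();
        void update(String key, long v) { byKey.put(key, v); }
        Long value(String key) { return byKey.get(key); }
        Map<String, Long> snapshot() { return new HashMap<>(byKey); } // runtime-encoded
    }

    // Raw-style: the operator keeps its own structure and only hands back bytes.
    static class RawStateOperator {
        private final List<Long> buffer = new ArrayList<>();
        void process(long v) { buffer.add(v); }
        byte[] snapshot() {                               // operator-encoded
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            try {
                out.writeInt(buffer.size());
                for (long v : buffer) out.writeLong(v);
            } catch (IOException e) { throw new UncheckedIOException(e); }
            return bos.toByteArray();
        }
    }

    public static void main(String[] args) {
        ManagedValueState managed = new ManagedValueState();
        managed.update("user-1", 500L);
        System.out.println(managed.snapshot());  // {user-1=500}

        RawStateOperator raw = new RawStateOperator();
        raw.process(500L);
        System.out.println(raw.snapshot().length); // 12 (4-byte count + 8-byte long)
    }
}
```

The runtime can rescale, migrate, and inspect the managed map; the raw bytes it can only copy around, which is why the docs steer user code toward managed state.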

How does Flink scale for hot partitions?

ε祈祈猫儿з submitted on 2021-02-20 02:46:54

Question: If I have a use case where I need to join two streams or aggregate metrics from a single stream, and I use keyed streams to partition the events, how does Flink handle the operations for hot partitions where the data might not fit into memory and needs to be split across partitions? Source: https://stackoverflow.com/questions/66273158/how-does-flink-scale-for-hot-partitions
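Flink routes every record with the same key to a single subtask, so one hot key cannot be split automatically; the usual workaround is two-phase ("salted") aggregation, where a salt spreads a hot key over several sub-keys for pre-aggregation and a second stage merges the partials. A dependency-free Java sketch (names are invented for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class HotKeySketch {
    // Phase 1 (pre-aggregation): append a salt so one hot key is spread
    // over fanOut sub-keys, each of which can land on a different subtask.
    static String salted(String key, long eventId, int fanOut) {
        return key + "#" + Math.floorMod(Long.hashCode(eventId), fanOut);
    }

    // Phase 2 (final aggregation): strip the salt and merge partial sums.
    static Map<String, Long> merge(Map<String, Long> partials) {
        Map<String, Long> totals = new HashMap<>();
        partials.forEach((k, v) ->
                totals.merge(k.substring(0, k.indexOf('#')), v, Long::sum));
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Long> partials = new HashMap<>();
        for (long id = 0; id < 6; id++) {
            // count 6 events for one hot key, fanned out over 3 sub-keys
            partials.merge(salted("hot-device", id, 3), 1L, Long::sum);
        }
        System.out.println(partials.size() + " sub-keys"); // 3 sub-keys
        System.out.println(merge(partials));               // {hot-device=6}
    }
}
```

State that does not fit in memory is a separate concern: the RocksDB state backend spills keyed state to disk, but skew still concentrates load on one subtask unless the key space is reshaped as above.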

Flink program behaves differently at different parallelism

こ雲淡風輕ζ submitted on 2021-02-19 08:55:07

Question: I am using Flink 1.4.1 with CEP. I have to calculate the lifetime order amount for the same user on each order. So I send orders Order A -> amount: 500, Order B -> amount: 200, Order C -> amount: 300, keying by user and accumulating with keyed state. Sometimes Order B shows 700 and sometimes 200; that is, Order A's amount is sometimes added into Order B and sometimes not. I am running the job with parallelism 6. Is this a parallelism issue or a distributed-state issue? When I run the whole program with
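One way to reason about this: with a proper keyBy on the user, all of that user's orders reach the same subtask and the same keyed state, so the running total is deterministic at any parallelism; intermittent results usually mean the stateful operator is not keyed (or is keyed on the wrong field), so records are spread round-robin across subtasks. A dependency-free Java sketch of why keyed totals do not depend on parallelism (names are invented for illustration):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KeyedStateSketch {
    // Simulates keyed state: every order for the same user is routed by key
    // hash to one subtask, whose local map holds that user's running total.
    static Map<String, Long> runningTotals(List<String[]> orders, int parallelism) {
        List<Map<String, Long>> subtaskState = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) subtaskState.add(new HashMap<>());
        Map<String, Long> lastSeen = new HashMap<>();
        for (String[] order : orders) {               // order = {user, amount}
            int subtask = Math.floorMod(order[0].hashCode(), parallelism);
            Map<String, Long> state = subtaskState.get(subtask);
            long total = state.merge(order[0], Long.parseLong(order[1]), Long::sum);
            lastSeen.put(order[0], total);
        }
        return lastSeen;
    }

    public static void main(String[] args) {
        List<String[]> orders = Arrays.asList(
                new String[]{"alice", "500"},
                new String[]{"alice", "200"},
                new String[]{"alice", "300"});
        // Same lifetime total whatever the parallelism, because the key pins
        // all of alice's orders to one subtask's state:
        System.out.println(runningTotals(orders, 1)); // {alice=1000}
        System.out.println(runningTotals(orders, 6)); // {alice=1000}
    }
}
```

If the real job shows different totals at parallelism 6 than at 1, check that the state access happens inside a keyed context (after keyBy on the user id) rather than in operator-local fields.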

Flink Checkpoint Failure - Checkpoints time out after 10 mins

不问归期 submitted on 2021-02-19 04:25:07

Question: We get one or two checkpoint failures per day while processing data. The data volume is low (under 10k records) and our checkpoint interval is set to 2 minutes. (Processing is slow because at the end of the Flink job we sink the data to an external API endpoint, which takes time; the total time is the streaming work plus the sink to the external API endpoint.) The root issue: checkpoints time out after 10 minutes because the data processing takes longer than 10 minutes, so the checkpoint
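Assuming the job uses the standard StreamExecutionEnvironment, the checkpoint timeout and related knobs can be raised so that slow end-to-end processing does not fail checkpoints. A configuration sketch (values are examples only; adjust to the job):

```java
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(120_000L);            // 2-minute interval, as in the question
CheckpointConfig conf = env.getCheckpointConfig();
conf.setCheckpointTimeout(30 * 60 * 1000L);   // raise the 10-minute default
conf.setMaxConcurrentCheckpoints(1);          // never stack a new one on a slow one
conf.setMinPauseBetweenCheckpoints(60_000L);  // give the job room between checkpoints
```

Raising the timeout only treats the symptom; since the slowness comes from a blocking external API call, decoupling that call with Flink's async I/O (AsyncDataStream) usually addresses the cause by keeping records from backing up in front of the checkpoint barriers.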

Flink taskmanager out of memory and memory configuration

馋奶兔 submitted on 2021-02-18 17:37:09

Question: We are using Flink streaming to run a few jobs on a single cluster. Our jobs use RocksDB to hold state. The cluster is configured with a single JobManager and 3 TaskManagers on 3 separate VMs. Each TM is configured to run with 14GB of RAM and the JM with 1GB. We are experiencing two memory-related issues: (1) When running a TaskManager with an 8GB heap allocation, the TM ran out of heap memory and we got a heap out-of-memory exception. Our solution to this problem was
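With the unified memory model introduced in Flink 1.10, the whole TaskManager budget, including RocksDB's off-heap usage, can be capped explicitly in flink-conf.yaml instead of sizing only the heap. A sketch with example sizes only (tune the fraction to how much state the jobs hold):

```yaml
# flink-conf.yaml (Flink 1.10+ unified memory model; sizes are examples only)
taskmanager.memory.process.size: 14g        # total budget for the whole TM process
taskmanager.memory.managed.fraction: 0.4    # off-heap managed share, used by RocksDB
state.backend: rocksdb
state.backend.rocksdb.memory.managed: true  # cap RocksDB at the managed budget
```

Before 1.10, RocksDB allocated native memory outside any Flink-enforced limit, which is a common cause of TMs being killed for exceeding container or VM memory even though the JVM heap looks healthy.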

How to unit test BroadcastProcessFunction in flink when processElement depends on broadcasted data

坚强是说给别人听的谎言 submitted on 2021-02-17 02:29:12

Question: I implemented a Flink stream with a BroadcastProcessFunction. From processBroadcastElement I get my model, and I apply it to my events in processElement. I can't find a way to unit test my stream because I can't ensure the model is dispatched before the first event. I would say there are two ways to achieve this: 1. Find a way to have the model pushed into the stream first. 2. Have the broadcast state filled with the model prior to the execution of the stream, so that it
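Flink ships operator test harnesses in the flink-streaming-java test artifacts that let a unit test drive the two callbacks directly, so the broadcast element can be delivered before the first data element; the exact harness names vary across versions, so here is a dependency-free Java sketch of the idea (class names are invented for illustration), which is a harness in miniature: call the broadcast side first, then the data side.

```java
import java.util.Collections;
import java.util.Map;

public class BroadcastTestSketch {
    // Stand-in for a BroadcastProcessFunction: a model arrives on the
    // broadcast side and is applied to every data element that follows.
    static class ScoringFunction {
        private Map<String, Integer> model = Collections.emptyMap();
        void processBroadcastElement(Map<String, Integer> newModel) {
            model = newModel;                       // fill the "broadcast state"
        }
        int processElement(String event) {
            return model.getOrDefault(event, -1);   // -1 = model not seen yet
        }
    }

    public static void main(String[] args) {
        ScoringFunction fn = new ScoringFunction();
        // The "harness": invoke the broadcast callback before any data element,
        // which is exactly the ordering the unit test needs to guarantee.
        fn.processBroadcastElement(Map.of("click", 10));
        System.out.println(fn.processElement("click")); // 10
        System.out.println(fn.processElement("view"));  // -1
    }
}
```

In a full-pipeline integration test, where element order between the two inputs cannot be controlled, the usual alternative is option 2 from the question: buffer early events in state until the model has arrived.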

How to sort a dataset in Apache Flink?

你离开我真会死。 submitted on 2021-02-16 16:48:10

Question: I have a tuple DataSet of the form DataSet<Tuple2<String, Long>>. I wish to sort the entire DataSet on the String field and then write only the Long values to a file. Flink does provide sortPartition, but that alone does not help here because I need the DataSet sorted globally.

Answer 1: You can still use sortPartition() to sort the complete DataSet if you set the parallelism to 1:

DataSet<Tuple2<String, Long>> data = ...
DataSet<Tuple2<String, Long>> sorted = data
    .sortPartition(0, Order.ASCENDING)
    .setParallelism(1); // sort in

Using Broadcast State To Force Window Closure Using Fake Messages

China☆狼群 submitted on 2021-02-11 15:31:47

Question: Description: Currently I am working on using Flink with an IoT setup. Essentially, devices send data such as (device_id, device_type, event_timestamp, etc.) and I don't have any control over when the messages get sent. I then key the stream by device_id and device_type to perform aggregations. I would like to use event time, given that it ensures the timers that are set fire deterministically after a failure. However, given that this isn't always a high throughput stream a
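To illustrate why a fake message helps here: an event-time window only fires once the watermark passes its end, and on a sparse stream the watermark stalls, leaving the window open indefinitely. A periodic heartbeat record that only advances event time (carrying no data) closes the window. A dependency-free Java sketch (names are invented for illustration):

```java
import java.util.ArrayList;
import java.util.List;

public class HeartbeatSketch {
    long watermark = Long.MIN_VALUE;
    Long openWindowEnd = null;          // end of the single open window, if any
    List<Long> fired = new ArrayList<>();

    // Real device record: opens a window and advances event time.
    void onData(long ts, long windowSizeMs) {
        if (openWindowEnd == null) {
            openWindowEnd = (ts / windowSizeMs + 1) * windowSizeMs;
        }
        advance(ts);
    }

    // Fake/heartbeat message: advances event time but adds no data.
    void onHeartbeat(long ts) { advance(ts); }

    private void advance(long ts) {
        watermark = Math.max(watermark, ts);
        if (openWindowEnd != null && watermark >= openWindowEnd) {
            fired.add(openWindowEnd);   // window closes only when the
            openWindowEnd = null;       // watermark passes its end
        }
    }

    public static void main(String[] args) {
        HeartbeatSketch s = new HeartbeatSketch();
        s.onData(1_000, 5_000);       // opens a window ending at t=5000
        System.out.println(s.fired);  // [] -- stalls with no further traffic
        s.onHeartbeat(6_000);         // fake message pushes the watermark past 5000
        System.out.println(s.fired);  // [5000]
    }
}
```

In Flink itself, if the sources (rather than individual keys) are sparse, marking them idle via WatermarkStrategy's withIdleness (Flink 1.11+) can achieve the same effect without injecting records.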
