apache-flink

Using a Cassandra database query as the source for a Flink program

不打扰是莪最后的温柔 submitted on 2020-03-05 04:55:38
Question: I have a Cassandra database whose data has to reach my Flink program over a socket, like a stream, for stream processing. So I wrote a simple client program that reads data from Cassandra and sends it to the socket; I also wrote the Flink program as the server side. In fact, my client program is simple and does not use any Flink instructions; it just sends a Cassandra row in string format to the socket, and the server must receive the row. First, I run the Flink program to listen for the client, and then …
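
A minimal sketch of the server side of such a setup, assuming the client writes one Cassandra row per line to a socket on port 9999; the host, port, and the print sink are illustrative assumptions, not details from the question:

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CassandraRowsOverSocket {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Listen on the socket the client writes to; each line is expected
            // to be one Cassandra row serialized as a string.
            DataStream<String> rows = env.socketTextStream("localhost", 9999, "\n");

            // Placeholder processing: just print each received row.
            rows.print();

            env.execute("Cassandra rows over socket");
        }
    }

Flink keeps the socket source open and processes rows as the client sends them, so the client side only needs a plain socket connection and no Flink dependencies.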

Kinesis Streams and Flink

浪尽此生 submitted on 2020-03-05 00:25:26
Question: I have a question regarding sharding data in a Kinesis stream. I would like to use a random partition key when sending user data to my Kinesis stream so that the data in the shards is evenly distributed. For the sake of making this question simpler, I would then like to aggregate the user data by keying off of a userId in my Flink application. My question is this: if the shards are randomly partitioned so that data for one userId is spread across multiple Kinesis shards, can Flink handle …
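
Flink's keyBy repartitions records by key across the cluster regardless of how they were sharded in Kinesis, so all records for one userId reach the same parallel task. A minimal sketch, assuming records arrive as plain "userId,payload" strings and counting events per user; the stream name, region, and record format are illustrative assumptions:

    import java.util.Properties;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
    import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;

    public class KeyedKinesisCounts {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            Properties consumerConfig = new Properties();
            consumerConfig.setProperty(ConsumerConfigConstants.AWS_REGION, "us-east-1");

            env.addSource(new FlinkKinesisConsumer<>(
                    "user-events", new SimpleStringSchema(), consumerConfig))
                // Assumed record format "userId,payload"; emit (userId, 1) per event.
                .map(record -> Tuple2.of(record.split(",")[0], 1L))
                .returns(Types.TUPLE(Types.STRING, Types.LONG))
                // keyBy shuffles by userId, so all events for one user reach the
                // same task even if they were spread over many Kinesis shards.
                .keyBy(t -> t.f0)
                .sum(1)
                .print();

            env.execute("Keyed aggregation over randomly sharded Kinesis data");
        }
    }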

Why does Apache Flink need Watermarks for Event Time Processing?

空扰寡人 submitted on 2020-02-28 06:53:26
Question: Can someone explain event timestamps and watermarks properly? I understood them from the docs, but it is not so clear. A real-life example or layman's definition would help. Also, if possible, give an example (along with a code snippet that explains it). Thanks in advance. Answer 1: Here's an example that illustrates why we need watermarks, and how they work. In this example we have a stream of timestamped events that arrive somewhat out of order, as shown below. The numbers shown are event-time …
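
A minimal sketch of assigning event timestamps and watermarks with a bounded-out-of-orderness strategy; the three-second bound, the sample data, and the placeholder aggregation are illustrative assumptions, not part of the original answer:

    import java.time.Duration;
    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    public class WatermarkSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // (key, event-time in ms) pairs that arrive slightly out of order.
            env.fromElements(
                    Tuple2.of("a", 1000L), Tuple2.of("a", 4000L),
                    Tuple2.of("a", 2000L), Tuple2.of("a", 9000L))
                // Watermarks trail the highest timestamp seen so far by 3 seconds,
                // so events up to 3 seconds late still land in the right window.
                .assignTimestampsAndWatermarks(
                    WatermarkStrategy
                        .<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                        .withTimestampAssigner((event, ts) -> event.f1))
                .keyBy(t -> t.f0)
                // The window only fires once the watermark passes its end time.
                .window(TumblingEventTimeWindows.of(Time.seconds(5)))
                .sum(1)  // placeholder aggregation over the second field
                .print();

            env.execute("Watermark sketch");
        }
    }

With a finite input like fromElements, a final watermark is emitted when the input ends, so the windows still fire before the job finishes.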

Apache Flink: How to apply multiple counting window functions?

假装没事ソ submitted on 2020-02-26 10:05:27
Question: I have a keyed stream of data and need to compute counts over tumbling windows of different lengths (1 minute, 5 minutes, 1 day, 1 week). Is it possible to compute all four window counts in a single application? Answer 1: Yes, that's possible. If you are using event time, you can simply cascade the windows with increasing time intervals. So you do:

    DataStream<String> data = ...
    // append a Long 1 to each record to count it.
    DataStream<Tuple2<String, Long>> withOnes = data.map(new AppendOne());
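
A sketch of how the cascade might continue from the snippet above; AppendOne, the key selector, and the five-minute stage are illustrative, the same pattern extends to the 1-day and 1-week windows, and the usual TumblingEventTimeWindows and Time imports are assumed:

    // Count per key and 1-minute tumbling event-time window.
    DataStream<Tuple2<String, Long>> countsPerMinute = withOnes
        .keyBy(t -> t.f0)
        .window(TumblingEventTimeWindows.of(Time.minutes(1)))
        .sum(1);

    // The 1-minute results carry their window's end timestamp, so they can be
    // re-windowed into 5-minute counts (and those into 1-day, 1-week, ...).
    DataStream<Tuple2<String, Long>> countsPerFiveMinutes = countsPerMinute
        .keyBy(t -> t.f0)
        .window(TumblingEventTimeWindows.of(Time.minutes(5)))
        .sum(1);

    countsPerMinute.print();
    countsPerFiveMinutes.print();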

Getting Service: AmazonKinesis; Status Code: 502 with apache-flink and localstack Kinesis

北城余情 submitted on 2020-02-25 02:22:47
Question: My local setup consists of local apache-flink (installed via brew) and localstack with the Kinesis service running. My docker-compose has:

    localstack:
      image: localstack/localstack:0.10.7
      environment:
        - SERVICES=kinesis
      ports:
        - "4568:4568"

and my Kinesis consumer:

    kinesisConsumerConfig.setProperty(ConsumerConfigConstants.AWS_ACCESS_KEY_ID, "123");
    kinesisConsumerConfig.setProperty(ConsumerConfigConstants.AWS_SECRET_ACCESS_KEY, "123");
    kinesisConsumerConfig.setProperty(ConsumerConfigConstants…
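
A sketch of pointing the consumer at localstack instead of AWS via the connector's endpoint setting; the dummy credentials and port come from the setup above, while whether this alone clears the 502 depends on the localstack and connector versions (some combinations also need HTTPS or CBOR-related tweaks):

    Properties kinesisConsumerConfig = new Properties();
    kinesisConsumerConfig.setProperty(ConsumerConfigConstants.AWS_ACCESS_KEY_ID, "123");
    kinesisConsumerConfig.setProperty(ConsumerConfigConstants.AWS_SECRET_ACCESS_KEY, "123");
    // Send requests to the localstack Kinesis port instead of the real AWS endpoint.
    // Depending on the connector version, AWS_ENDPOINT is used either alongside or
    // instead of AWS_REGION.
    kinesisConsumerConfig.setProperty(ConsumerConfigConstants.AWS_ENDPOINT, "http://localhost:4568");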

Flink job distribution over cluster nodes

Deadly submitted on 2020-02-24 11:11:28
Question: We have 4 jobs running over 3 nodes with 4 slots each. On Flink 1.3.2 the jobs were evenly distributed across the nodes. After upgrading to Flink 1.5, each job runs on a single node (spilling over to another only when no slots are left). Is there a way to return to an even distribution? The jobs are not even in load, which causes some nodes to work harder than others. Answer 1: An answer I received from the Flink mailing list, "Re: Flink 1.5 job distribution over cluster nodes": Hi Shachar, …

Processing time windows don't work on finite data sources in Apache Flink

两盒软妹~` submitted on 2020-02-04 05:50:07
Question: I'm trying to apply a very simple window function to a finite data stream in Apache Flink (locally, no cluster). Here's the example:

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env
      .fromCollection(List("a", "b", "c", "d", "e"))
      .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(1)))
      .trigger(ProcessingTimeTrigger.create)
      .process(new ProcessAllWindowFunction[String, String, TimeWindow] {
        override def process(context: Context, elements: Iterable[String], out: Collector…

Flink Kinesis Consumer not storing last successfully processed sequence nos

*爱你&永不变心* submitted on 2020-02-03 16:45:11
Question: We are using the Flink Kinesis Consumer to consume data from a Kinesis stream into our Flink application. The KCL library uses a DynamoDB table to store the last successfully processed Kinesis stream sequence numbers, so that the next time the application starts, it resumes from where it left off. But it seems that the Flink Kinesis Consumer does not maintain any such sequence numbers in any persistent store. As a result, we need to rely upon the ShardIteratorType (TRIM_HORIZON, LATEST, etc.) to decide where to resume Flink …
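
The Flink Kinesis Consumer keeps the per-shard sequence numbers in Flink's own checkpointed state rather than in DynamoDB, so resuming where it left off requires checkpointing to be enabled and the job to be restarted from a checkpoint or savepoint. A minimal sketch, with the interval, state path, and stream name as illustrative assumptions:

    import java.util.Properties;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.runtime.state.filesystem.FsStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
    import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;

    public class ResumableKinesisJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Snapshot operator state, including the consumer's per-shard sequence
            // numbers, every 60 seconds, and keep it on a durable filesystem.
            env.enableCheckpointing(60_000);
            env.setStateBackend(new FsStateBackend("file:///tmp/flink-checkpoints"));

            Properties config = new Properties();
            config.setProperty(ConsumerConfigConstants.AWS_REGION, "us-east-1");
            // Only consulted when the job starts with no restored state; after a
            // restore, the consumer continues from the checkpointed sequence numbers.
            config.setProperty(ConsumerConfigConstants.STREAM_INITIAL_POSITION, "TRIM_HORIZON");

            env.addSource(new FlinkKinesisConsumer<>("my-stream", new SimpleStringSchema(), config))
                .print();

            env.execute("Resumable Kinesis consumer");
        }
    }

Note that checkpoints are discarded by default when a job is cancelled, so for planned restarts a savepoint (or retained externalized checkpoints) is the usual way to carry the sequence numbers across runs.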

Can Flink write results into multiple files (like Hadoop's MultipleOutputFormat)?

落爺英雄遲暮 submitted on 2020-01-29 09:42:29
Question: I'm using Apache Flink's DataSet API. I want to implement a job that writes multiple results into different files. How can I do that? Answer 1: You can add as many data sinks to a DataSet program as you need. For example, in a program like this:

    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    DataSet<Tuple3<String, Long, Long>> data = env.readCsvFile(...).types(String.class, Long.class, Long.class);

    // apply MapFunction and emit
    data.map(new YourMapper()).writeAsText("/foo/bar");

    // apply FilterFunction and emit
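
A compact, self-contained sketch of the same idea; the sample data, the filter, and the output paths are illustrative assumptions, and each additional writeAsText call simply registers another sink in the same job:

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;

    public class MultipleSinksJob {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            DataSet<Tuple2<String, Long>> data = env.fromElements(
                Tuple2.of("a", 1L), Tuple2.of("b", 2L), Tuple2.of("a", 3L));

            // First sink: all records.
            data.writeAsText("/tmp/out/all");

            // Second sink: only records with key "a"; each sink gets its own path.
            data.filter(t -> t.f0.equals("a")).writeAsText("/tmp/out/only-a");

            // A single execute() runs the whole plan and writes both outputs.
            env.execute("DataSet job with multiple sinks");
        }
    }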