apache-flink

Using a Cassandra database query as the source for a Flink program

不打扰是莪最后的温柔 submitted on 2020-03-05 04:55:38
Question: I have a Cassandra database whose data has to reach my Flink program over a socket, like a stream, for stream processing. So I wrote a simple client program that reads data from Cassandra and sends it to the socket; I also wrote the Flink program as the server side. In fact, my client program is simple and does not use any Flink instructions; it just sends a Cassandra row in string format to the socket, and the server must receive the row. First, I run the Flink program to listen for the client, and then …
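
A minimal sketch of the server side of such a setup, assuming the client writes one Cassandra row per line to a socket on port 9999; the host, port, and the print sink are illustrative assumptions, not details from the question:

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CassandraRowsOverSocket {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Listen on the socket the client writes to; each line is expected
            // to be one Cassandra row serialized as a string.
            DataStream<String> rows = env.socketTextStream("localhost", 9999, "\n");

            // Placeholder processing: just print each received row.
            rows.print();

            env.execute("Cassandra rows over socket");
        }
    }

Flink keeps the socket source open and processes rows as the client sends them, so the client side only needs a plain socket connection and no Flink dependencies.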

Kinesis Streams and Flink

浪尽此生 submitted on 2020-03-05 00:25:26
Question: I have a question regarding sharding data in a Kinesis stream. I would like to use a random partition key when sending user data to my Kinesis stream so that the data in the shards is evenly distributed. For the sake of making this question simpler, I would then like to aggregate the user data by keying off of a userId in my Flink application. My question is this: if the shards are randomly partitioned so that data for one userId is spread across multiple Kinesis shards, can Flink handle …
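
Flink's keyBy repartitions records by key across the cluster regardless of how they were sharded in Kinesis, so all records for one userId reach the same parallel task. A minimal sketch, assuming records arrive as plain "userId,payload" strings and counting events per user; the stream name, region, and record format are illustrative assumptions:

    import java.util.Properties;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
    import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;

    public class KeyedKinesisCounts {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            Properties consumerConfig = new Properties();
            consumerConfig.setProperty(ConsumerConfigConstants.AWS_REGION, "us-east-1");

            env.addSource(new FlinkKinesisConsumer<>(
                    "user-events", new SimpleStringSchema(), consumerConfig))
                // Assumed record format "userId,payload"; emit (userId, 1) per event.
                .map(record -> Tuple2.of(record.split(",")[0], 1L))
                .returns(Types.TUPLE(Types.STRING, Types.LONG))
                // keyBy shuffles by userId, so all events for one user reach the
                // same task even if they were spread over many Kinesis shards.
                .keyBy(t -> t.f0)
                .sum(1)
                .print();

            env.execute("Keyed aggregation over randomly sharded Kinesis data");
        }
    }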

Why does Apache Flink need Watermarks for Event Time Processing?

空扰寡人 submitted on 2020-02-28 06:53:26
Question: Can someone explain event timestamps and watermarks properly? I understood them from the docs, but it is not so clear. A real-life example or layman's definition would help. Also, if possible, give an example (along with a code snippet that explains it). Thanks in advance. Answer 1: Here's an example that illustrates why we need watermarks, and how they work. In this example we have a stream of timestamped events that arrive somewhat out of order, as shown below. The numbers shown are event-time …
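
A minimal sketch of assigning event timestamps and watermarks with a bounded-out-of-orderness strategy; the three-second bound, the sample data, and the placeholder aggregation are illustrative assumptions, not part of the original answer:

    import java.time.Duration;
    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    public class WatermarkSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // (key, event-time in ms) pairs that arrive slightly out of order.
            env.fromElements(
                    Tuple2.of("a", 1000L), Tuple2.of("a", 4000L),
                    Tuple2.of("a", 2000L), Tuple2.of("a", 9000L))
                // Watermarks trail the highest timestamp seen so far by 3 seconds,
                // so events up to 3 seconds late still land in the right window.
                .assignTimestampsAndWatermarks(
                    WatermarkStrategy
                        .<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                        .withTimestampAssigner((event, ts) -> event.f1))
                .keyBy(t -> t.f0)
                // The window only fires once the watermark passes its end time.
                .window(TumblingEventTimeWindows.of(Time.seconds(5)))
                .sum(1)  // placeholder aggregation over the second field
                .print();

            env.execute("Watermark sketch");
        }
    }

With a finite input like fromElements, a final watermark is emitted when the input ends, so the windows still fire before the job finishes.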

Apache Flink: How to apply multiple counting window functions?

假装没事ソ submitted on 2020-02-26 10:05:27
Question: I have a keyed stream of data and need to compute counts over tumbling windows of different lengths (1 minute, 5 minutes, 1 day, 1 week). Is it possible to compute all four window counts in a single application? Answer 1: Yes, that's possible. If you are using event time, you can simply cascade the windows with increasing time intervals. So you do:

    DataStream<String> data = ...
    // append a Long 1 to each record to count it.
    DataStream<Tuple2<String, Long>> withOnes = data.map(new AppendOne());
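
A sketch of how the cascade might continue from the snippet above; AppendOne, the key selector, and the five-minute stage are illustrative, the same pattern extends to the 1-day and 1-week windows, and the usual TumblingEventTimeWindows and Time imports are assumed:

    // Count per key and 1-minute tumbling event-time window.
    DataStream<Tuple2<String, Long>> countsPerMinute = withOnes
        .keyBy(t -> t.f0)
        .window(TumblingEventTimeWindows.of(Time.minutes(1)))
        .sum(1);

    // The 1-minute results carry their window's end timestamp, so they can be
    // re-windowed into 5-minute counts (and those into 1-day, 1-week, ...).
    DataStream<Tuple2<String, Long>> countsPerFiveMinutes = countsPerMinute
        .keyBy(t -> t.f0)
        .window(TumblingEventTimeWindows.of(Time.minutes(5)))
        .sum(1);

    countsPerMinute.print();
    countsPerFiveMinutes.print();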

Getting Service: AmazonKinesis; Status Code: 502 with apache-flink and localstack Kinesis

北城余情 submitted on 2020-02-25 02:22:47
Question: My local setup consists of local apache-flink (installed via brew) and localstack with the Kinesis service running. My docker-compose has:

    localstack:
      image: localstack/localstack:0.10.7
      environment:
        - SERVICES=kinesis
      ports:
        - "4568:4568"

and my Kinesis consumer:

    kinesisConsumerConfig.setProperty(ConsumerConfigConstants.AWS_ACCESS_KEY_ID, "123");
    kinesisConsumerConfig.setProperty(ConsumerConfigConstants.AWS_SECRET_ACCESS_KEY, "123");
    kinesisConsumerConfig.setProperty(ConsumerConfigConstants…
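
A sketch of pointing the consumer at localstack instead of AWS via the connector's endpoint setting; the dummy credentials and port come from the setup above, while whether this alone clears the 502 depends on the localstack and connector versions (some combinations also need HTTPS or CBOR-related tweaks):

    Properties kinesisConsumerConfig = new Properties();
    kinesisConsumerConfig.setProperty(ConsumerConfigConstants.AWS_ACCESS_KEY_ID, "123");
    kinesisConsumerConfig.setProperty(ConsumerConfigConstants.AWS_SECRET_ACCESS_KEY, "123");
    // Send requests to the localstack Kinesis port instead of the real AWS endpoint.
    // Depending on the connector version, AWS_ENDPOINT is used either alongside or
    // instead of AWS_REGION.
    kinesisConsumerConfig.setProperty(ConsumerConfigConstants.AWS_ENDPOINT, "http://localhost:4568");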

Flink job distribution over cluster nodes

Deadly submitted on 2020-02-24 11:11:28
Question: We have 4 jobs running over 3 nodes with 4 slots each. On Flink 1.3.2 the jobs were evenly distributed across the nodes. After upgrading to Flink 1.5, each job runs on a single node (spilling over to another only when no slots are left). Is there a way to return to an even distribution? The jobs are not even in load, which causes some nodes to work harder than others. Answer 1: An answer I received from the Flink mailing list, "Re: Flink 1.5 job distribution over cluster nodes": Hi Shachar, …

Processing time windows don't work on finite data sources in Apache Flink

两盒软妹~` submitted on 2020-02-04 05:50:07
Question: I'm trying to apply a very simple window function to a finite data stream in Apache Flink (locally, no cluster). Here's the example:

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env
      .fromCollection(List("a", "b", "c", "d", "e"))
      .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(1)))
      .trigger(ProcessingTimeTrigger.create)
      .process(new ProcessAllWindowFunction[String, String, TimeWindow] {
        override def process(context: Context, elements: Iterable[String], out: Collector…

Flink Kinesis Consumer not storing last successfully processed sequence nos

*爱你&永不变心* submitted on 2020-02-03 16:45:11
Question: We are using the Flink Kinesis Consumer to consume data from a Kinesis stream into our Flink application. The KCL library uses a DynamoDB table to store the last successfully processed Kinesis stream sequence numbers, so that the next time the application starts, it resumes from where it left off. But it seems that the Flink Kinesis Consumer does not maintain any such sequence numbers in any persistent store. As a result, we need to rely upon the ShardIteratorType (TRIM_HORIZON, LATEST, etc.) to decide where to resume Flink …
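
The Flink Kinesis Consumer keeps the per-shard sequence numbers in Flink's own checkpointed state rather than in DynamoDB, so resuming where it left off requires checkpointing to be enabled and the job to be restarted from a checkpoint or savepoint. A minimal sketch, with the interval, state path, and stream name as illustrative assumptions:

    import java.util.Properties;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.runtime.state.filesystem.FsStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
    import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;

    public class ResumableKinesisJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Snapshot operator state, including the consumer's per-shard sequence
            // numbers, every 60 seconds, and keep it on a durable filesystem.
            env.enableCheckpointing(60_000);
            env.setStateBackend(new FsStateBackend("file:///tmp/flink-checkpoints"));

            Properties config = new Properties();
            config.setProperty(ConsumerConfigConstants.AWS_REGION, "us-east-1");
            // Only consulted when the job starts with no restored state; after a
            // restore, the consumer continues from the checkpointed sequence numbers.
            config.setProperty(ConsumerConfigConstants.STREAM_INITIAL_POSITION, "TRIM_HORIZON");

            env.addSource(new FlinkKinesisConsumer<>("my-stream", new SimpleStringSchema(), config))
                .print();

            env.execute("Resumable Kinesis consumer");
        }
    }

Note that checkpoints are discarded by default when a job is cancelled, so for planned restarts a savepoint (or retained externalized checkpoints) is the usual way to carry the sequence numbers across runs.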

Can Flink write results into multiple files (like Hadoop's MultipleOutputFormat)?

落爺英雄遲暮 submitted on 2020-01-29 09:42:29
Question: I'm using Apache Flink's DataSet API. I want to implement a job that writes multiple results into different files. How can I do that? Answer 1: You can add as many data sinks to a DataSet program as you need. For example, in a program like this:

    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    DataSet<Tuple3<String, Long, Long>> data = env.readCsvFile(...).types(String.class, Long.class, Long.class);

    // apply MapFunction and emit
    data.map(new YourMapper()).writeAsText("/foo/bar");

    // apply FilterFunction and emit
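
A compact, self-contained sketch of the same idea; the sample data, the filter, and the output paths are illustrative assumptions, and each additional writeAsText call simply registers another sink in the same job:

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;

    public class MultipleSinksJob {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            DataSet<Tuple2<String, Long>> data = env.fromElements(
                Tuple2.of("a", 1L), Tuple2.of("b", 2L), Tuple2.of("a", 3L));

            // First sink: all records.
            data.writeAsText("/tmp/out/all");

            // Second sink: only records with key "a"; each sink gets its own path.
            data.filter(t -> t.f0.equals("a")).writeAsText("/tmp/out/only-a");

            // A single execute() runs the whole plan and writes both outputs.
            env.execute("DataSet job with multiple sinks");
        }
    }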