apache-flink

Flink: How to handle external app configuration changes in Flink

Submitted by 梦想与她 on 2019-11-29 20:57:35
Question: My requirement is to stream millions of records a day, and the processing depends heavily on external configuration parameters. For example, a user can change the required settings in the web application at any time, and once the change is made, the streaming has to continue with the new application config parameters. These are app-level configurations, and we also have some dynamic exclude parameters that every record has to be passed through and filtered on. I see that Flink doesn't have global state …
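The excerpt is cut off above. For context, the approach most often suggested for this kind of requirement is Flink's broadcast state (available since Flink 1.5): the config changes arrive as a second stream and are broadcast to every parallel task. A minimal sketch, assuming string events and a key=value config stream; the stream names and the exclude-filter logic are hypothetical, not from the question:

```java
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class DynamicConfigJob {
    public static DataStream<String> filtered(DataStream<String> events,
                                              DataStream<String> configUpdates) {
        // Broadcast state holding the latest config, replicated to every parallel task.
        final MapStateDescriptor<String, String> configDesc = new MapStateDescriptor<>(
                "appConfig", BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.STRING_TYPE_INFO);
        BroadcastStream<String> configBroadcast = configUpdates.broadcast(configDesc);

        return events.connect(configBroadcast).process(
                new BroadcastProcessFunction<String, String, String>() {
                    @Override
                    public void processElement(String event, ReadOnlyContext ctx,
                                               Collector<String> out) throws Exception {
                        // Read the current config and apply the dynamic exclude filter.
                        String exclude = ctx.getBroadcastState(configDesc).get("exclude");
                        if (exclude == null || !event.contains(exclude)) {
                            out.collect(event);
                        }
                    }

                    @Override
                    public void processBroadcastElement(String update, Context ctx,
                                                        Collector<String> out) throws Exception {
                        // "key=value" updates pushed from the web application, e.g. via Kafka.
                        String[] kv = update.split("=", 2);
                        ctx.getBroadcastState(configDesc).put(kv[0], kv.length > 1 ? kv[1] : "");
                    }
                });
    }
}
```

Updates published to the config stream then take effect on all parallel instances without restarting the job.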

Input of apache_beam.examples.wordcount

Submitted by 大兔子大兔子 on 2019-11-29 17:04:42
I was trying to run the Beam Python SDK example, but I had a problem reading the input: https://cwiki.apache.org/confluence/display/BEAM/Usage+Guide#UsageGuide-RunaPython-SDKPipeline

When I used gs://dataflow-samples/shakespeare/kinglear.txt as the input, the error was:

    apache_beam.io.filesystem.BeamIOError: Match operation failed with exceptions {'gs://dataflow-samples/shakespeare/kinglear.txt': TypeError("__init__() got an unexpected keyword argument 'response_encoding'",)}

When I used my local file, it seemed it didn't actually read the file, and output nothing. The result didn't include …
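The TypeError surfaces inside the filesystem match step rather than in the pipeline code, which suggests a client-library version mismatch rather than a usage error. Not part of the question, but for cross-checking: a minimal read-only pipeline in the Beam Java SDK that exercises the same match/read stage against the same sample file (everything else here is a plain sketch):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class ReadCheck {
    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);
        // The BeamIOError in the question is raised while matching/expanding
        // the input pattern, i.e. at this read stage.
        p.apply("ReadLines", TextIO.read().from("gs://dataflow-samples/shakespeare/kinglear.txt"));
        p.run().waitUntilFinish();
    }
}
```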

Flink Scala API “not enough arguments”

Submitted by 时光毁灭记忆、已成空白 on 2019-11-29 13:20:38
I'm having trouble using the Apache Flink Scala API. For example, even when I take the examples from the official documentation, the Scala compiler gives me tons of compilation errors. Code:

    object TestFlink {
      def main(args: Array[String]) {
        val env = ExecutionEnvironment.getExecutionEnvironment
        val text = env.fromElements(
          "Who's there?",
          "I think I hear them. Stand, ho! Who's there?")

        val counts = text.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
          .map { (_, 1) }
          .groupBy(0)
          .sum(1)

        counts.print()
        env.execute("Scala WordCount Example")
      }
    }

Scala IDE outputs the following for the …
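A note on the likely cause, not confirmed by the truncated excerpt: with the Flink Scala API, "not enough arguments" errors on flatMap/map usually mean the implicit TypeInformation evidence is missing, and the common fix is the wildcard import import org.apache.flink.api.scala._ at the top of the file, rather than importing individual classes.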

How to combine streaming data with large history data set in Dataflow/Beam

Submitted by 泪湿孤枕 on 2019-11-29 12:34:00
Question: I am investigating processing logs from web user sessions via Google Dataflow/Apache Beam and need to combine the user's logs as they come in (streaming) with the history of the user's sessions from the last month. I have looked at the following approaches:

Use a 30-day fixed window: most likely too large a window to fit into memory, and I do not need to update the user's history, just refer to it.

Use CoGroupByKey to join the two data sets, but the two data sets must have the same window size …
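The excerpt ends mid-list. One pattern often suggested for "refer to the history without re-windowing it" is a side input; a minimal sketch in the Beam Java SDK, assuming a bounded history PCollection of per-user counts (all names and types here are hypothetical, not from the question):

```java
import java.util.Map;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

public class EnrichWithHistory {
    // Joins each live event with the user's historical session count via a side input.
    public static PCollection<String> enrich(
            PCollection<KV<String, String>> events,   // streaming: userId -> log line
            PCollection<KV<String, Long>> history) {  // bounded: userId -> sessions last month
        final PCollectionView<Map<String, Long>> historyView =
                history.apply("HistoryAsMap", View.asMap());
        return events.apply("Enrich", ParDo.of(
                new DoFn<KV<String, String>, String>() {
                    @ProcessElement
                    public void processElement(ProcessContext c) {
                        // Look up this user's history in the broadcast map.
                        Long past = c.sideInput(historyView)
                                .getOrDefault(c.element().getKey(), 0L);
                        c.output(c.element().getValue() + " | historicalSessions=" + past);
                    }
                }).withSideInputs(historyView));
    }
}
```

Side inputs are broadcast to every worker, so this only fits if the per-user history map is small enough to hold in memory; otherwise an external lookup or stateful processing is the usual fallback.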

How to use multi-thread consumer in kafka 0.9.0?

Submitted by 爷,独闯天下 on 2019-11-28 23:57:15
The Kafka documentation describes one approach as follows:

One Consumer Per Thread: A simple option is to give each thread its own consumer instance.

My code:

    public class KafkaConsumerRunner implements Runnable {
        private final AtomicBoolean closed = new AtomicBoolean(false);
        private final CloudKafkaConsumer consumer;
        private final String topicName;

        public KafkaConsumerRunner(CloudKafkaConsumer consumer, String topicName) {
            this.consumer = consumer;
            this.topicName = topicName;
        }

        @Override
        public void run() {
            try {
                this.consumer.subscribe(topicName);
                ConsumerRecords<String, String> …
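The question's CloudKafkaConsumer wrapper is not shown, so for reference here is a minimal sketch of the same one-consumer-per-thread pattern against the stock Kafka 0.9 consumer API:

```java
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.atomic.AtomicBoolean;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.errors.WakeupException;

public class ConsumerLoop implements Runnable {
    private final AtomicBoolean closed = new AtomicBoolean(false);
    private final KafkaConsumer<String, String> consumer;

    public ConsumerLoop(Properties props, String topic) {
        // One KafkaConsumer per thread; the consumer itself is not thread-safe.
        this.consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList(topic));
    }

    @Override
    public void run() {
        try {
            while (!closed.get()) {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    // process record.key() / record.value() here
                }
            }
        } catch (WakeupException e) {
            if (!closed.get()) throw e; // ignore only if we are shutting down
        } finally {
            consumer.close();
        }
    }

    // Call from another thread to stop the loop.
    public void shutdown() {
        closed.set(true);
        consumer.wakeup();
    }
}
```

The shutdown/wakeup pairing mirrors the pattern in the KafkaConsumer javadoc: wakeup() makes a blocked poll() throw WakeupException, which is swallowed only when closing deliberately.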

Flink keyBy adding delay; how can I reduce this latency?

Submitted by 醉酒当歌 on 2019-11-28 12:46:00
Question: When I ran a simple Flink application with a KeyedStream, I observed that the latency of an event varies from 0 to 100 ms. Below is the program:

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStream<Long> source = env.addSource(new SourceFunction<Long>() {
        public void run(SourceContext<Long> sourceContext) throws Exception {
            while (true) {
                synchronized (sourceContext.getCheckpointLock()) {
                    sourceContext.collect(System.currentTimeMillis());
                    Thread.sleep …
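A 0-100 ms spread matches Flink's default network buffer timeout of 100 ms, so one thing worth trying (a sketch, not necessarily the asker's fix) is lowering that timeout:

```java
// Flush network buffers more aggressively; the default is 100 ms.
// Lower values reduce latency at the cost of throughput.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setBufferTimeout(5); // milliseconds; 0 flushes after every record
```

If the latency spread disappears with a small buffer timeout, the delay was buffering rather than keyBy itself; keyBy introduces a network shuffle, which is what routes records through those buffered channels.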

How to sort a stream by event time using Flink SQL

Submitted by 不想你离开。 on 2019-11-28 11:26:33
Question: I have an out-of-order DataStream<Event> that I want to sort so that the events are ordered by their event-time timestamps. I've simplified my use case down to the point where my Event class has just a single field, the timestamp:

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = TableEnvironment.getTableEnvironment(env);
        env.setStreamTimeCharacteristic …
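The question's code is cut off, but as a sketch of where it is likely headed: Flink SQL of this era can sort a stream on an ascending time attribute. Assuming the stream already has timestamps and watermarks assigned, and with eventStream as a hypothetical variable name continuing the snippet above:

```java
// Register the stream with "eventTime" declared as the rowtime attribute,
// then sort on it. (Needs org.apache.flink.table.api.Table and
// org.apache.flink.types.Row on the classpath/imports.)
tableEnv.registerDataStream("events", eventStream, "eventTime.rowtime");
Table sorted = tableEnv.sqlQuery("SELECT eventTime FROM events ORDER BY eventTime");
tableEnv.toAppendStream(sorted, Row.class).print();
```

In streaming mode, Flink SQL only supports ORDER BY with a leading ascending time attribute, which is exactly the sort wanted here.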

What is/are the main difference(s) between Flink and Storm?

Submitted by 久未见 on 2019-11-28 02:38:31
Flink has been compared to Spark, which, as I see it, is the wrong comparison because it sets a windowed event-processing system against micro-batching; similarly, it does not make that much sense to me to compare Flink to Samza. In both cases it compares a real-time versus a batched event-processing strategy, even if at a smaller "scale" in the case of Samza. But I would like to know how Flink compares to Storm, which seems conceptually much more similar to it. I have found this (Slide #4) documenting the main difference as "adjustable latency" for Flink. Another hint seems to be an article …
