apache-flink

Flink: How to pass extra JVM options to TaskManager and JobManager

Submitted by 我与影子孤独终老i on 2019-12-02 03:22:41
Question: I am trying to submit a Flink job on YARN using the command below:

```
/usr/flink-1.3.2/bin/flink run -yd -yn 1 -ynm MyApp -ys 1 -yqu default -m yarn-cluster -c com.mycompany.Driver -j /usr/myapp.jar -Denv.java.opts="-Dzkconfig.parent /app-config_127.0.0.1 -Dzk.hosts localhost:2181 -Dsax.zookeeper.root /app"
```

I can see env.java.opts in the Flink client log, but once the application is submitted to YARN, these Java options are not available. Because the extra JVM options are missing, the application throws …
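A likely fix, sketched under the assumption that the goal is to get these options onto the JobManager and TaskManager JVMs: `env.java.opts` is a Flink configuration key, not a regular JVM flag, so it has to be passed as a YARN dynamic property with `-yD` (or set in `flink-conf.yaml`) rather than appended with `-D` to the client command line. The `key=value` form of the system properties below is my normalization of the space-separated form in the question:

```sh
# pass env.java.opts as a Flink dynamic property so it reaches the
# JobManager and TaskManager JVMs started inside the YARN containers
/usr/flink-1.3.2/bin/flink run -m yarn-cluster -yd -yn 1 -ynm MyApp -ys 1 -yqu default \
  -yD env.java.opts="-Dzkconfig.parent=/app-config_127.0.0.1 -Dzk.hosts=localhost:2181 -Dsax.zookeeper.root=/app" \
  -c com.mycompany.Driver -j /usr/myapp.jar
```

Setting `env.java.opts: "..."` in `flink-conf.yaml` achieves the same thing for every job submitted to the session.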

How to concatenate two streams in Apache Flink

Submitted by  ̄綄美尐妖づ on 2019-12-02 03:16:38
E.g. I want to compose streams of 1, 2, 3 and 4, 5 into a single one, so the result should be: 1, 2, 3, 4, 5. In other words: once the first source is exhausted, get elements from the second one. My closest attempt, which unfortunately does not preserve item order, is:

```scala
val a = streamEnvironment.fromElements(1, 2, 3)
val b = streamEnvironment.fromElements(4, 5)
val c = a.union(b)
c.map(x => println(s"X=$x")) // X=4, 5, 1, 2, 3 or something like that
```

I also made a similar attempt with datetimes included, but with the same result. This is not possible right now, at least not with the high-level DataStream API. It might …
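To make the interleaving easy to reproduce, here is a self-contained version of the attempt above (a sketch of the problem, not a fix): `union` merges elements from both sources as they arrive, and the two source tasks run concurrently, so there is no guarantee that `a` is drained before `b` starts.

```scala
import org.apache.flink.streaming.api.scala._

object UnionOrder extends App {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  env.setParallelism(1) // the two sources are still separate, concurrent tasks

  val a = env.fromElements(1, 2, 3)
  val b = env.fromElements(4, 5)

  // union interleaves; a true "concat" would need event-time reordering
  // or a custom operator, as the excerpt notes
  a.union(b).map(x => println(s"X=$x"))

  env.execute("union-order")
}
```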

flink InputStream of class class org.apache.commons.compress.archivers.zip.ZipFile$1 is not implementing InputStreamStatistics

Submitted by 假如想象 on 2019-12-02 02:53:24
I was trying to load an Excel file into a POI workbook in a Flink program. It fails with an error like this:

```
Caused by: java.lang.IllegalArgumentException: InputStream of class class org.apache.commons.compress.archivers.zip.ZipFile$1 is not implementing InputStreamStatistics.
    at org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.<init>(ZipArchiveThresholdInputStream.java:63)
    at org.apache.poi.openxml4j.util.ZipSecureFile.getInputStream(ZipSecureFile.java:147)
    at org.apache.poi.openxml4j.util.ZipSecureFile.getInputStream(ZipSecureFile.java:34)
    at org.apache.poi.openxml4j.util.ZipFileZipEntrySource…
```
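This error usually indicates two versions of commons-compress on the classpath: recent POI requires the InputStreamStatistics interface, which only exists in commons-compress 1.17 and later, while the Flink distribution pulls in an older copy. A sketch of the usual fix, pinning the newer version in an sbt build (the version number is illustrative; Maven's dependencyManagement achieves the same):

```scala
// build.sbt -- force every transitive reference to a commons-compress
// version that actually contains InputStreamStatistics (1.17+)
dependencyOverrides += "org.apache.commons" % "commons-compress" % "1.18"
```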

Apache Flink: Count window with timeout

Submitted by 大憨熊 on 2019-12-02 02:34:22
Here is a simple code example to illustrate my question:

```scala
case class Record(key: String, value: Int)

object Job extends App {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  val data = env.fromElements(
    Record("01", 1), Record("02", 2), Record("03", 3),
    Record("04", 4), Record("05", 5))

  val step1 = data.filter(record => record.value % 3 != 0) // introduces some data loss
  val step2 = data.map(r => Record(r.key, r.value * 2))
  val step3 = data.map(r => Record(r.key, r.value * 3))

  val merged = step1.union(step2, step3)
  val keyed = merged.keyBy(0)
  val windowed = keyed…
```
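The usual recipe for "count window with timeout" is a GlobalWindows window with a custom trigger that fires after N elements or after a timeout, whichever comes first. Below is a sketch of such a trigger; the class name and the count/timeout parameters are mine, and a production version would also delete the pending timer in clear():

```scala
import org.apache.flink.api.common.functions.ReduceFunction
import org.apache.flink.api.common.state.ReducingStateDescriptor
import org.apache.flink.api.common.typeinfo.BasicTypeInfo
import org.apache.flink.streaming.api.windowing.triggers.{Trigger, TriggerResult}
import org.apache.flink.streaming.api.windowing.windows.Window

// fires when `maxCount` elements have arrived OR `timeoutMs` of processing
// time has passed since the first element, whichever happens first
class CountWithTimeoutTrigger[T, W <: Window](maxCount: Long, timeoutMs: Long)
    extends Trigger[T, W] {

  private val countDesc = new ReducingStateDescriptor[java.lang.Long](
    "count",
    new ReduceFunction[java.lang.Long] {
      override def reduce(a: java.lang.Long, b: java.lang.Long): java.lang.Long = a + b
    },
    BasicTypeInfo.LONG_TYPE_INFO)

  override def onElement(element: T, timestamp: Long, window: W,
                         ctx: Trigger.TriggerContext): TriggerResult = {
    val count = ctx.getPartitionedState(countDesc)
    if (count.get() == null) { // first element of this pane: arm the timeout
      ctx.registerProcessingTimeTimer(ctx.getCurrentProcessingTime + timeoutMs)
    }
    count.add(1L)
    if (count.get() >= maxCount) {
      count.clear()
      TriggerResult.FIRE_AND_PURGE
    } else {
      TriggerResult.CONTINUE
    }
  }

  override def onProcessingTime(time: Long, window: W,
                                ctx: Trigger.TriggerContext): TriggerResult = {
    ctx.getPartitionedState(countDesc).clear()
    TriggerResult.FIRE_AND_PURGE // timeout hit: flush whatever is buffered
  }

  override def onEventTime(time: Long, window: W,
                           ctx: Trigger.TriggerContext): TriggerResult =
    TriggerResult.CONTINUE

  override def clear(window: W, ctx: Trigger.TriggerContext): Unit =
    ctx.getPartitionedState(countDesc).clear()
}
```

It would be wired in roughly as `keyed.window(GlobalWindows.create()).trigger(new CountWithTimeoutTrigger[Record, GlobalWindow](100, 10000L))`, followed by the window function.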

Read data from Cassandra for processing in Flink

Submitted by 两盒软妹~` on 2019-12-01 20:42:07
I have to process data streams from Kafka using Flink as the streaming engine. To do the analysis on the data, I need to query some tables in Cassandra. What is the best way to do this? I have been looking for Scala examples for such cases, but I couldn't find any. How can data from Cassandra be read in Flink using Scala as the programming language? "Read & write data into cassandra using apache flink Java API" is another question along the same lines, with multiple approaches mentioned in the answers. I would like to know the best approach in my case. Also, most of the examples …
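One common pattern (a sketch of my own, not taken from the answers in this excerpt) is to open a plain DataStax driver session once per parallel task in a RichMapFunction and do point lookups against Cassandra while the Kafka stream is processed. The contact point, keyspace, and table names below are made up:

```scala
import com.datastax.driver.core.{Cluster, Session}
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration

// enriches each incoming id with a value looked up in Cassandra
class CassandraLookup extends RichMapFunction[String, (String, String)] {
  @transient private var cluster: Cluster = _
  @transient private var session: Session = _

  override def open(parameters: Configuration): Unit = {
    // one connection per parallel subtask, created on the TaskManager
    cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    session = cluster.connect("my_keyspace")
  }

  override def map(id: String): (String, String) = {
    val row = session.execute(
      s"SELECT value FROM lookup_table WHERE id = '$id'").one()
    (id, if (row != null) row.getString("value") else "unknown")
  }

  override def close(): Unit = {
    if (session != null) session.close()
    if (cluster != null) cluster.close()
  }
}
```

For higher throughput, the same lookup can be wrapped in Flink's AsyncDataStream using the driver's executeAsync, so the stream is not blocked on every query.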

Difference between job, task and subtask in flink

Submitted by 蓝咒 on 2019-12-01 18:41:58
I'm new to Flink and am trying to understand the terms job, task, and subtask. I searched the docs but still did not get it. What's the main difference between them? Tasks and subtasks are explained here: https://ci.apache.org/projects/flink/flink-docs-release-1.7/concepts/runtime.html#tasks-and-operator-chains. A task is an abstraction representing a chain of operators that could be executed in a single thread. Something like a keyBy (which causes a network shuffle to partition the stream by some key) or a change in the parallelism of the pipeline will break the chaining and force operators into separate …
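To make the three terms concrete, here is a tiny pipeline sketch (the names are made up) with comments mapping each term onto it:

```scala
import org.apache.flink.streaming.api.scala._

object TermsDemo extends App {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  env.setParallelism(2)

  val words = env.fromElements("a", "b", "b", "c") // non-parallel source: 1 subtask

  words
    .map(w => (w, 1)) // the keyBy below introduces a network shuffle, so the
    .keyBy(0)         // map and the keyed sum end up in different tasks;
    .sum(1)           // with parallelism 2, each of those tasks has 2 subtasks
    .print()

  // everything submitted together by execute() forms one job
  env.execute("terms-demo")
}
```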

Why flink container vcore size is always 1

Submitted by 半世苍凉 on 2019-12-01 14:12:40
I am running Flink on YARN (more precisely, in an AWS EMR YARN cluster). I read in the Flink documentation and source code that, by default, for each TaskManager container Flink requests as many vcores from YARN as there are slots per TaskManager. And I also confirmed it from the source code:

```java
// Resource requirements for worker containers
int taskManagerSlots = taskManagerParameters.numSlots();
int vcores = config.getInteger(ConfigConstants.YARN_VCORES,
    Math.max(taskManagerSlots, 1));
Resource capability = Resource.newInstance(containerMemorySizeMB, vcores);
```
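A common explanation, worth verifying on the cluster: YARN's CapacityScheduler uses the DefaultResourceCalculator by default, which allocates purely by memory and reports every container as having one vcore regardless of what was requested. Switching to the DominantResourceCalculator makes the vcore request take effect. A sketch of the capacity-scheduler.xml change, under that assumption:

```xml
<!-- capacity-scheduler.xml: make the scheduler account for CPU as well as memory -->
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
```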

Apache Flink: How to count the total number of events in a DataStream

Submitted by 白昼怎懂夜的黑 on 2019-12-01 13:11:00
I have two raw streams, and I am joining those streams; then I want to count the total number of events that have been joined and how many have not. I am doing this by using a map on joinedEventDataStream, as shown below:

```java
joinedEventDataStream.map(new RichMapFunction<JoinedEvent, Object>() {
    @Override
    public Object map(JoinedEvent joinedEvent) throws Exception {
        number_of_joined_events += 1;
        return null;
    }
});
```

Question #1: Is this the appropriate way to count the number of events in the stream?

Question #2: I have noticed a weird behavior, which some of you might not believe. …
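For question #1, incrementing a plain field is fragile: each parallel instance of the map keeps its own copy, and the value is lost on failure and restart. A sketch of the more idiomatic route, Flink's metric system (in Scala for brevity; the metric name is made up):

```scala
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration
import org.apache.flink.metrics.Counter

// counts events flowing through while passing them along unchanged;
// the counter is reported per subtask through Flink's metric system
class CountingMap[T] extends RichMapFunction[T, T] {
  @transient private var counter: Counter = _

  override def open(parameters: Configuration): Unit = {
    counter = getRuntimeContext.getMetricGroup.counter("numJoinedEvents")
  }

  override def map(value: T): T = {
    counter.inc()
    value
  }
}
```

The per-subtask values are visible in the web UI and via metric reporters; summing them gives the total for the job.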

Ordering of Records in Stream

Submitted by 女生的网名这么多〃 on 2019-12-01 12:40:27
Here are some of the queries I have. I have two different streams, stream1 and stream2, in which the elements are in order.

1) Now, when I do keyBy on each of these streams, will the order be maintained? (Since every group here will be sent to one task manager only.) My understanding is that the records will be in order within a group; correct me here.

2) After the keyBy on both of the streams, I am doing a co-group to get the matching and non-matching records. Will the order be maintained here as well, since this also works on a KeyedStream? I am using EventTime, and AscendingTimestampExtractor for …
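For reference, a minimal sketch of the event-time setup the question describes; the Evt case class and its eventTime field are made up for illustration:

```scala
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor

case class Evt(key: String, eventTime: Long)

object OrderingDemo extends App {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

  val stream1 = env.fromElements(Evt("a", 1L), Evt("a", 2L), Evt("b", 3L))

  // AscendingTimestampExtractor assumes timestamps already arrive in order
  // per source; within a key, records keep their order between a pair of
  // directly connected operators
  val keyed = stream1
    .assignTimestampsAndWatermarks(new AscendingTimestampExtractor[Evt] {
      override def extractAscendingTimestamp(e: Evt): Long = e.eventTime
    })
    .keyBy(_.key)
}
```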

How to omit a null value exception in flink-kafka , Any help would do

Submitted by 血红的双手。 on 2019-12-01 12:16:38
Question: I'm trying to write code that raises an alert when the temperature is above a threshold (as defined in the code), but the keyed stream is creating a problem. I'm new to Flink and intermediate in Scala. I need help with this code; I've tried almost everything.

```scala
def main(args: Array[String]): Unit = {
  val TEMPERATURE_THRESHOLD: Double = 50.00
  val see: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
  val properties = new Properties()
  properties.setProperty("bootstrap…
```
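Since the snippet is cut off, here is a hedged, self-contained sketch of such a job; the topic name, bootstrap address, and the assumption that messages are plain numeric strings are all mine. Filtering out null or unparsable payloads before converting to Double is one way to avoid the null-value exception from the title:

```scala
import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

object TempAlert extends App {
  val TEMPERATURE_THRESHOLD: Double = 50.00
  val env = StreamExecutionEnvironment.getExecutionEnvironment

  val properties = new Properties()
  properties.setProperty("bootstrap.servers", "localhost:9092")
  properties.setProperty("group.id", "temperature-alerts")

  val raw = env.addSource(
    new FlinkKafkaConsumer[String]("temperature", new SimpleStringSchema(), properties))

  val alerts = raw
    .filter(s => s != null && s.nonEmpty)              // drop null/empty payloads
    .flatMap(s => scala.util.Try(s.toDouble).toOption) // skip unparsable records
    .filter(_ > TEMPERATURE_THRESHOLD)
    .map(t => s"ALERT: temperature $t above $TEMPERATURE_THRESHOLD")

  alerts.print()
  env.execute("temperature-alerts")
}
```

(In older Flink versions the consumer class is version-suffixed, e.g. FlinkKafkaConsumer011.)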