apache-flink

IOException while connecting to Twitter Streaming API with Apache Flink

别等时光非礼了梦想. submitted on 2019-12-10 16:38:21
Question: I wrote a small Scala program which uses the Apache Flink Streaming API to read Twitter tweets.

object TwitterWordCount {

  private val properties = "/home/twitter-login.properties"

  def main(args: Array[String]) {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val twitterStream = env.addSource(new TwitterSource(properties))
    val tweets = twitterStream
      .flatMap(new JSONParseFlatMap[String, String] {
        override def flatMap(in: String, out: Collector[String]): Unit = {
          if (getString(in,
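Note: the exception itself is cut off above, but a common source of connection failures is how the credentials are supplied. In the 1.x versions of the flink-connector-twitter module, TwitterSource is configured with a java.util.Properties object rather than a path to a properties file. A minimal sketch along those lines (all credential values are placeholders):

import java.util.Properties

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.twitter.TwitterSource

object TwitterWordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Placeholder credentials; supply your own Twitter app keys here.
    val props = new Properties()
    props.setProperty(TwitterSource.CONSUMER_KEY, "<consumer-key>")
    props.setProperty(TwitterSource.CONSUMER_SECRET, "<consumer-secret>")
    props.setProperty(TwitterSource.TOKEN, "<access-token>")
    props.setProperty(TwitterSource.TOKEN_SECRET, "<access-token-secret>")

    // Raw tweet JSON arrives as strings; print is a stand-in for the real parsing logic.
    val tweets: DataStream[String] = env.addSource(new TwitterSource(props))
    tweets.print()

    env.execute("Twitter word count")
  }
}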

How to build and use flink-connector-kinesis?

丶灬走出姿态 submitted on 2019-12-10 16:34:11
Question: I'm trying to use Apache Flink with AWS Kinesis. The documentation says that I have to build the connector on my own. Therefore, I built the connector, added the jar file to my project, and also put the dependency in my pom.xml file.

<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-kinesis_2.11</artifactId>
  <version>1.6.1</version>
</dependency>

However, when I tried to build using mvn clean package I got an error message like this: [INFO] -----------------------<
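Note: the question is cut off before the actual error, but for Flink releases of this vintage the Kinesis connector was not published to Maven Central because of its Amazon Software License dependency, so the artifact has to be installed into the local Maven repository from the Flink sources before the pom dependency above can resolve. A rough sketch of the documented procedure (version taken from the question; exact flags may differ per release):

# check out the Flink sources for the matching release
git clone -b release-1.6.1 https://github.com/apache/flink.git
cd flink

# build with the Kinesis profile enabled; this installs
# flink-connector-kinesis_2.11:1.6.1 into the local ~/.m2 repository
mvn clean install -Pinclude-kinesis -DskipTests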

Flink dynamic scaling

℡╲_俬逩灬. submitted on 2019-12-10 15:51:31
Question: I am currently studying scalability in Flink. Starting from version 1.2.0, dynamic rescaling was introduced. I am looking at scaling a long-running job which reads data from a Kafka source. Questions regarding dynamic rescaling: to scale out my Flink application, for example by adding new task managers, must I restart the job / YARN session to use the newly added resources? I think it's possible to write a YARN client that deploys new task managers and makes them talk to the job manager; is that already
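Note: the question is truncated, but at least in the Flink releases of this era rescaling a running job means taking a savepoint and restarting the job with a new parallelism; newly registered task managers are only used from that point on. A rough sketch with the standard CLI (job id, savepoint path, and jar name are placeholders):

# take a savepoint of the running job
bin/flink savepoint <jobId>

# cancel the old job, then resubmit it from the savepoint with a higher parallelism
bin/flink cancel <jobId>
bin/flink run -s <savepointPath> -p 8 my-streaming-job.jar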

Apache Flink: ProcessWindowFunction implementation

三世轮回 submitted on 2019-12-10 15:36:57
Question: I am trying to use a ProcessWindowFunction in my Apache Flink project using Scala. Unfortunately, I already fail at implementing a basic ProcessWindowFunction as it is used in the Apache Flink documentation. This is my code:

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.time.Time
import org.fiware.cosmos.orion.flink.connector.{NgsiEvent, OrionSource}
import org.apache
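Note: the code above is cut off after the imports. For reference, a basic keyed ProcessWindowFunction in the Scala API follows the pattern from the Flink documentation; the tuple type, window size, and socket source below are only illustrative, not the Orion/NGSI setup from the question:

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

// Counts elements per key and window: IN = (String, Long), OUT = String,
// KEY = String, W = TimeWindow.
class CountPerWindow extends ProcessWindowFunction[(String, Long), String, String, TimeWindow] {
  override def process(key: String,
                       context: Context,
                       elements: Iterable[(String, Long)],
                       out: Collector[String]): Unit = {
    out.collect(s"key=$key window=${context.window} count=${elements.size}")
  }
}

object ProcessWindowExample {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Unbounded text source (e.g. `nc -lk 9999`); each line becomes a (word, 1L) pair.
    val input: DataStream[(String, Long)] = env
      .socketTextStream("localhost", 9999)
      .map(word => (word, 1L))

    input
      .keyBy(_._1)
      .timeWindow(Time.seconds(10))
      .process(new CountPerWindow)
      .print()

    env.execute("ProcessWindowFunction example")
  }
}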

zipWithIndex on Apache Flink

陌路散爱 submitted on 2019-12-10 14:32:00
Question: I'd like to assign each row of my input an id, which should be a number from 0 to N - 1, where N is the number of rows in the input. Roughly, I'd like to be able to do something like the following:

val data = sc.textFile(textFilePath, numPartitions)
val rdd = data.map(line => process(line))
val rddMatrixLike = rdd.zipWithIndex.map { case (v, idx) => someStuffWithIndex(idx, v) }

but in Apache Flink. Is it possible?

Answer 1: This is now part of the 0.10-SNAPSHOT release of Apache Flink.
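Note: the answer is cut short above. In the DataSet API the utility lives in the Scala utils package, which adds zipWithIndex (and zipWithUniqueId) as extension methods on DataSet. A minimal sketch, with the input path and per-row processing as placeholders:

import org.apache.flink.api.scala._
import org.apache.flink.api.scala.utils._ // brings zipWithIndex into scope on DataSet

object ZipWithIndexExample {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val data: DataSet[String] = env.readTextFile("/path/to/input") // placeholder path

    // Assigns each element a consecutive Long index from 0 to N - 1.
    val indexed: DataSet[(Long, String)] = data.zipWithIndex

    indexed
      .map { case (idx, line) => s"$idx -> $line" } // placeholder for someStuffWithIndex
      .print()
  }
}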

Apache Flink: NullPointerException caused by TupleSerializer

本秂侑毒 submitted on 2019-12-10 13:31:43
Question: When I execute my Flink application it gives me this NullPointerException:

2017-08-08 13:21:57,690 INFO com.datastax.driver.core.Cluster - New Cassandra host /127.0.0.1:9042 added
2017-08-08 13:22:02,427 INFO org.apache.flink.runtime.taskmanager.Task - TriggerWindow(TumblingEventTimeWindows(30000), ListStateDescriptor{serializer=org.apache.flink.api.common.typeutils.base.ListSerializer@15d1c80b}, EventTimeTrigger(), WindowedStream.apply(CoGroupedStreams.java:302)) -> Filter -> Flat Map ->

Measure job execution time in Flink

和自甴很熟 submitted on 2019-12-10 13:15:54
Question: Is there any way to measure job execution time in Apache Flink when submitting the job to Flink using the command line? P.S. I want the Flink API to give me the time rather than measuring it myself in bash by noting the start and end times.

Answer 1: The ExecutionEnvironment.execute() method returns a JobExecutionResult object containing the job runtime. You could for example do something like this:

// execute program
JobExecutionResult result = env.execute("My Flink Job");
System.out.println("The job
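Note: the answer snippet is cut off above; the runtime itself is exposed through JobExecutionResult.getNetRuntime(). A small Scala sketch of the same idea (the pipeline and output path are just placeholders):

import java.util.concurrent.TimeUnit

import org.apache.flink.api.scala._

object RuntimeExample {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Placeholder pipeline; any job works here.
    env.fromElements(1, 2, 3).map(_ * 2).writeAsText("/tmp/runtime-example-out")

    // execute() blocks until the job finishes and returns a JobExecutionResult.
    val result = env.execute("My Flink Job")
    println(s"The job took ${result.getNetRuntime(TimeUnit.MILLISECONDS)} ms to execute")
  }
}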

Degree of parallelism in Apache Flink

跟風遠走 submitted on 2019-12-10 12:43:15
Question: Can I set a different degree of parallelism for different parts of the task in our program in Flink? For instance, how does Flink interpret the following sample code? The two custom partitioners MyPartitioner1 and MyPartitioner2 partition the input data into 4 and 2 partitions.

partitionedData1 = inputData1
    .partitionCustom(new MyPartitioner1(), 1);
env.setParallelism(4);
DataSet<Tuple2<Integer, Integer>> output1 = partitionedData1
    .mapPartition(new calculateFun());
partitionedData2 = inputData2
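Note: the sample is truncated, but the general question has a direct answer: besides the environment-wide default set with env.setParallelism, each operator can be given its own parallelism by calling setParallelism on that operator. A sketch in Scala (data, functions, and output path are placeholders, not the ones from the question):

import org.apache.flink.api.scala._

object PerOperatorParallelism {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(2) // default for every operator that does not override it

    val input: DataSet[(Int, Int)] = env.fromElements((1, 1), (2, 2), (3, 3))

    // This map runs with 4 parallel instances, overriding the environment default.
    val doubled = input
      .map(t => (t._1, t._2 * 2))
      .setParallelism(4)

    // This filter runs with the environment default of 2.
    val filtered = doubled.filter(_._2 > 2)

    filtered.writeAsText("/tmp/parallelism-example-out") // placeholder output path
    env.execute("Per-operator parallelism example")
  }
}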

Flink Kafka - how to make the app run in parallel?

被刻印的时光 ゝ submitted on 2019-12-10 11:17:49
Question: I am creating an app in Flink to read messages from a topic, do some simple processing on them, and write the results to a different topic. My code does work, however it does not run in parallel. How do I do that? It seems my code runs only on one thread/block. On the Flink Web Dashboard the app goes to running status, but there is only one block shown in the overview subtasks, and Bytes Received / Sent and Records Received / Sent are always zero (no update). Here is my code, please assist me in learning how to split
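Note: the code itself is cut off, but the usual reasons a Kafka pipeline shows only a single subtask are a job parallelism of 1 and/or a source topic with a single partition (the Kafka source cannot use more parallel instances than there are partitions). A rough sketch of a pipeline configured to run with several parallel subtasks, assuming Flink's universal Kafka connector; topic names, broker address, and group id are placeholders:

import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer, FlinkKafkaProducer}

object ParallelKafkaJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Job-wide parallelism; only effective for the source if "input-topic" has >= 4 partitions.
    env.setParallelism(4)

    val props = new Properties()
    props.setProperty("bootstrap.servers", "localhost:9092") // placeholder broker
    props.setProperty("group.id", "parallel-kafka-example")  // placeholder consumer group

    env
      .addSource(new FlinkKafkaConsumer[String]("input-topic", new SimpleStringSchema(), props))
      .map(_.toUpperCase) // stand-in for the real processing step
      .addSink(new FlinkKafkaProducer[String]("output-topic", new SimpleStringSchema(), props))

    env.execute("Parallel Kafka pipeline")
  }
}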

How to increase Flink taskmanager.numberOfTaskSlots to run it without a Flink server (in IDE or fat jar)

耗尽温柔 submitted on 2019-12-10 09:28:37
Question: I have a question about running a Flink streaming job in the IDE or as a fat jar without deploying it to a Flink server. The problem is that I cannot run it in the IDE when I have more than 1 task slot in my job.

public class StreamingJob {
    public static void main(String[] args) throws Exception {
        // set up the streaming execution environment
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        Properties kafkaProperties = new Properties();
        kafkaProperties.setProperty(
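Note: the snippet is cut off above, but when the job runs inside the IDE it uses a local mini-cluster whose number of task slots can be raised through a Configuration passed to the local environment. A sketch of that idea in Scala, using the Java StreamExecutionEnvironment factory method that accepts a Configuration (the slot count and parallelism values are only examples):

import org.apache.flink.configuration.{Configuration, TaskManagerOptions}
import org.apache.flink.streaming.api.environment.{StreamExecutionEnvironment => JStreamEnv}

object LocalSlotsExample {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Give the embedded (local) task manager 4 slots instead of the default.
    conf.setInteger(TaskManagerOptions.NUM_TASK_SLOTS, 4)

    // Local environment with parallelism 4, backed by the configuration above.
    val env = JStreamEnv.createLocalEnvironment(4, conf)

    env
      .fromElements("a", "b", "c") // placeholder source instead of the Kafka consumer
      .print()

    env.execute("Local environment with more task slots")
  }
}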