apache-flink

What is the difference between mini-batch and real-time streaming in practice (not theory)?

Submitted by 不想你离开。 on 2019-12-20 10:46:44
Question: What is the difference between mini-batch and real-time streaming in practice (not in theory)? In theory, I understand that mini-batch batches the data within a given time frame, whereas real-time streaming processes data as it arrives. My biggest question is: why not use mini-batch with an epsilon time frame (say, one millisecond)? Put differently, I would like to understand why one would be a more effective solution than the other. I recently came across one example where mini-batch (Apache …
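As a minimal illustration of the practical difference in Flink terms (all class and variable names here are illustrative, not from the question): a per-record pipeline reacts to every element as it arrives, while wrapping the same input in a one-second window means results are emitted only when the window closes, so the interval itself becomes a lower bound on latency. The operational difference the question is really after (Spark scheduling a new micro-batch job per interval versus Flink keeping long-running operators) is not visible in code this small.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.AllWindowFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class LatencySketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> events = env.socketTextStream("localhost", 9999);

        // Per-record ("real time"): a result is emitted for every element, immediately.
        events.map(value -> "seen: " + value).print();

        // Mini-batch style: results leave the operator at most once per second,
        // no matter how early in the interval an element arrived.
        events.timeWindowAll(Time.seconds(1))
              .apply(new AllWindowFunction<String, Long, TimeWindow>() {
                  @Override
                  public void apply(TimeWindow window, Iterable<String> values, Collector<Long> out) {
                      long count = 0;
                      for (String ignored : values) {
                          count++;
                      }
                      out.collect(count); // one result per window, not per element
                  }
              })
              .print();

        env.execute("latency-sketch");
    }
}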

Apache Flink vs Apache Spark as platforms for large-scale machine learning?

Submitted by ╄→尐↘猪︶ㄣ on 2019-12-20 08:39:50
Question: Could anyone compare Flink and Spark as platforms for machine learning? Which is potentially better for iterative algorithms? Link to the general Flink vs Spark discussion: What is the difference between Apache Spark and Apache Flink? Answer 1: Disclaimer: I'm a PMC member of Apache Flink. My answer focuses on the differences in executing iterations in Flink and Spark. Apache Spark executes iterations by loop unrolling. This means that for each iteration a new set of tasks/operators is scheduled …
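The answer is truncated above; for illustration, a minimal sketch of Flink's native bulk iteration in the DataSet API (placeholder step function, all names assumed) shows what "not unrolling the loop" looks like: the iteration is declared once, and Flink keeps the operators deployed across supersteps instead of scheduling new tasks per iteration.

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.IterativeDataSet;

public class IterationSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Double> initial = env.fromElements(0.0);

        // Declare a bulk iteration with an upper bound of 100 supersteps.
        // The iteration's operators are scheduled once; the intermediate
        // result is fed back to the iteration head on every superstep.
        IterativeDataSet<Double> loop = initial.iterate(100);

        // Placeholder step function: whatever the algorithm does per superstep.
        DataSet<Double> step = loop.map(new MapFunction<Double, Double>() {
            @Override
            public Double map(Double value) {
                return value + 1.0;
            }
        });

        DataSet<Double> result = loop.closeWith(step);
        result.print();
    }
}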

Apache Flink job - simple windowing problem - java.lang.RuntimeException: segment has been freed - Mini Cluster problem

Submitted by 我是研究僧i on 2019-12-20 07:14:47
Question: Apache Flink job - simple windowing problem - java.lang.RuntimeException: segment has been freed. Hi, I am a Flink newbie, and in my job I am trying to use windowing simply to aggregate elements and enable delayed processing:

src = src.timeWindowAll(Time.milliseconds(1000)).process(new BaseDelayingProcessAllWindowFunctionImpl());

The ProcessAllWindowFunction simply collects the input elements:

public class BaseDelayingProcessAllWindowFunction<IN> extends ProcessAllWindowFunction<IN, IN, TimeWindow> { …
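The class body in the question is cut off. A minimal version that simply forwards the buffered elements, which is what the question describes, might look like the following; this is a reconstruction for illustration, not the poster's actual code, and it does not by itself explain the "segment has been freed" error.

import org.apache.flink.streaming.api.functions.windowing.ProcessAllWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class BaseDelayingProcessAllWindowFunction<IN>
        extends ProcessAllWindowFunction<IN, IN, TimeWindow> {

    @Override
    public void process(Context context, Iterable<IN> elements, Collector<IN> out) {
        // Forward everything that was buffered for this window unchanged;
        // the window itself provides the desired delay.
        for (IN element : elements) {
            out.collect(element);
        }
    }
}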

Why does Flink SQL use a cardinality estimate of 100 rows for all tables?

Submitted by 旧时模样 on 2019-12-20 03:01:27
Question: I wasn't sure why the logical plan wasn't correctly evaluated in this example. I looked more deeply into the Flink code base and checked what happens when Calcite evaluates/estimates the number of rows for the query. For some reason it always returns 100 for any table source. In Flink, in fact, during program plan creation the VolcanoPlanner class is invoked by TableEnvironment.runVolcanoPlanner for each transformation rule. The planner tries to optimise and calculate …
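As a hedged illustration of where such an estimate can be read off (not the poster's code): Calcite exposes row-count estimates through its metadata query API, and when the underlying table provides no statistics, Calcite's RelOptAbstractTable.getRowCount() falls back to a default of 100, which matches the constant the question observes. The sketch below assumes a reasonably recent Calcite where the metadata query is obtained from the cluster.

import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.metadata.RelMetadataQuery;

public final class RowCountSketch {
    private RowCountSketch() {}

    // Ask the planner's metadata system for the estimated row count of a node.
    // With no table statistics available, scans typically report the default of 100.
    public static double estimatedRows(RelNode node) {
        RelMetadataQuery mq = node.getCluster().getMetadataQuery();
        Double rows = mq.getRowCount(node);
        return rows == null ? Double.NaN : rows;
    }
}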

Ordering of Records in Stream

Submitted by 梦想的初衷 on 2019-12-19 11:50:31
Question: Here are some of the queries I have: I have two different streams, stream1 and stream2, in which the elements are in order. 1) Now, when I do keyBy on each of these streams, will the order be maintained? (Since every group here will be sent to only one task manager.) My understanding is that the records will be in order within a group; correct me here. 2) After the keyBy on both of the streams, I am doing a co-group to get the matching and non-matching records. Will the order be maintained here as well?
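A sketch of the keyBy-plus-co-group shape described above, with a hypothetical Event type (the question does not show its element type). Within one key and one input, elements reach the function in the order they arrived at that parallel instance; there is no global order across keys, and the event-time window below additionally assumes timestamps and watermarks are assigned upstream.

import org.apache.flink.api.common.functions.CoGroupFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class CoGroupSketch {

    // Hypothetical event type; the real element type is not shown in the question.
    public static class Event {
        public String id;
        public long ts;
        public Event() {}
    }

    public static DataStream<Event> matchAndNonMatch(DataStream<Event> stream1, DataStream<Event> stream2) {
        return stream1
                .coGroup(stream2)
                .where(new KeySelector<Event, String>() {
                    @Override
                    public String getKey(Event e) { return e.id; }
                })
                .equalTo(new KeySelector<Event, String>() {
                    @Override
                    public String getKey(Event e) { return e.id; }
                })
                .window(TumblingEventTimeWindows.of(Time.seconds(10)))
                .apply(new CoGroupFunction<Event, Event, Event>() {
                    @Override
                    public void coGroup(Iterable<Event> left, Iterable<Event> right, Collector<Event> out) {
                        // Either side may be empty: that is how non-matching keys show up.
                        for (Event e : left) { out.collect(e); }
                        for (Event e : right) { out.collect(e); }
                    }
                });
    }
}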

Apache Flink: Count window with timeout

Submitted by 雨燕双飞 on 2019-12-19 08:26:13
Question: Here is a simple code example to illustrate my question:

case class Record( key: String, value: Int )

object Job extends App {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  val data = env.fromElements(
    Record("01",1), Record("02",2), Record("03",3), Record("04",4), Record("05",5) )
  val step1 = data.filter( record => record.value % 3 != 0 ) // introduces some data loss
  val step2 = data.map( r => Record( r.key, r.value * 2 ) )
  val step3 = data.map( r => Record( r.key, r.value * 3 …
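One common shape for "count window with timeout", sketched here in Java with assumed names rather than taken from the thread: keep a time or global window but attach a custom trigger that fires either when a target element count is reached or when a processing-time timeout elapses, whichever comes first. Cancelling the timer after an early count-based fire is omitted for brevity.

import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.common.state.ReducingState;
import org.apache.flink.api.common.state.ReducingStateDescriptor;
import org.apache.flink.api.common.typeutils.base.LongSerializer;
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.Window;

public class CountWithTimeoutTrigger<T, W extends Window> extends Trigger<T, W> {

    private final long maxCount;
    private final long timeoutMs;

    private final ReducingStateDescriptor<Long> countDesc =
            new ReducingStateDescriptor<>("count", new Sum(), LongSerializer.INSTANCE);

    public CountWithTimeoutTrigger(long maxCount, long timeoutMs) {
        this.maxCount = maxCount;
        this.timeoutMs = timeoutMs;
    }

    @Override
    public TriggerResult onElement(T element, long timestamp, W window, TriggerContext ctx) throws Exception {
        ReducingState<Long> count = ctx.getPartitionedState(countDesc);
        if (count.get() == null) {
            // First element of this window: arm the processing-time timeout.
            ctx.registerProcessingTimeTimer(ctx.getCurrentProcessingTime() + timeoutMs);
        }
        count.add(1L);
        if (count.get() >= maxCount) {
            count.clear();
            return TriggerResult.FIRE_AND_PURGE;
        }
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, W window, TriggerContext ctx) {
        // Timeout reached before the count was: flush whatever is buffered.
        return TriggerResult.FIRE_AND_PURGE;
    }

    @Override
    public TriggerResult onEventTime(long time, W window, TriggerContext ctx) {
        return TriggerResult.CONTINUE;
    }

    @Override
    public void clear(W window, TriggerContext ctx) throws Exception {
        ctx.getPartitionedState(countDesc).clear();
    }

    private static class Sum implements ReduceFunction<Long> {
        @Override
        public Long reduce(Long a, Long b) {
            return a + b;
        }
    }
}

It would be attached to a windowed stream with something like keyedStream.timeWindow(Time.minutes(1)).trigger(new CountWithTimeoutTrigger<>(100, 5000)).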

Apache Flink 0.10: how to get the first occurrence of a composite key from an unbounded input DataStream?

Submitted by 折月煮酒 on 2019-12-18 13:27:24
Question: I am a newbie with Apache Flink. I have an unbounded data stream as my input (fed into Flink 0.10 via Kafka). I want to get the first occurrence of each primary key (the primary key is the contract_num plus the event_dt). These "duplicates" occur nearly immediately after each other. The source system cannot filter this for me, so Flink has to do it. Here is my input data:

contract_num, event_dt, attr
A1, 2016-02-24 10:25:08, X
A1, 2016-02-24 10:25:08, Y
A1, 2016-02-24 10:25:09, Z
A2, 2016-02-24 10 …
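A sketch of the usual answer shape, written against the current keyed-state API rather than the 0.10 API the question targets (type and field names are assumed): key the stream by the composite key and keep one boolean per key that records whether the first occurrence has already been emitted.

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.util.Collector;

public class FirstOccurrenceSketch {

    // Hypothetical record type matching the question's fields.
    public static class Contract {
        public String contractNum;
        public String eventDt;
        public String attr;
        public Contract() {}
    }

    public static class FirstOccurrenceFilter extends RichFlatMapFunction<Contract, Contract> {
        private transient ValueState<Boolean> seen;

        @Override
        public void open(Configuration parameters) {
            seen = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("seen", Boolean.class));
        }

        @Override
        public void flatMap(Contract value, Collector<Contract> out) throws Exception {
            if (seen.value() == null) {
                seen.update(true);
                out.collect(value); // first (contract_num, event_dt) occurrence wins
            }
            // later duplicates for the same composite key are dropped
        }
    }

    public static DataStream<Contract> deduplicate(DataStream<Contract> input) {
        return input
                .keyBy(c -> c.contractNum + "|" + c.eventDt) // composite key
                .flatMap(new FirstOccurrenceFilter());
    }
}

Because the duplicates arrive almost immediately, the per-key state would in practice also be cleared via a timer or state TTL so it does not grow without bound.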

Input of apache_beam.examples.wordcount

Submitted by ℡╲_俬逩灬. on 2019-12-18 09:36:41
Question: I was trying to run the Beam Python SDK example, but I had a problem reading the input: https://cwiki.apache.org/confluence/display/BEAM/Usage+Guide#UsageGuide-RunaPython-SDKPipeline When I used gs://dataflow-samples/shakespeare/kinglear.txt as the input, the error was: apache_beam.io.filesystem.BeamIOError: Match operation failed with exceptions {'gs://dataflow-samples/shakespeare/kinglear.txt': TypeError("__init__() got an unexpected keyword argument 'response_encoding'",)} When I used my …
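For reference, the example is normally launched as the wordcount module with --input/--output flags; pointing --input at a local file is a quick way to separate pipeline problems from Google Cloud Storage access problems (the gs:// failure above happens inside the filesystem match step, before the pipeline logic runs). The paths below are placeholders.

# Run the bundled example against a local copy of the text (DirectRunner is the default).
python -m apache_beam.examples.wordcount \
    --input /path/to/kinglear.txt \
    --output /tmp/wordcount/results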

Error with Kerberos authentication when executing Flink example code on YARN cluster (Cloudera)

Submitted by 半世苍凉 on 2019-12-14 04:01:14
Question: I was trying to run the example code (the Flink examples WordCount.jar) with Flink on a YARN cluster, but I am getting the security authentication error below:

org.apache.flink.client.program.ProgramInvocationException: The program execution failed: Cannot initialize task 'DataSink (CsvOutputFormat (path: hdfs://10.94.146.126:8020/user/qawsbtch/flink_out, delimiter: ))': SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]

I am not sure where the issue is or what I am missing. I …
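The error suggests the client attempted SIMPLE authentication against a cluster that only accepts TOKEN/KERBEROS, i.e. the Flink side is missing Kerberos credentials. A hedged sketch of the relevant flink-conf.yaml keys (as documented for Flink 1.2 and later; paths, principal and realm are placeholders, and simply running kinit so the ticket cache is used is an alternative):

# flink-conf.yaml
security.kerberos.login.use-ticket-cache: true
security.kerberos.login.keytab: /path/to/user.keytab
security.kerberos.login.principal: user@EXAMPLE.COM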

Kafka consumer in Flink

Submitted by Deadly on 2019-12-14 03:59:04
Question: I am working with Kafka and Apache Flink. I am trying to consume records (which are in Avro format) from a Kafka topic in Apache Flink. Below is the piece of code I am trying, using a custom deserialiser to deserialise the Avro records from the topic. The Avro schema for the data I am sending to the topic "test-topic" is as below:

{
  "namespace": "com.example.flink.avro",
  "type": "record",
  "name": "UserInfo",
  "fields": [
    {"name": "name", "type": "string"}
  ]
}

The custom deserialiser I am using is …
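The question is cut off before the deserialiser itself. A minimal sketch of one, assuming UserInfo is the class generated from the Avro schema above and that records use plain Avro binary encoding with no schema registry, might look like this:

import java.io.IOException;

import org.apache.avro.io.DatumReader;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.common.typeinfo.TypeInformation;

import com.example.flink.avro.UserInfo;

public class UserInfoDeserializationSchema implements DeserializationSchema<UserInfo> {

    @Override
    public UserInfo deserialize(byte[] message) throws IOException {
        // Plain Avro binary decoding; a schema-registry setup would differ.
        DatumReader<UserInfo> reader = new SpecificDatumReader<>(UserInfo.class);
        Decoder decoder = DecoderFactory.get().binaryDecoder(message, null);
        return reader.read(null, decoder);
    }

    @Override
    public boolean isEndOfStream(UserInfo nextElement) {
        return false; // the topic is unbounded
    }

    @Override
    public TypeInformation<UserInfo> getProducedType() {
        return TypeInformation.of(UserInfo.class);
    }
}

The schema would then be passed to the version-appropriate FlinkKafkaConsumer constructor together with the topic name and consumer properties.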