Apache Flink

A look at Flink DataStream's window coGroup operation

你。 Posted on 2019-12-03 08:06:18
Preface

This post mainly looks at the window coGroup operation of Flink's DataStream.

Example

    dataStream.coGroup(otherStream)
        .where(0).equalTo(1)
        .window(TumblingEventTimeWindows.of(Time.seconds(3)))
        .apply(new CoGroupFunction() {...});

This shows the basic usage of the window coGroup operation on a DataStream.

DataStream.coGroup

flink-streaming-java_2.11-1.7.0-sources.jar!/org/apache/flink/streaming/api/datastream/DataStream.java

    @Public
    public class DataStream<T> {
        //......

        public <T2> CoGroupedStreams<T, T2> coGroup(DataStream<T2> otherStream) {
            return new CoGroupedStreams<>(this, otherStream);
        }

        //......
    }

The coGroup operation of DataStream creates a CoGroupedStreams.

CoGroupedStreams

flink…
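For context, a minimal runnable sketch of a complete coGroup pipeline follows; the element types, keys, processing-time windows and print sink are illustrative assumptions (the excerpt above uses event-time windows, which additionally require timestamps and watermarks):

    import org.apache.flink.api.common.functions.CoGroupFunction;
    import org.apache.flink.api.java.functions.KeySelector;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;
    import org.apache.flink.util.Collector;

    public class CoGroupSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            DataStream<Tuple2<String, Integer>> left =
                    env.fromElements(Tuple2.of("a", 1), Tuple2.of("b", 2));
            DataStream<Tuple2<String, Integer>> right =
                    env.fromElements(Tuple2.of("a", 10), Tuple2.of("c", 30));

            // key both streams on the String field of the tuple
            KeySelector<Tuple2<String, Integer>, String> byKey =
                    new KeySelector<Tuple2<String, Integer>, String>() {
                        @Override
                        public String getKey(Tuple2<String, Integer> value) {
                            return value.f0;
                        }
                    };

            left.coGroup(right)
                .where(byKey)
                .equalTo(byKey)
                .window(TumblingProcessingTimeWindows.of(Time.seconds(3)))
                .apply(new CoGroupFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, String>() {
                    @Override
                    public void coGroup(Iterable<Tuple2<String, Integer>> first,
                                        Iterable<Tuple2<String, Integer>> second,
                                        Collector<String> out) {
                        // unlike join, coGroup also fires for keys present on only one side
                        out.collect(first + " | " + second);
                    }
                })
                .print();

            env.execute("coGroup sketch");
        }
    }

The call for key "c", which exists only in the right stream, is what distinguishes coGroup from a join: the function is invoked once per key and window even when one side is empty.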

A look at Flink's FencedAkkaInvocationHandler

北慕城南 Posted on 2019-12-03 08:05:51
Preface

This post mainly looks at Flink's FencedAkkaInvocationHandler.

FencedRpcGateway

flink-release-1.7.2/flink-runtime/src/main/java/org/apache/flink/runtime/rpc/FencedRpcGateway.java

    public interface FencedRpcGateway<F extends Serializable> extends RpcGateway {

        /**
         * Get the current fencing token.
         *
         * @return current fencing token
         */
        F getFencingToken();
    }

The FencedRpcGateway interface extends the RpcGateway interface; its type parameter F is the type of the fencing token.

FencedMainThreadExecutable

flink-release-1.7.2/flink-runtime/src/main/java/org/apache/flink/runtime/rpc/FencedMainThreadExecutable.java

    public interface FencedMainThreadExecutable extends…
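To make the fencing idea concrete, here is a small standalone sketch (purely illustrative, not Flink's actual implementation) of the pattern that FencedAkkaInvocationHandler applies to RPC messages: each call carries a fencing token, and calls whose token does not match the currently held one are rejected:

    import java.io.Serializable;
    import java.util.UUID;
    import java.util.concurrent.atomic.AtomicReference;

    // Hypothetical illustration of token fencing: callers must present the
    // token the component currently holds, otherwise the call is dropped.
    class FencedService<F extends Serializable> {
        private final AtomicReference<F> fencingToken = new AtomicReference<>();

        void setFencingToken(F token) { fencingToken.set(token); }
        F getFencingToken() { return fencingToken.get(); }

        void invoke(F callerToken, Runnable action) {
            F current = fencingToken.get();
            if (current != null && current.equals(callerToken)) {
                action.run(); // token matches: execute the call
            } else {
                System.out.println("fencing token mismatch, call dropped: " + callerToken);
            }
        }

        public static void main(String[] args) {
            FencedService<UUID> service = new FencedService<>();
            UUID leaderSession = UUID.randomUUID();
            service.setFencingToken(leaderSession);
            service.invoke(leaderSession, () -> System.out.println("accepted"));
            service.invoke(UUID.randomUUID(), () -> System.out.println("never runs"));
        }
    }

This is why fencing tokens matter in leader election: a message from a stale leader carries an outdated token and is silently discarded instead of corrupting the new leader's state.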

A look at the execute method of Flink's LocalEnvironment

喜欢而已 Posted on 2019-12-03 08:05:31
Preface

This post mainly looks at the execute method of Flink's LocalEnvironment.

Example

    final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    DataSet<RecordDto> csvInput = env.readCsvFile(csvFilePath)
            .pojoType(RecordDto.class, "playerName", "country", "year", "game",
                      "gold", "silver", "bronze", "total");

    DataSet<Tuple2<String, Integer>> groupedByCountry = csvInput
            .flatMap(new FlatMapFunction<RecordDto, Tuple2<String, Integer>>() {
                private static final long serialVersionUID = 1L;

                @Override
                public void flatMap(RecordDto record, Collector<Tuple2<String, Integer>> out) throws Exception {
                    out.collect(new Tuple2…
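The excerpt cuts off inside the flatMap body. A hedged guess at how such a pipeline typically continues, assuming the goal is to count records per country (the getCountry() accessor, the output path and the job name are assumptions, not from the excerpt):

    // inside flatMap: emit one (country, 1) pair per CSV record
    out.collect(new Tuple2<String, Integer>(record.getCountry(), 1));

    // after the flatMap: count per country, add a sink, then run the job
    DataSet<Tuple2<String, Integer>> counts = groupedByCountry
            .groupBy(0)   // group on the country field of the tuple
            .sum(1);      // add up the 1s, i.e. count records per country
    counts.writeAsText("/tmp/medal-counts");  // hypothetical output path
    env.execute("local-csv-count");           // on a LocalEnvironment this runs the job in-process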

A look at the initializeState method of Flink's StreamOperator

穿精又带淫゛_ Posted on 2019-12-02 07:42:55
Preface

This post mainly looks at the initializeState method of Flink's StreamOperator.

Task.run

flink-runtime_2.11-1.7.0-sources.jar!/org/apache/flink/runtime/taskmanager/Task.java

    public class Task implements Runnable, TaskActions, CheckpointListener {

        public void run() {
            // ----------------------------
            //  Initial State transition
            // ----------------------------
            while (true) {
                ExecutionState current = this.executionState;
                if (current == ExecutionState.CREATED) {
                    if (transitionState(ExecutionState.CREATED, ExecutionState.DEPLOYING)) {
                        // success, we can start our work
                        break;
                    }
                }
                else if (current == ExecutionState.FAILED) {
                    // we…
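The transitionState call is a compare-and-swap on the task's execution state: the transition only succeeds if no other thread changed the state in the meantime. A standalone sketch of the same pattern using Java's AtomicReference (the state names mirror Flink's ExecutionState; everything else is illustrative):

    import java.util.concurrent.atomic.AtomicReference;

    public class StateTransitionSketch {
        enum ExecutionState { CREATED, DEPLOYING, FAILED }

        private final AtomicReference<ExecutionState> executionState =
                new AtomicReference<>(ExecutionState.CREATED);

        // Analogue of Task.transitionState: only succeeds if the current
        // state is still the expected one (another thread may have changed it).
        private boolean transitionState(ExecutionState expected, ExecutionState next) {
            return executionState.compareAndSet(expected, next);
        }

        public void run() {
            while (true) {
                ExecutionState current = executionState.get();
                if (current == ExecutionState.CREATED) {
                    if (transitionState(ExecutionState.CREATED, ExecutionState.DEPLOYING)) {
                        break; // success, we can start our work
                    }
                    // CAS failed: another thread moved the state; loop and re-check
                } else if (current == ExecutionState.FAILED) {
                    return; // task failed or was cancelled before it could start
                }
            }
            System.out.println("deploying...");
        }

        public static void main(String[] args) {
            new StateTransitionSketch().run();
        }
    }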

A look at Flink's logback configuration

一曲冷凌霜 Posted on 2019-12-02 07:42:46
Preface

This post mainly looks at Flink's logback configuration.

Client-side pom configuration

    <dependencies>
        <!-- Add the two required logback dependencies -->
        <dependency>
            <groupId>ch.qos.logback</groupId>
            <artifactId>logback-core</artifactId>
            <version>1.2.3</version>
        </dependency>
        <dependency>
            <groupId>ch.qos.logback</groupId>
            <artifactId>logback-classic</artifactId>
            <version>1.2.3</version>
        </dependency>

        <!-- Add the log4j -> sfl4j (-> logback) bridge into the classpath.
             Hadoop is logging to log4j! -->
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>log4j-over-slf4j</artifactId>
            <version>1.7.15</version>
        </dependency>
        <dependency>…
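With those dependencies on the classpath, logback still needs a configuration file to define appenders. A minimal generic logback.xml sketch for a console logger (an illustration, not the configuration Flink ships with):

    <configuration>
        <appender name="console" class="ch.qos.logback.core.ConsoleAppender">
            <encoder>
                <pattern>%d{yyyy-MM-dd HH:mm:ss,SSS} %-5level %logger{60} - %msg%n</pattern>
            </encoder>
        </appender>
        <root level="INFO">
            <appender-ref ref="console"/>
        </root>
    </configuration>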

A look at setting the heap size of the Flink JobManager

99封情书 Posted on 2019-12-02 07:42:32
Preface

This post mainly looks at how the heap size of the Flink JobManager is set.

JobManagerOptions

flink-core-1.7.1-sources.jar!/org/apache/flink/configuration/JobManagerOptions.java

    @PublicEvolving
    public class JobManagerOptions {
        //......

        /**
         * JVM heap size for the JobManager with memory size.
         */
        @Documentation.CommonOption(position = Documentation.CommonOption.POSITION_MEMORY)
        public static final ConfigOption<String> JOB_MANAGER_HEAP_MEMORY =
            key("jobmanager.heap.size")
                .defaultValue("1024m")
                .withDescription("JVM heap size for the JobManager.");

        /**
         * JVM heap size (in megabytes) for the JobManager.
         * @deprecated use {@link #JOB…
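In practice the option is set in flink-conf.yaml; a minimal sketch (the 2g value is just an example):

    # flink-conf.yaml
    # overrides the 1024m default defined by JobManagerOptions.JOB_MANAGER_HEAP_MEMORY
    jobmanager.heap.size: 2g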

A look at the High Availability of the Flink JobManager

ぃ、小莉子 Posted on 2019-12-02 07:36:32
Preface

This post mainly looks at the High Availability setup of the Flink JobManager.

Configuration

flink-conf.yaml

    high-availability: zookeeper
    high-availability.zookeeper.quorum: zookeeper:2181
    high-availability.zookeeper.path.root: /flink
    high-availability.cluster-id: /cluster_one # important: customize per cluster
    high-availability.storageDir: file:///share

The valid values for high-availability are NONE and zookeeper. high-availability.zookeeper.quorum specifies the ZooKeeper peers; high-availability.zookeeper.path.root specifies the root node path in ZooKeeper; high-availability.cluster-id names the node for the current cluster, which lives under the root node; high-availability.storageDir specifies where the JobManager metadata is stored.

A look at Flink's AbstractNonHaServices

天涯浪子 Posted on 2019-12-02 07:36:10
Preface

This post mainly looks at Flink's AbstractNonHaServices.

HighAvailabilityServices

flink-runtime_2.11-1.7.1-sources.jar!/org/apache/flink/runtime/highavailability/HighAvailabilityServices.java

    public interface HighAvailabilityServices extends AutoCloseable {

        // ------------------------------------------------------------------------
        //  Constants
        // ------------------------------------------------------------------------

        /**
         * This UUID should be used when no proper leader election happens, but a simple
         * pre-configured leader is used. That is for example the case in non-highly-available
         * standalone setups.
         */
        …
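The constant this javadoc describes is a fixed leader session id used when no election takes place. A sketch of how a non-HA leader retrieval service might hand out such a pre-configured leader (illustrative only; the constant value and the listener shape are assumptions, not Flink's exact code):

    import java.util.UUID;

    // Illustrative sketch: in a non-HA setup there is no election, so a fixed,
    // well-known session id stands in for an elected leader's session.
    class PreConfiguredLeader {
        // mirrors the idea behind HighAvailabilityServices' default leader id (assumed value)
        static final UUID DEFAULT_LEADER_ID = new UUID(0, 0);

        private final String leaderAddress;

        PreConfiguredLeader(String leaderAddress) {
            this.leaderAddress = leaderAddress;
        }

        // every caller immediately "discovers" the same static leader
        void notifyListener(LeaderListener listener) {
            listener.notifyLeaderAddress(leaderAddress, DEFAULT_LEADER_ID);
        }

        interface LeaderListener {
            void notifyLeaderAddress(String address, UUID leaderSessionId);
        }

        public static void main(String[] args) {
            new PreConfiguredLeader("akka://flink@localhost:6123/user/jobmanager")
                    .notifyListener((address, id) ->
                            System.out.println("leader: " + address + ", session: " + id));
        }
    }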

A brief analysis of the differences between Spark Streaming, Flink and Storm

我的未来我决定 Posted on 2019-12-02 07:00:48
1. Introduction

These three compute frameworks are often compared against each other. From my point of view, the comparison splits into two camps (mini-batches vs. streaming): Spark Streaming is a micro-batch, pseudo-streaming, near-real-time framework (Spark itself is a batch framework), while Flink and Storm are typical real-time stream processing frameworks.

2. Spark vs. Flink

Although the two are close in many design ideas and have borrowed from each other, the main difference remains the choice between mini-batch and streaming. Weigh throughput against latency according to your actual scenario.

3. Flink vs. Storm

| Name  | Batch processing | Processing guarantee | API level | Fault-tolerance mechanism |
| ----- | ---------------- | -------------------- | --------- | ------------------------- |
| Storm | not supported | at least once (implemented with record-level acknowledgments); with Trident, Storm can provide exactly-once semantics | low | record-level acknowledgments |
| Flink | supported | exactly once (implemented with the Chandy-Lamport algorithm, i.e. marker checkpoints) | high | marker checkpoints |

4. Other resources

5. Summary

My write-up here is fairly brief. I strongly recommend the reference material listed below; all of it is well written.

References:
What is/are the main difference(s) between…

Big data processing engines: a comparative analysis of Spark and Flink!

冷暖自知 Posted on 2019-12-02 07:00:30
Big data technology is developing rapidly, spawning generation after generation of fast, convenient processing engines, from Hadoop and Storm to the later Spark and Flink. Still, no single framework can cover every application scenario, which also means no framework can fully replace another. Today, the Dasheng crowdsourcing platform ( www.dashengzb.cn ) compares Spark and Flink on several dimensions and explores the differences between the two engines.

I. Comparison and analysis of the main aspects of Spark and Flink

1. Performance comparison

Test environment:

CPU: 7,000 cores
Memory: 128 GB per machine
Versions: Hadoop 2.3.0, Spark 1.4, Flink 0.9
Data: 800 MB, 8 GB, 8 TB
Algorithm: K-means, which clusters around K center points in the space, assigns each object to its nearest center, and iteratively updates each cluster's center until the best clustering is reached
Iterations: K = 10, 3 data sets

Common ground: both Spark and Flink run on Hadoop YARN, and both deliver very good compute performance, because both can compute in memory for real-time workloads.

Differences: judging by the iteration chart (y-axis in seconds, x-axis in number of iterations; the chart is not included in this excerpt), performance comes out as Flink > Spark > Hadoop (MR), and the gap widens as the number of iterations grows. The reason Flink outperforms Spark and Hadoop…