Apache Flink | 易学教程

A Look at Flink's Queryable State

Preface: This article takes a look at Flink's Queryable State. Example job: @Test public void testValueStateForQuery() throws Exception { StreamExecutionEnvironment env = StreamExecutionEnvironment .createRemoteEnvironment("192.168.99.100", 8081, SubmitTest.JAR_FILE); env.addSource(new RandomTuple2Source()) .keyBy(0) //key by first value of tuple .flatMap(new CountWindowAverage()) .print(); JobExecutionResult result = env.execute("testQueryableState"); LOGGER.info("submit job result:{}",result); } This runs a job that keys the stream by the first field of the tuple and then applies CountWindowAverage in the flatMap operation. CountWindowAverage: public class CountWindowAverage extends

Flink SQL Features Demystified: A Case Study of Retraction in Stream Computing

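Summary: Put simply, retraction corresponds to the update operation in traditional data processing; in other words, retraction is how updates to data are handled in stream computing. What is retraction? Consider first a word-frequency count in a streaming scenario. Without retraction, the final result is incorrect. The role retraction plays: below are two business cases from the Double 11 period in which retraction ensured data correctness. Case 1: Cainiao logistics order statistics. While the goods of a single order are in transit, the logistics company may, for various reasons, change from A to B. To count the number of orders handled by each logistics company, the Cainiao team used the retraction mechanism of Blink to aggregate under a changing key. -- TT source_table data: order_id tms_company 0001 ZTO 0002 ZTO 0003 YTO -- SQL code create view dwd_table as select order_id, StringLast(tms_company) from source_table group by order_id; create view dws_table as select tms_company, count(distinct order_id) as order_cnt from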
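The changing-key aggregation described in this excerpt can be illustrated with a small plain-Java sketch. It does not use Flink or Blink; it only imitates the two grouping steps with in-memory maps. The class name RetractionSketch, its methods, and the final update that moves order 0001 to the second carrier are assumptions made for illustration (the excerpt lists only the three initial rows), and the carrier names are written as ZTO and YTO.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, library-free sketch of retraction; not Flink/Blink code.
final class RetractionSketch {
    // Latest carrier per order: stands in for the first "group by order_id" view.
    private final Map<String, String> lastCarrierByOrder = new HashMap<>();
    // Order count per carrier: stands in for the second "group by tms_company" view.
    private final Map<String, Integer> ordersByCarrier = new TreeMap<>();
    private final boolean retract;

    RetractionSketch(boolean retract) {
        this.retract = retract;
    }

    void onRecord(String orderId, String carrier) {
        String previous = lastCarrierByOrder.put(orderId, carrier);
        if (carrier.equals(previous)) {
            return; // carrier unchanged, nothing to emit downstream
        }
        if (previous != null && retract) {
            // Retraction: withdraw the contribution made under the old key.
            ordersByCarrier.merge(previous, -1, Integer::sum);
        }
        // Accumulate the contribution under the new key.
        ordersByCarrier.merge(carrier, 1, Integer::sum);
    }

    Map<String, Integer> counts() {
        return ordersByCarrier;
    }

    static Map<String, Integer> replay(boolean retract) {
        RetractionSketch sketch = new RetractionSketch(retract);
        sketch.onRecord("0001", "ZTO");
        sketch.onRecord("0002", "ZTO");
        sketch.onRecord("0003", "YTO");
        sketch.onRecord("0001", "YTO"); // assumed update: order 0001 changes carrier
        return sketch.counts();
    }

    public static void main(String[] args) {
        System.out.println("with retraction:    " + replay(true));  // {YTO=2, ZTO=1}
        System.out.println("without retraction: " + replay(false)); // {YTO=2, ZTO=2}
    }
}
```

Without the retraction step, order 0001 is counted under both carriers, so the totals add up to four orders although only three exist.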

A Look at Flink's ActorGateway

Preface: This article takes a look at Flink's ActorGateway. ActorGateway: flink-1.7.2/flink-runtime/src/main/java/org/apache/flink/runtime/instance/ActorGateway.java public interface ActorGateway extends Serializable { /** * Sends a message asynchronously and returns its response. The response to the message is * returned as a future. * * @param message Message to be sent * @param timeout Timeout until the Future is completed with an AskTimeoutException * @return Future which contains the response to the sent message */ Future<Object> ask(Object message, FiniteDuration timeout); /** * Sends a message asynchronously without a
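The ask contract quoted in this excerpt (send a message, get the reply as a future, fail the future on timeout) can be sketched with the JDK alone. This is not the Flink interface: the sketch swaps in java.util.concurrent.CompletableFuture and TimeoutException for the Scala Future and AskTimeoutException named in the Javadoc, and the AskPatternSketch class and its receiver functions are hypothetical. It needs Java 9 or later for orTimeout.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

// Hypothetical sketch of the ask pattern using only JDK classes.
final class AskPatternSketch {
    // Hands the message to the receiver and returns its reply as a future that
    // completes exceptionally with TimeoutException if no reply arrives in time.
    static CompletableFuture<Object> ask(
            Function<Object, CompletableFuture<Object>> receiver,
            Object message,
            long timeoutMillis) {
        return receiver.apply(message).orTimeout(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    // Waits for the future and describes what happened.
    static String outcome(CompletableFuture<Object> reply) {
        try {
            return "reply: " + reply.join();
        } catch (CompletionException e) {
            return e.getCause() instanceof TimeoutException
                    ? "timed out"
                    : "failed: " + e.getCause();
        }
    }

    public static void main(String[] args) {
        // A receiver that answers immediately.
        Function<Object, CompletableFuture<Object>> echo =
                msg -> CompletableFuture.completedFuture("echo " + msg);
        // A receiver that never answers.
        Function<Object, CompletableFuture<Object>> silent =
                msg -> new CompletableFuture<>();

        System.out.println(outcome(ask(echo, "ping", 100)));   // reply: echo ping
        System.out.println(outcome(ask(silent, "ping", 100))); // timed out
    }
}
```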

Open Source | Alink, the World's First Unified Batch and Stream Machine Learning Platform

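Background: With the arrival of the big-data era and the rise of artificial intelligence, the scenarios that machine learning can handle have become broader and more varied. Models need to process batch data; to meet real-time requirements they must also make real-time predictions directly on streaming data; and they must be usable in enterprise applications and microservices. To achieve better business results, algorithm engineers need to try more, and more complex, models and to handle larger datasets, so using distributed clusters has become the norm. To react promptly to market changes, more and more businesses are adopting online learning to process streaming data directly and update models in real time. Our team has long worked on algorithm platforms and has seen how much efficient algorithm components and an easy-to-use platform help developers. In response to the broad and varied machine learning scenarios now emerging, in 2017 we began developing a new-generation machine learning algorithm platform based on Flink, so that data analysts and application developers can easily build end-to-end business workflows. The project is named Alink, taken from the letters shared by the related names (Alibaba, Algorithm, AI, Flink, Blink). What is Alink? Alink is a new-generation machine learning algorithm platform that the PAI team of Alibaba's Computing Platform Business Unit has been developing since 2017 on top of the real-time computing engine Flink. It provides a rich library of algorithm components and a convenient operating framework, with which developers can set up, in one step, the full algorithm model development workflow covering data processing, feature engineering, model training, and model prediction. Building on Flink's strengths in unified batch and stream processing, Alink offers consistent operations for batch and stream jobs. In the course of practice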

A Look at Flink's Managed Keyed State

Preface: This article takes a look at Flink's Managed Keyed State. State: flink-core-1.7.0-sources.jar!/org/apache/flink/api/common/state/State.java /** * Interface that different types of partitioned state must implement. * * <p>The state is only accessible by functions applied on a {@code KeyedStream}. The key is * automatically supplied by the system, so the function always sees the value mapped to the * key of the current element. That way, the system can handle stream and state partitioning * consistently together. */ @PublicEvolving public interface State { /** * Removes the value mapped under the current key. */ void
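The Javadoc quoted in this excerpt says the key is supplied by the system, so a function only ever sees the value for the current element's key. A minimal plain-Java sketch of that idea follows. It is not Flink's implementation: ScopedValue, setCurrentKey, addToSum, and process are hypothetical names, and a HashMap stands in for the state backend.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of key-scoped state; not Flink's State or ValueState.
final class KeyedStateSketch {
    // One logical value per key, but callers only see the current key's value.
    static final class ScopedValue<K, V> {
        private final Map<K, V> valuesByKey = new HashMap<>();
        private K currentKey;

        // Called by the "runtime" before each element is processed.
        void setCurrentKey(K key) {
            this.currentKey = key;
        }

        V value() {
            return valuesByKey.get(currentKey);
        }

        void update(V newValue) {
            valuesByKey.put(currentKey, newValue);
        }

        // Removes the value mapped under the current key only.
        void clear() {
            valuesByKey.remove(currentKey);
        }
    }

    // The "user function": it never mentions the key.
    static void addToSum(ScopedValue<String, Long> sum, long amount) {
        Long current = sum.value();
        sum.update(current == null ? amount : current + amount);
    }

    // Plays the runtime: selects the key of each element, then calls the function.
    static ScopedValue<String, Long> process(String[] keys, long[] amounts) {
        ScopedValue<String, Long> sum = new ScopedValue<>();
        for (int i = 0; i < keys.length; i++) {
            sum.setCurrentKey(keys[i]);
            addToSum(sum, amounts[i]);
        }
        return sum;
    }

    public static void main(String[] args) {
        ScopedValue<String, Long> sum =
                process(new String[] {"a", "b", "a"}, new long[] {1, 10, 2});
        sum.setCurrentKey("a");
        System.out.println("a -> " + sum.value());             // a -> 3
        sum.clear();
        System.out.println("a after clear -> " + sum.value()); // a after clear -> null
        sum.setCurrentKey("b");
        System.out.println("b -> " + sum.value());             // b -> 10
    }
}
```

Clearing under key "a" leaves the value under key "b" untouched, which is the scoping behavior the Javadoc describes.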

A Look at Flink's TableFactory

Preface: This article takes a look at Flink's TableFactory. Example: class MySystemTableSourceFactory implements StreamTableSourceFactory<Row> { @Override public Map<String, String> requiredContext() { Map<String, String> context = new HashMap<>(); context.put("update-mode", "append"); context.put("connector.type", "my-system"); return context; } @Override public List<String> supportedProperties() { List<String> list = new ArrayList<>(); list.add("connector.debug"); return list; } @Override public StreamTableSource<Row> createStreamTableSource(Map<String, String> properties) { boolean isDebug = Boolean.valueOf(properties.get(

Learning Flink from 0 to 1: An Introduction to Data Sources

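Preface: What are Data Sources? The name says it: the sources of data. As a stream computing framework, Flink can be used for batch processing, that is, processing static or historical datasets; it can also be used for stream processing, that is, processing real-time data streams as they arrive and producing result streams in real time. As long as data keeps arriving, Flink can keep computing, and Data Sources are where that data comes from. In Flink you can use StreamExecutionEnvironment.addSource(sourceFunction) to add a data source to your program. Flink already ships with a number of ready-made source functions; you can also write a custom non-parallel source by implementing SourceFunction, or a custom parallel source by implementing the ParallelSourceFunction interface or extending RichParallelSourceFunction. The following predefined stream sources are available on StreamExecutionEnvironment; broadly, they fall into these categories. Collection-based: 1. fromCollection(Collection) - creates a data stream from a Java java.util.Collection. All elements in the collection must be of the same type. 2. fromCollection(Iterator,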

A Look at Flink's StateDescriptor

Preface: This article takes a look at Flink's StateDescriptor. RuntimeContext.getState: flink-core-1.7.0-sources.jar!/org/apache/flink/api/common/functions/RuntimeContext.java /** * A RuntimeContext contains information about the context in which functions are executed. Each parallel instance * of the function will have a context through which it can access static contextual information (such as * the current parallelism) and other constructs like accumulators and broadcast variables. * * <p>A function can, during runtime, obtain the RuntimeContext via a call to * {@link AbstractRichFunction#getRuntimeContext()}. */ @Public

A Look at Flink's AbstractTtlState

Preface: This article takes a look at Flink's AbstractTtlState. InternalKvState: flink-runtime_2.11-1.7.0-sources.jar!/org/apache/flink/runtime/state/internal/InternalKvState.java /** * The {@code InternalKvState} is the root of the internal state type hierarchy, similar to the * {@link State} being the root of the public API state hierarchy. * * <p>The internal state classes give access to the namespace getters and setters and access to * additional functionality, like raw value access or state merging. * * <p>The public API state hierarchy is intended to be programmed against by Flink applications. * The internal state
