udf

The Handiest Entrenching Tool: MaxCompute Studio, Come Take a Look!

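我的未来我决定 submitted on 2019-11-30 05:35:33
Abstract: At the Big Data Computing summit in Beijing, Xue Ming, a senior expert on the Alibaba Cloud computing platform, gave an in-depth introduction to MaxCompute Studio, the development tool for Alibaba's big data computing platform. As a one-stop IDE, it lets you quickly browse and manage data and do SQL- and UDF-based data development, and it also provides thorough job analysis and optimization assistance. This article walks through MaxCompute Studio from the basics to the details. About the speaker: Xue Ming, senior technical expert, MaxCompute. The following content is compiled from the speaker's video talk and slides. Slides download: https://yq.aliyun.com/download/2729 Video: https://edu.aliyun.com/lesson_1010_8793?spm=5176.10731542.0.0.ZRTXdt#_8793 Product page: https://www.aliyun.com/product/odps As the saying goes, a craftsman who wants to do good work must first sharpen his tools. To benefit from the features of the big data computing service (MaxCompute), data analysts or data developers usually have to dig out the value of the data, and in that process we need tools such as MaxCompute Studio and DataWorks. Put simply, whenever a project wants to use MaxCompute, the Alibaba MaxCompute team can provide it with suitable tools, and then, relying on the powerful computing capacity the MaxCompute team provides, bring out the value of the data in a sensible way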

BigQuery User Defined Aggregation Function?

我的未来我决定 submitted on 2019-11-30 04:23:35
Question: I know I can define a User Defined Function in order to perform some custom calculation. I also know I can use the 'out-of-the-box' aggregation functions to reduce a collection of values to a single value when using a GROUP BY clause. Is it possible to define a custom, user-defined aggregation function to use with a GROUP BY clause? Answer 1: Turns out that this IS possible (as long as the groups we seek to aggregate are of a reasonable size in memory) with a little bit of 'glue' - namely the

Simple Data Processing with Structured Streaming: Reading a CSV and Extracting Keywords from a Column

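╄→гoц情女王★ submitted on 2019-11-29 09:39:52
Preface: Recently I wanted to learn Structured Streaming, one of Spark's newer features. A round of Baidu searches turned up nothing but near-identical word-count examples, which was frustrating. So I had to find out for myself what processing is possible beyond the DataFrame select and filter operations. Since I use Python and have used Pandas, I first tried converting to Pandas for processing, but a DataFrame created with readStream does not support calling toPandas() directly. In the end, after digging through the official API, I found that DataFrame has another powerful operation that also works with readStream: the UDF.
Environment: Hadoop 2.8.5, Spark 2.4.3, Python 3.7.3, jieba (a Chinese word segmentation tool that provides a TF-IDF keyword extraction method; pip install jieba). All the code below is run in an interactive environment, i.e. in pyspark.
Data preparation: id | title_zh | content_zh | publish_date. Assume the CSV data has the columns shown above: article id, title, content, and publish time. The requirement is to extract the keywords from the title and add them as a new column. (Originally I also extracted keywords from the article body; the principle is the same, so I will not write it out.)
Reading the data: Reading a CSV file takes two steps: define the schema, then read the file according to the schema. Defining the schema: in this example, id is an Integer type and publish_date is a TimestampType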

Is there a way to measure string similarity in Google BigQuery

不羁的心 submitted on 2019-11-29 04:47:29
I'm wondering if anyone knows of a way to measure string similarity in BigQuery. Seems like it would be a neat function to have. My case is that I need to compare the similarity of two URLs, as I want to be fairly sure they refer to the same article. I can find examples using JavaScript, so maybe a UDF is the way to go, but I've not used UDFs at all (or JavaScript for that matter :) ). Just wondering if there may be a way using existing regex functions, or if anyone might be able to get me started with porting the JavaScript example into a UDF. Any help much appreciated, thanks. EDIT: Adding some example
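One common way to score how close two URLs are is a normalized edit distance. The block below is a minimal sketch of that logic in plain Python, not BigQuery code and not the JavaScript example the question refers to; the function names `levenshtein` and `similarity` are hypothetical, and choosing Levenshtein distance as the measure is an assumption. The same logic could in principle be ported into a UDF.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Score in [0, 1]: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest
```

A threshold on `similarity` (for example, treating two URLs as the same article above some cutoff) would have to be tuned on real data; the right cutoff is not something this sketch can suggest.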

How to Simplify SQL Statements: UDFs in Practice

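喜欢而已 submitted on 2019-11-29 03:14:52
UDFs (User Defined Functions) are a key feature of SQL environments. By writing UDFs, developers can easily plug in commonly used processing code and use it in queries. Apache Kylin supports persistent UDFs. Zhao Xingcheng of 华安保险 (Hua An Insurance) shares his experience with UDFs in Kylin here; follow along to find out more. Background: As a powerful OLAP engine, Apache Kylin has clear advantages in multidimensional analysis of massive data. It has been a great help in our work and has long been the main backend support of Hua An Insurance's OLAP system. In our system, metrics fall into two categories, base metrics and derived metrics: base metrics cannot be decomposed further and are generated directly by the business, while derived metrics are computed from base metrics through combined calculations. To save space and speed up builds, we have always kept only base metrics in Kylin's Cubes, while derived metrics are mostly computed by SQL in the application layer. This causes two problems: (1) the SQL is complex and hard to read, for example needing case when and syntax like (commamt+devamt+servamt)/grossprm as **; (2) it is error-prone: if a developer is unfamiliar with a metric's definition or misunderstands it, the computed result can easily be wrong. How to solve this? Anyone who has used Hive knows that Hive has a UDF (User Defined Function) feature that is very handy. Does Kylin have such a feature? Yes, but only three: all three classes are located in org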

MySQL binlog_format Modes and Configuration Explained

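烂漫一生 submitted on 2019-11-28 21:08:05
1. Binlog replication modes. MySQL replication has three main modes: statement-based replication (SBR), row-based replication (RBR), and mixed-based replication (MBR). Correspondingly, the binlog has three formats: STATEMENT, ROW, and MIXED. ① STATEMENT mode (SBR): Every SQL statement that modifies data is recorded in the binlog. During replication, the slave's SQL thread parses each entry back into the same SQL that was executed on the master and executes it again on the slave. Advantages: the statement level first of all avoids the drawback of the row level. It does not need to record the change to every row, which reduces the binlog volume, saves I/O, and improves performance, because it only needs to record the details of the statements executed on the master and the context information at execution time. Disadvantages: because it records the executed statements, for those statements to execute correctly on the slave it must also record some related information for each statement at execution time, i.e. the context information, to ensure that every statement produces the same result on the slave as on the master. Because MySQL evolves quickly, MySQL replication has faced considerable challenges: the more complex the content that replication involves, the more easily bugs appear. At the statement level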

Spark UDF for StructType / Row

杀马特。学长 韩版系。学妹 submitted on 2019-11-28 09:06:23
I have a "StructType" column in a Spark DataFrame that has an array and a string as sub-fields. I'd like to modify the array and return the new column of the same type. Can I process it with a UDF? Or what are the alternatives?
    import org.apache.spark.sql.types._
    import org.apache.spark.sql.Row
    val sub_schema = StructType(StructField("col1", ArrayType(IntegerType, false), true) :: StructField("col2", StringType, true) :: Nil)
    val schema = StructType(StructField("subtable", sub_schema, true) :: Nil)
    val data = Seq(Row(Row(Array(1, 2), "eb")), Row(Row(Array(3, 2, 1), "dsf")))
    val rd = sc.parallelize(data)
    val

UDF returns the same value everywhere

杀马特。学长 韩版系。学妹 submitted on 2019-11-28 01:34:41
I am trying to code a moving average in VBA, but the following returns the same value everywhere.
    Function trial1(a As Integer) As Variant
        Application.Volatile
        Dim rng As Range
        Set rng = Range(Cells(ActiveCell.Row, 2), Cells(ActiveCell.Row - a + 1, 2))
        trial1 = (Application.Sum(rng)) * (1 / a)
    End Function
The ActiveCell property does not belong in a UDF because it changes. Sometimes, it is not even on the same worksheet. If you need to refer to the cell in which the custom UDF function resides on the worksheet, use the Application.Caller property. The Range.Parent property can be used to
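The point about anchoring the range on the calling cell rather than on ActiveCell can be illustrated outside Excel. The block below is a minimal Python stand-in, not VBA and not the answer's own code: the hypothetical `row` parameter plays the role of the row of the cell that calls the function, and `trailing_average` and `column_b` are made-up names for illustration. Because each call receives its own row, each cell gets its own window instead of every cell sharing one global position.

```python
from typing import Optional, Sequence


def trailing_average(values: Sequence[float], row: int, window: int) -> Optional[float]:
    """Mean of the `window` values ending at index `row` (0-based).

    Returns None when there is not enough history above `row`
    or when `row` is outside the data.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    start = row - window + 1
    if start < 0 or row >= len(values):
        return None
    return sum(values[start:row + 1]) / window


# Each position passes its own row, so every result uses a different window.
column_b = [10.0, 20.0, 30.0, 40.0]
averages = [trailing_average(column_b, row, 2) for row in range(len(column_b))]
```

If every call used one shared position instead of its own `row`, all entries of `averages` would be identical, which is the symptom described in the question.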

Spark UDF for StructType / Row

有些话、适合烂在心里 submitted on 2019-11-27 02:09:38
Question: I have a "StructType" column in a Spark DataFrame that has an array and a string as sub-fields. I'd like to modify the array and return the new column of the same type. Can I process it with a UDF? Or what are the alternatives?
    import org.apache.spark.sql.types._
    import org.apache.spark.sql.Row
    val sub_schema = StructType(StructField("col1", ArrayType(IntegerType, false), true) :: StructField("col2", StringType, true) :: Nil)
    val schema = StructType(StructField("subtable", sub_schema, true) :: Nil)
    val

Is there a way to measure string similarity in Google BigQuery

倾然丶 夕夏残阳落幕 submitted on 2019-11-26 23:18:47
Question: I'm wondering if anyone knows of a way to measure string similarity in BigQuery. Seems like it would be a neat function to have. My case is that I need to compare the similarity of two URLs, as I want to be fairly sure they refer to the same article. I can find examples using JavaScript, so maybe a UDF is the way to go, but I've not used UDFs at all (or JavaScript for that matter :) ). Just wondering if there may be a way using existing regex functions, or if anyone might be able to get me started with