udf

Trying to turn a blob into multiple columns in Spark

风流意气都作罢 posted on 2019-12-06 05:16:26
I have a serialized blob and a function that converts it into a Java Map. I have registered the function as a UDF and tried to use it in Spark SQL as follows: sqlCtx.udf.register("blobToMap", Utils.blobToMap); val df = sqlCtx.sql(""" SELECT mp['c1'] as c1, mp['c2'] as c2 FROM (SELECT *, blobToMap(payload) AS mp FROM t1) a """). I do succeed in doing this, but for some reason the very heavy blobToMap function runs twice for every row, and in reality I extract 20 fields, so it runs 20 times for every row. I saw the suggestions in "Derive multiple columns from a single column in a Spark DataFrame" but…
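One way to guarantee the expensive conversion runs only once per row is to leave SQL for a typed transformation, which the optimizer cannot rewrite into repeated UDF calls. A minimal sketch, assuming `payload` is a binary column and that the question's `Utils.blobToMap` takes the raw bytes and returns a `Map[String, String]`:

```scala
import org.apache.spark.sql.SparkSession

case class Extracted(c1: String, c2: String)

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Run the heavy parser exactly once per row inside a typed map,
// then fan the parsed fields out into ordinary columns.
val extracted = spark.table("t1")
  .select("payload").as[Array[Byte]]
  .map { payload =>
    val mp = Utils.blobToMap(payload) // evaluated once per row
    Extracted(mp.getOrElse("c1", null), mp.getOrElse("c2", null))
  }
extracted.show()
```

The same idea works through an RDD on older Spark versions; the point is simply to keep the parse outside Catalyst's expression tree.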

Hive UDF for selecting all except some columns

别等时光非礼了梦想. posted on 2019-12-05 18:15:35
The common query-building pattern in HiveQL (and SQL in general) is to select either all columns (SELECT *) or an explicitly specified set of columns (SELECT A, B, C). SQL has no built-in mechanism for selecting all but a specified set of columns. There are various mechanisms for excluding some columns, as outlined in this SO question, but none apply naturally to HiveQL. (For example, the idea of creating a temporary table with SELECT * and then using ALTER TABLE to DROP some of its columns would wreak…
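Hive does ship an escape hatch for exactly this: with hive.support.quoted.identifiers=none, a backtick-quoted regular expression in the SELECT list expands to every matching column, and the possessive-group idiom `(excluded)?+.+` matches all columns except the named ones. A sketch of the same trick driven from Scala through Spark SQL, which exposes the regex-column syntax behind spark.sql.parser.quotedRegexColumnNames (Spark 2.3+); the table and column names here are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .config("spark.sql.parser.quotedRegexColumnNames", "true")
  .enableHiveSupport()
  .getOrCreate()

// Select every column of `events` except `payload` and `raw_blob`.
// The possessive group consumes the excluded names, so the trailing
// .+ cannot backtrack into matching them.
spark.sql("SELECT `(payload|raw_blob)?+.+` FROM events").show()
```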

MaxCompute New Feature Releases

血红的双手。 posted on 2019-12-05 06:46:25
In Q3 2018, MaxCompute shipped a series of major new features. This article gives an overview of the main new and enhanced capabilities: real-time interactive queries (Lightning on MaxCompute); ecosystem compatibility (Spark on MaxCompute); new SQL feature releases; Python UDFs opened to all users; general availability of the OSS external-table feature; Hash Clustering; and a storage upgrade to the zstd compression algorithm. Author: 云花. Original article link. This is original Yunqi Community content and may not be reproduced without permission. Source: oschina. Link: https://my.oschina.net/u/3552485/blog/2988601

Spark SQL: How to call UDF from DataFrame operation using JAVA

。_饼干妹妹 posted on 2019-12-04 18:34:32
I would like to know how to call a UDF from a function of the domain-specific language (DSL) in Spark SQL using Java. I have a UDF (just as an example): UDF2 equals = new UDF2<String, String, Boolean>() { @Override public Boolean call(String first, String second) throws Exception { return first.equals(second); } }; I've registered it with the sqlContext: sqlContext.udf().register("equals", equals, DataTypes.BooleanType); When I run the following query, my UDF is called and I get a result: sqlContext.sql("SELECT p0.value FROM values p0 WHERE equals(p0.value, 'someString')"); I would transform this…
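The DSL equivalent of that WHERE clause is functions.callUDF, which invokes a UDF registered by name. A minimal sketch in Scala (the same static callUDF/col/lit helpers are importable from org.apache.spark.sql.functions in Java); df stands for the DataFrame behind the `values` temp table:

```scala
import org.apache.spark.sql.functions.{callUDF, col, lit}

// Call the registered "equals" UDF from the DataFrame DSL instead of SQL.
val result = df
  .filter(callUDF("equals", col("value"), lit("someString")))
  .select(col("value"))
result.show()
```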

Define spark udf by reflection on a String

末鹿安然 posted on 2019-12-04 11:12:09
I am trying to define a UDF in Spark (2.0) from a string containing a Scala function definition. Here is the snippet: val universe: scala.reflect.runtime.universe.type = scala.reflect.runtime.universe; import universe._; import scala.reflect.runtime.currentMirror; import scala.tools.reflect.ToolBox; val toolbox = currentMirror.mkToolBox(); val f = udf(toolbox.eval(toolbox.parse("(s:String) => 5")).asInstanceOf[String => Int]); sc.parallelize(Seq("1","5")).toDF.select(f(col("value"))).show. This gives me an error: Caused by: java.lang.ClassCastException: cannot assign instance of scala.collection…
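The ClassCastException stems from casting a lambda that was compiled in the toolbox's own classloader. One commonly suggested workaround, shown here only as a sketch, is to let the toolbox compile the entire udf(...) expression, so the wrapper and the generated closure come out of the same runtime-compiled code; whether this holds up on a real cluster still depends on executors being able to load the generated class:

```scala
import scala.reflect.runtime.currentMirror
import scala.tools.reflect.ToolBox
import org.apache.spark.sql.expressions.UserDefinedFunction

val toolbox = currentMirror.mkToolBox()

// Compile the whole udf(...) call inside the toolbox rather than casting
// the runtime-generated lambda to String => Int outside of it.
val code =
  """import org.apache.spark.sql.functions.udf
    |udf((s: String) => 5)
    |""".stripMargin
val f = toolbox.eval(toolbox.parse(code)).asInstanceOf[UserDefinedFunction]
```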

MaxCompute Q&A Roundup: October

假如想象 posted on 2019-12-04 08:35:52
This article is based on my own progress in learning the MaxCompute product, combined with questions from the developer community; I hope it is helpful.

Question 1: In DataStudio, can a Shell node invoke MaxCompute SQL statements? No. Shell nodes support standard shell syntax, not interactive syntax. If you have many tasks, use an ODPS SQL node to run them. For more on DataStudio, see the official documentation: https://help.aliyun.com/document_detail/74423.html

Question 2: Does MaxCompute support changing the data type of a table column? No. You can only add columns; production tables do not allow dropping columns or modifying columns or partition columns. If you must change one, drop and recreate the table; you can create it as an external table so that after the table is dropped and rebuilt, the data can be loaded back. For data types, see the official documentation: https://help.aliyun.com/document_detail/27821.html

Question 3: Other than UDFs, is there a way in MaxCompute to merge two tables with no relationship to each other into one table? For a vertical merge, use UNION ALL; for a horizontal merge, use ROW_NUMBER: add a new ID column to each table, join on the ID, and then take the columns from both tables (see the sketch below).

Question 4: If an existing account's AccessKey is disabled and a new AccessKey is created, will periodic tasks created under the old AccessKey be affected? Yes…
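A sketch of Question 3's horizontal merge in Spark-flavoured Scala, using the same ROW_NUMBER idea the answer describes; dfA and dfB are hypothetical stand-ins for the two unrelated tables:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, row_number}

// Stamp each table with a synthetic row ID, then join on it to put the
// two unrelated tables side by side. Ordering by a constant pulls all
// rows through one partition, so this sketch suits modest table sizes.
val w = Window.orderBy(lit(1))
val left   = dfA.withColumn("rid", row_number().over(w))
val right  = dfB.withColumn("rid", row_number().over(w))
val merged = left.join(right, Seq("rid")).drop("rid")
```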

Making API call as part of UDF in BigQuery - possible?

ぃ、小莉子 posted on 2019-12-04 07:49:36
I'm wondering whether it would be possible to make an API call to the Google Maps Geocoding API from within a UDF in BigQuery. I have Google Analytics geo fields such as { "geoNetwork_continent": "Europe", "geoNetwork_subContinent": "Eastern Europe", "geoNetwork_country": "Russia", "geoNetwork_region": "Novosibirsk Oblast", "geoNetwork_metro": "(not set)" }, and would like to make calls to: https://maps.googleapis.com/maps/api/geocode/json?address=1600+Amphitheatre+Parkway,+Mountain+View,+CA&key=XXXX Just wondering if I'd be able to use JavaScript within the UDF to make an API call for each row in…
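BigQuery's JavaScript UDFs run in a sandbox with no network access, so the geocoding request has to happen outside the query and its results be loaded back in. A sketch of that outside step in Scala, using the URL from the question (the key stays the placeholder XXXX):

```scala
import java.net.URLEncoder
import scala.io.Source

// Geocode one address via the Maps Geocoding API; returns the raw JSON,
// which can then be loaded into a table and joined against the GA data.
def geocode(address: String, key: String): String = {
  val url = "https://maps.googleapis.com/maps/api/geocode/json" +
    s"?address=${URLEncoder.encode(address, "UTF-8")}&key=$key"
  val src = Source.fromURL(url)
  try src.mkString finally src.close()
}

// val json = geocode("1600 Amphitheatre Parkway, Mountain View, CA", "XXXX")
```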

Hive Explained

淺唱寂寞╮ posted on 2019-12-04 07:04:51
1. Hive basics. 1.1 Introduction to Hive. 1.1.1 What is Hive: Hive is a data-warehouse tool built on Hadoop; it maps structured data files onto database tables and provides SQL-like querying. 1.1.2 Why use Hive: Problems with using Hadoop directly: the learning cost for staff is too high, project timelines are too short, and implementing complex query logic in MapReduce is too difficult. Why Hive: its interface uses SQL-like syntax, enabling rapid development; it avoids writing MapReduce, reducing developers' learning cost; and it is easy to extend. 1.1.3 Hive's characteristics: Scalability: Hive clusters can be scaled freely, generally without restarting services. Extensibility: Hive supports user-defined functions, so users can implement functions for their own needs (a minimal UDF sketch follows below). Fault tolerance: good fault tolerance; SQL can still finish executing when a node fails. 1.2 Hive architecture. 1.2.1 Architecture diagram: JobTracker is a Hadoop 1.x component whose role corresponds to ResourceManager + AppMaster; TaskTracker corresponds to NodeManager + YarnChild. 1.2.2 Basic components: user interfaces (CLI, JDBC/ODBC, WebGUI); the metadata store, usually a relational database such as MySQL or Derby; and the interpreter, compiler, optimizer, and executor. There are three main user interfaces: the CLI, JDBC…
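Since the extensibility point above is what the rest of this page revolves around, here is a minimal sketch of a Hive UDF, written in Scala against the classic org.apache.hadoop.hive.ql.exec.UDF API (assumes hive-exec on the classpath; the class and function names are hypothetical):

```scala
import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.hadoop.io.Text

// Minimal Hive UDF: upper-cases a string column. Hive finds the
// evaluate method by reflection at query time.
class UpperUdf extends UDF {
  def evaluate(s: Text): Text =
    if (s == null) null else new Text(s.toString.toUpperCase)
}
```

Once packaged, it would be wired up in Hive with ADD JAR and CREATE TEMPORARY FUNCTION my_upper AS 'UpperUdf'.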