udf

Trying to turn a blob into multiple columns in Spark

风流意气都作罢 posted on 2019-12-06 05:16:26
I have a serialized blob and a function that converts it into a Java Map. I have registered the function as a UDF and tried to use it in Spark SQL as follows: sqlCtx.udf.register("blobToMap", Utils.blobToMap); val df = sqlCtx.sql(""" SELECT mp['c1'] as c1, mp['c2'] as c2 FROM (SELECT *, blobToMap(payload) AS mp FROM t1) a """). I do succeed in doing this, but for some reason the very heavy blobToMap function runs twice for every row, and in reality I extract 20 fields, so it runs 20 times for every row. I saw the suggestions in "Derive multiple columns from a single column in a Spark DataFrame" but…
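One way to guarantee the expensive conversion runs only once per row is to leave SQL for a typed transformation, which the optimizer cannot rewrite into repeated UDF calls. A minimal sketch, assuming `payload` is a binary column and that the question's `Utils.blobToMap` takes the raw bytes and returns a `Map[String, String]`:

```scala
import org.apache.spark.sql.SparkSession

case class Extracted(c1: String, c2: String)

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Run the heavy parser exactly once per row inside a typed map,
// then fan the parsed fields out into ordinary columns.
val extracted = spark.table("t1")
  .select("payload").as[Array[Byte]]
  .map { payload =>
    val mp = Utils.blobToMap(payload) // evaluated once per row
    Extracted(mp.getOrElse("c1", null), mp.getOrElse("c2", null))
  }
extracted.show()
```

The same idea works through an RDD on older Spark versions; the point is simply to keep the parse outside Catalyst's expression tree.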

Hive UDF for selecting all except some columns

别等时光非礼了梦想. posted on 2019-12-05 18:15:35
The common query-building pattern in HiveQL (and SQL in general) is to select either all columns (SELECT *) or an explicitly specified set of columns (SELECT A, B, C). SQL has no built-in mechanism for selecting all but a specified set of columns. There are various mechanisms for excluding some columns, as outlined in this SO question, but none apply naturally to HiveQL. (For example, the idea of creating a temporary table with SELECT * and then using ALTER TABLE to DROP some of its columns would wreak…
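Hive does ship an escape hatch for exactly this: with hive.support.quoted.identifiers=none, a backtick-quoted regular expression in the SELECT list expands to every matching column, and the possessive-group idiom `(excluded)?+.+` matches all columns except the named ones. A sketch of the same trick driven from Scala through Spark SQL, which exposes the regex-column syntax behind spark.sql.parser.quotedRegexColumnNames (Spark 2.3+); the table and column names here are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .config("spark.sql.parser.quotedRegexColumnNames", "true")
  .enableHiveSupport()
  .getOrCreate()

// Select every column of `events` except `payload` and `raw_blob`.
// The possessive group consumes the excluded names, so the trailing
// .+ cannot backtrack into matching them.
spark.sql("SELECT `(payload|raw_blob)?+.+` FROM events").show()
```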

MaxCompute New Feature Releases

血红的双手。 posted on 2019-12-05 06:46:25
In Q3 2018, MaxCompute shipped a series of major new features. This article gives an overview of the main new and enhanced capabilities: real-time interactive queries (Lightning on MaxCompute); ecosystem compatibility (Spark on MaxCompute); new SQL feature releases; Python UDFs opened to all users; general availability of the OSS external-table feature; Hash Clustering; and a storage upgrade to the zstd compression algorithm. Author: 云花. Original article link. This is original Yunqi Community content and may not be reproduced without permission. Source: oschina. Link: https://my.oschina.net/u/3552485/blog/2988601

Spark SQL: How to call UDF from DataFrame operation using JAVA

。_饼干妹妹 posted on 2019-12-04 18:34:32
I would like to know how to call a UDF from a function of the domain-specific language (DSL) in Spark SQL using Java. I have a UDF (just as an example): UDF2 equals = new UDF2<String, String, Boolean>() { @Override public Boolean call(String first, String second) throws Exception { return first.equals(second); } }; I've registered it with the sqlContext: sqlContext.udf().register("equals", equals, DataTypes.BooleanType); When I run the following query, my UDF is called and I get a result: sqlContext.sql("SELECT p0.value FROM values p0 WHERE equals(p0.value, 'someString')"); I would transform this…
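The DSL equivalent of that WHERE clause is functions.callUDF, which invokes a UDF registered by name. A minimal sketch in Scala (the same static callUDF/col/lit helpers are importable from org.apache.spark.sql.functions in Java); df stands for the DataFrame behind the `values` temp table:

```scala
import org.apache.spark.sql.functions.{callUDF, col, lit}

// Call the registered "equals" UDF from the DataFrame DSL instead of SQL.
val result = df
  .filter(callUDF("equals", col("value"), lit("someString")))
  .select(col("value"))
result.show()
```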

Define spark udf by reflection on a String

末鹿安然 posted on 2019-12-04 11:12:09
I am trying to define a UDF in Spark (2.0) from a string containing a Scala function definition. Here is the snippet: val universe: scala.reflect.runtime.universe.type = scala.reflect.runtime.universe; import universe._; import scala.reflect.runtime.currentMirror; import scala.tools.reflect.ToolBox; val toolbox = currentMirror.mkToolBox(); val f = udf(toolbox.eval(toolbox.parse("(s:String) => 5")).asInstanceOf[String => Int]); sc.parallelize(Seq("1","5")).toDF.select(f(col("value"))).show. This gives me an error: Caused by: java.lang.ClassCastException: cannot assign instance of scala.collection…
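The ClassCastException stems from casting a lambda that was compiled in the toolbox's own classloader. One commonly suggested workaround, shown here only as a sketch, is to let the toolbox compile the entire udf(...) expression, so the wrapper and the generated closure come out of the same runtime-compiled code; whether this holds up on a real cluster still depends on executors being able to load the generated class:

```scala
import scala.reflect.runtime.currentMirror
import scala.tools.reflect.ToolBox
import org.apache.spark.sql.expressions.UserDefinedFunction

val toolbox = currentMirror.mkToolBox()

// Compile the whole udf(...) call inside the toolbox rather than casting
// the runtime-generated lambda to String => Int outside of it.
val code =
  """import org.apache.spark.sql.functions.udf
    |udf((s: String) => 5)
    |""".stripMargin
val f = toolbox.eval(toolbox.parse(code)).asInstanceOf[UserDefinedFunction]
```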

MaxCompute Q&A Roundup: October

假如想象 posted on 2019-12-04 08:35:52
This article is based on my own progress in learning the MaxCompute product, combined with questions from the developer community; I hope it is helpful.

Question 1: In DataStudio, can a Shell node invoke MaxCompute SQL statements? No. Shell nodes support standard shell syntax, not interactive syntax. If you have many tasks, use an ODPS SQL node to run them. For more on DataStudio, see the official documentation: https://help.aliyun.com/document_detail/74423.html

Question 2: Does MaxCompute support changing the data type of a table column? No. You can only add columns; production tables do not allow dropping columns or modifying columns or partition columns. If you must change one, drop and recreate the table; you can create it as an external table so that after the table is dropped and rebuilt, the data can be loaded back. For data types, see the official documentation: https://help.aliyun.com/document_detail/27821.html

Question 3: Other than UDFs, is there a way in MaxCompute to merge two tables with no relationship to each other into one table? For a vertical merge, use UNION ALL; for a horizontal merge, use ROW_NUMBER: add a new ID column to each table, join on the ID, and then take the columns from both tables (see the sketch below).

Question 4: If an existing account's AccessKey is disabled and a new AccessKey is created, will periodic tasks created under the old AccessKey be affected? Yes…
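A sketch of Question 3's horizontal merge in Spark-flavoured Scala, using the same ROW_NUMBER idea the answer describes; dfA and dfB are hypothetical stand-ins for the two unrelated tables:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, row_number}

// Stamp each table with a synthetic row ID, then join on it to put the
// two unrelated tables side by side. Ordering by a constant pulls all
// rows through one partition, so this sketch suits modest table sizes.
val w = Window.orderBy(lit(1))
val left   = dfA.withColumn("rid", row_number().over(w))
val right  = dfB.withColumn("rid", row_number().over(w))
val merged = left.join(right, Seq("rid")).drop("rid")
```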

Making API call as part of UDF in BigQuery - possible?

ぃ、小莉子 posted on 2019-12-04 07:49:36
I'm wondering whether it would be possible to make an API call to the Google Maps Geocoding API from within a UDF in BigQuery. I have Google Analytics geo fields such as { "geoNetwork_continent": "Europe", "geoNetwork_subContinent": "Eastern Europe", "geoNetwork_country": "Russia", "geoNetwork_region": "Novosibirsk Oblast", "geoNetwork_metro": "(not set)" }, and would like to make calls to: https://maps.googleapis.com/maps/api/geocode/json?address=1600+Amphitheatre+Parkway,+Mountain+View,+CA&key=XXXX Just wondering if I'd be able to use JavaScript within the UDF to make an API call for each row in…
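BigQuery's JavaScript UDFs run in a sandbox with no network access, so the geocoding request has to happen outside the query and its results be loaded back in. A sketch of that outside step in Scala, using the URL from the question (the key stays the placeholder XXXX):

```scala
import java.net.URLEncoder
import scala.io.Source

// Geocode one address via the Maps Geocoding API; returns the raw JSON,
// which can then be loaded into a table and joined against the GA data.
def geocode(address: String, key: String): String = {
  val url = "https://maps.googleapis.com/maps/api/geocode/json" +
    s"?address=${URLEncoder.encode(address, "UTF-8")}&key=$key"
  val src = Source.fromURL(url)
  try src.mkString finally src.close()
}

// val json = geocode("1600 Amphitheatre Parkway, Mountain View, CA", "XXXX")
```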

Hive Explained

淺唱寂寞╮ posted on 2019-12-04 07:04:51
1. Hive basics. 1.1 Introduction to Hive. 1.1.1 What is Hive: Hive is a data-warehouse tool built on Hadoop; it maps structured data files onto database tables and provides SQL-like querying. 1.1.2 Why use Hive: Problems with using Hadoop directly: the learning cost for staff is too high, project timelines are too short, and implementing complex query logic in MapReduce is too difficult. Why Hive: its interface uses SQL-like syntax, enabling rapid development; it avoids writing MapReduce, reducing developers' learning cost; and it is easy to extend. 1.1.3 Hive's characteristics: Scalability: Hive clusters can be scaled freely, generally without restarting services. Extensibility: Hive supports user-defined functions, so users can implement functions for their own needs (a minimal UDF sketch follows below). Fault tolerance: good fault tolerance; SQL can still finish executing when a node fails. 1.2 Hive architecture. 1.2.1 Architecture diagram: JobTracker is a Hadoop 1.x component whose role corresponds to ResourceManager + AppMaster; TaskTracker corresponds to NodeManager + YarnChild. 1.2.2 Basic components: user interfaces (CLI, JDBC/ODBC, WebGUI); the metadata store, usually a relational database such as MySQL or Derby; and the interpreter, compiler, optimizer, and executor. There are three main user interfaces: the CLI, JDBC…
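Since the extensibility point above is what the rest of this page revolves around, here is a minimal sketch of a Hive UDF, written in Scala against the classic org.apache.hadoop.hive.ql.exec.UDF API (assumes hive-exec on the classpath; the class and function names are hypothetical):

```scala
import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.hadoop.io.Text

// Minimal Hive UDF: upper-cases a string column. Hive finds the
// evaluate method by reflection at query time.
class UpperUdf extends UDF {
  def evaluate(s: Text): Text =
    if (s == null) null else new Text(s.toString.toUpperCase)
}
```

Once packaged, it would be wired up in Hive with ADD JAR and CREATE TEMPORARY FUNCTION my_upper AS 'UpperUdf'.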