user-defined-functions

How to use countDistinct in Scala with Spark?

冷暖自知 submitted on 2019-12-22 10:35:15
Question: I've tried to use the countDistinct function, which should be available in Spark 1.5 according to Databricks' blog. However, I got the following exception:

    Exception in thread "main" org.apache.spark.sql.AnalysisException: undefined function countDistinct;

On the Spark developers' mailing list I found the suggestion to use the count and distinct functions together to get the same result that countDistinct should produce:

    count(distinct <columnName>) // instead of countDistinct(<columnName>)

Because I build
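The excerpt cuts off, but a minimal sketch of that workaround may help; note this is PySpark on a modern Spark rather than the asker's Scala on 1.5, and the DataFrame df with its name column is invented for illustration:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a",), ("a",), ("b",)], ["name"])

    # Inside a SQL expression string, spell it count(distinct ...)
    df.selectExpr("count(distinct name) AS n").show()

    # The DataFrame API also exposes countDistinct directly
    df.select(F.countDistinct("name").alias("n")).show()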

How to add a permanent function in Hive?

…衆ロ難τιáo~ submitted on 2019-12-22 10:15:36
Question: Here is the problem: if I declare a temporary function in Hive like this:

    add jar /home/taobao/oplog/hivescript/my_udf.jar;
    create temporary function getContentValue as 'com.my.udf.GetContentValue';

it works fine with the function getContentValue in this Hive session. But what I want is to not have to add the jar and create the temporary function every time I start a Hive session; that is to say, to make the function permanent. Is there any solution to this problem?

Answer 1: As of 0.13.0 (HIVE-6047)
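The answer's reference to HIVE-6047 (Hive 0.13.0) points at the CREATE FUNCTION ... USING JAR DDL, which registers the function in the metastore so new sessions see it. A hedged sketch of issuing that DDL from PySpark against a Hive metastore; the HDFS path is hypothetical, and the class name is the one from the question:

    from pyspark.sql import SparkSession

    # Requires a Spark build with Hive support and a configured metastore.
    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Registers the UDF permanently in the metastore (Hive 0.13.0+ syntax),
    # so later sessions can call it without add jar / create temporary function.
    spark.sql("""
        CREATE FUNCTION getContentValue
        AS 'com.my.udf.GetContentValue'
        USING JAR 'hdfs:///user/hive/udfs/my_udf.jar'
    """)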

Finding all cells that have been filled with any color and highlighting corresponding column headers in excel vba

笑着哭i submitted on 2019-12-22 10:02:17
Question: My problem: I've made a large (2,000-line) macro that runs on our company's template and fixes some common issues, and highlights other issues we have, prior to importing. The template file always has 150 columns and in most instances 15,000+ rows (sometimes even over 30,000). The macro works well, highlighting all the cells that contain errors according to our data rules, but with a file with so many columns and rows I thought it'd be convenient to add a snippet to my macro that would have
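The question itself is VBA, but for readers working outside Excel, a rough openpyxl sketch of the same idea: scan the data rows for solid-filled cells and tint the header of any column that has one. The file name, header row, and highlight color are all made up, and openpyxl 3.x is assumed (where cell.column is an integer):

    from openpyxl import load_workbook
    from openpyxl.styles import PatternFill

    wb = load_workbook("template.xlsx")  # hypothetical file
    ws = wb.active
    header_fill = PatternFill(start_color="FFFFFF00",
                              end_color="FFFFFF00", fill_type="solid")

    flagged = set()
    # Row 1 is assumed to hold the headers; scan the data rows below it.
    # Cells with no fill report patternType None, so "solid" finds colored ones.
    for row in ws.iter_rows(min_row=2):
        for cell in row:
            if cell.fill.patternType == "solid":
                flagged.add(cell.column)

    for col in flagged:
        ws.cell(row=1, column=col).fill = header_fill

    wb.save("template_flagged.xlsx")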

UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream warnings

走远了吗. submitted on 2019-12-22 09:55:06
Question: I am running Spark 2.4.2 locally through PySpark for an ML project in NLP. Part of the pre-processing steps in the pipeline involve the use of pandas_udf functions optimized through pyarrow. Each time I operate on the pre-processed Spark dataframe, the following warning appears:

    UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
      warnings.warn("pyarrow.open_stream is deprecated, please use "

I tried updating pyarrow but didn't manage to avoid the warning. My
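Until the library stops calling the deprecated name, one stopgap is to silence just this one message rather than all warnings; a minimal sketch, assuming the warning text quoted in the question and run before the warning is first triggered:

    import warnings

    # Suppress only this specific deprecation message, not all warnings.
    warnings.filterwarnings(
        "ignore",
        message="pyarrow.open_stream is deprecated, please use "
                "pyarrow.ipc.open_stream",
        category=UserWarning,
    )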

Can I make SQL Server FORMAT deterministic?

六眼飞鱼酱① submitted on 2019-12-22 09:47:51
Question: I want to make a UDF that returns an integer of the form YYYYMM so that I can easily partition some things on month. I am trying to assign this function to the value of a PERSISTED computed column. I currently have the following, which works fine:

    CREATE FUNCTION dbo.GetYearMonth(@pDate DATETIME2) RETURNS INT
    WITH SCHEMABINDING AS
    BEGIN
        DECLARE @fYear VARCHAR(4) = RIGHT('0000' + CAST(YEAR(@pDate) AS VARCHAR), 4)
        DECLARE @fMonth VARCHAR(2) = RIGHT('00' + CAST(MONTH(@pDate) AS VARCHAR), 2)
        RETURN
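Whatever the SQL Server determinism rules turn out to be, the value the UDF computes is simple arithmetic: year times 100 plus month. A quick Python check of that arithmetic (the function name is mine, not from the question):

    from datetime import date

    def get_year_month(d: date) -> int:
        # 2019-12-22 -> 201912; same result as the string-padding UDF above
        return d.year * 100 + d.month

    assert get_year_month(date(2019, 12, 22)) == 201912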

Postgres functions: when does IMMUTABLE hurt performance?

穿精又带淫゛_ submitted on 2019-12-22 05:09:13
Question: The Postgres docs say: "For best optimization results, you should label your functions with the strictest volatility category that is valid for them." However, I seem to have an example where this is not the case, and I'd like to understand what's going on. (Background: I'm running Postgres 9.2.) I often need to convert times expressed as integer numbers of seconds to dates. I've written a function to do this:

    CREATE OR REPLACE FUNCTION to_datestamp(time_int double precision) RETURNS date AS $$
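For reference, the conversion itself (epoch seconds to a calendar date) looks like this in Python; the time-zone choice is an assumption, since the excerpt cuts off before the function body:

    from datetime import datetime, timezone

    def to_datestamp(time_int: float):
        # Interpret the epoch seconds as UTC and keep only the date part.
        return datetime.fromtimestamp(time_int, tz=timezone.utc).date()

    print(to_datestamp(1_577_000_000))  # 2019-12-22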

Pass table as parameter to SQLCLR TV-UDF

跟風遠走 submitted on 2019-12-22 03:55:50
Question: We have a third-party DLL that can operate on a DataTable of source information and generate some useful values, and we're trying to hook it up through SQLCLR to be callable as a table-valued UDF in SQL Server 2008. Taking the concept here one step further, I would like to program a CLR table-valued function that operates on a table of source data from the DB. I'm pretty sure I understand what needs to happen on the T-SQL side of things; but what should the method signature look like in the

SQL Server cannot find my user-defined function in stored procedure

为君一笑 submitted on 2019-12-22 01:35:26
Question: I must have some permissions wrong, but I can't figure out how. The following code is simplified, but I can't even get this to work:

    CREATE FUNCTION ufTest ( @myParm int )
    RETURNS int
    AS
    BEGIN
        DECLARE @Result int
        SELECT @Result = @myParm + 1
        RETURN @Result
    END
    GO

Then I just want to be able to call the function from a stored procedure:

    CREATE PROCEDURE dbo.[uspGetGroupProfileService]
        @id int
    AS
    BEGIN
        SET NOCOUNT ON;
        DECLARE @otherId int;
        SET @otherId = dbo.ufTest(@id);
    END

SQL Server keeps

Unable to Pass Arguments in User-Defined Function to ggplot

断了今生、忘了曾经 submitted on 2019-12-21 21:38:03
Question: I want to create a user-defined function that wraps around some popular ggplot code I found. I am getting the following error:

    Error in `[.data.frame`(DS, , xvar) : object 'xcol' not found

The following is a small pseudo dataset to illustrate the issue:

    n = 25
    dataTest <- data.frame(xcol = sample(1:3, n, replace = TRUE),
                           ycol = rnorm(n, 5, 2),
                           Cat = letters[1:5])

The user-defined code is as here:

    TRIPLOT <- function(DS, xvar, yvar, zvar) {
        # localenv <- environment()
        gg <- data.frame(x = DS[, xvar], y = DS[
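The question is R-specific; purely for contrast, here is a minimal pandas/matplotlib sketch of the same pattern, where string column names index the frame directly and so cannot produce the kind of symbol-lookup error above (the names triplot and ds are hypothetical):

    import pandas as pd
    import matplotlib.pyplot as plt

    def triplot(ds: pd.DataFrame, xvar: str, yvar: str, zvar: str):
        # Strings index pandas columns directly; no unevaluated symbols involved.
        fig, ax = plt.subplots()
        for key, grp in ds.groupby(zvar):
            ax.scatter(grp[xvar], grp[yvar], label=str(key))
        ax.legend(title=zvar)
        return fig

    # Mirroring the pseudo data above: triplot(dataTest, "xcol", "ycol", "Cat")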

How to reference a dataframe from within a UDF on another dataframe?

会有一股神秘感。 submitted on 2019-12-21 18:10:12
Question: How do you reference a PySpark dataframe during the execution of a UDF on another dataframe? Here's a dummy example. I am creating two dataframes, scores and lastnames, and each contains a column that is the same across the two dataframes. In the UDF applied on scores, I want to filter on lastnames and return a string found in lastname.

    from pyspark import SparkContext
    from pyspark import SparkConf
    from pyspark.sql import SQLContext
    from pyspark.sql.types import *

    sc = SparkContext(
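The excerpt stops at the setup, but the usual resolution is that a UDF cannot touch another DataFrame at all: you either join the two frames, or collect the small one to the driver and broadcast it. A sketch of the broadcast route under that assumption, with invented column names id and lastname on the question's scores and lastnames frames:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # A UDF runs on executors and cannot query another DataFrame.
    # Workaround: collect the small frame on the driver, broadcast a dict.
    lookup = {row["id"]: row["lastname"] for row in lastnames.collect()}
    b_lookup = sc.broadcast(lookup)

    @udf(returnType=StringType())
    def get_lastname(some_id):
        return b_lookup.value.get(some_id)

    scores.withColumn("lastname", get_lastname(scores["id"])).show()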