user-defined-functions

Spark column string replace when present in other column (row)

被刻印的时光 ゝ submitted on 2019-12-17 13:57:28
Question: I would like to remove strings from col1 that are present in col2:

    val df = spark.createDataFrame(Seq(
      ("Hi I heard about Spark", "Spark"),
      ("I wish Java could use case classes", "Java"),
      ("Logistic regression models are neat", "models")
    )).toDF("sentence", "label")

using regexp_replace or translate (ref: the Spark functions API):

    val res = df.withColumn("sentence_without_label",
      regexp_replace(col("sentence"), "(?????)", ""))

so that res looks as below:

Answer 1: You could simply use regexp_replace
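The answer is truncated above. A minimal sketch of the usual approach: in older Spark releases the Scala regexp_replace overload only accepts a plain-string pattern, so a per-row pattern taken from another column can be supplied through the SQL expression form instead:

    import org.apache.spark.sql.functions.expr

    // Replace, in each row, that row's label within that row's sentence.
    val res = df.withColumn("sentence_without_label",
      expr("regexp_replace(sentence, label, '')"))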

Using a sheet in an Excel user-defined function

你。 submitted on 2019-12-17 13:27:07
Question: The VBA I'm trying to write is fairly simple, but I've never written VBA, and coming from the Visual Studio & C# world this truly is hell! So I'll really be grateful for any help/pointers/hints here. I have two important sheets. The Range sheet has two values per date and needs a result. The Calc sheet takes two values and gets me a result. I want to put the Current and OneYear values for each date into the Calc sheet and get the result into the result column. So I tried defining a UDF, but I
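The question is truncated above, but one constraint is worth noting: a UDF called from a worksheet cell cannot write to other cells or sheets, so pushing the Current and OneYear values into the Calc sheet from a UDF will not work. A minimal sketch of the usual workaround is to fold the Calc sheet's logic into the function itself (the formula body below is purely illustrative):

    ' Reproduce the Calc sheet's computation directly in VBA instead of
    ' round-tripping the inputs through the Calc sheet.
    Public Function CalcResult(currentVal As Double, oneYearVal As Double) As Double
        ' Hypothetical stand-in for whatever the Calc sheet computes.
        CalcResult = (oneYearVal - currentVal) / currentVal
    End Function

The Range sheet's result column could then use =CalcResult(B2, C2), assuming Current sits in column B and OneYear in column C.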

Apache Spark — Assign the result of UDF to multiple dataframe columns

偶尔善良 submitted on 2019-12-17 10:22:59
Question: I'm using PySpark, loading a large CSV file into a dataframe with spark-csv, and as a pre-processing step I need to apply a variety of operations to the data available in one of the columns (which contains a JSON string). That will return X values, each of which needs to be stored in its own separate column. That functionality will be implemented in a UDF. However, I am not sure how to return a list of values from that UDF and feed these into individual columns. Below is a simple example:
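The example is cut off above. A minimal sketch of one common approach, with illustrative column and field names: have the UDF return a struct, then select the struct's fields out into separate columns.

    import json

    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import StructType, StructField, StringType

    schema = StructType([
        StructField("first", StringType()),
        StructField("second", StringType()),
    ])

    # Parse the JSON string and return one value per desired output column.
    @udf(returnType=schema)
    def parse_json(s):
        d = json.loads(s)
        return (d.get("first"), d.get("second"))

    res = (df
           .withColumn("parsed", parse_json(col("json_col")))
           .select("*",
                   col("parsed.first").alias("first"),
                   col("parsed.second").alias("second"))
           .drop("parsed"))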

How to find mean of grouped Vector columns in Spark SQL?

为君一笑 submitted on 2019-12-17 06:51:45
Question: I have created a RelationalGroupedDataset by calling instances.groupBy(instances.col("property_name")):

    val x = instances.groupBy(instances.col("property_name"))

How do I compose a user-defined aggregate function to perform Statistics.colStats().mean on each group? Thanks!

Answer 1: Spark >= 2.4

You can use Summarizer:

    import org.apache.spark.ml.stat.Summarizer

    val dfNew = df.as[(Int, org.apache.spark.mllib.linalg.Vector)]
      .map { case (group, v) => (group, v.asML) }
      .toDF("group", "features")
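The answer is cut off after the conversion to ml vectors. A minimal sketch of how the aggregation itself typically looks, continuing from the answer's dfNew:

    import org.apache.spark.sql.functions.col

    // Summarizer.mean averages Vector columns element-wise per group.
    val means = dfNew
      .groupBy("group")
      .agg(Summarizer.mean(col("features")).alias("mean_features"))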

SQL Server 2008 - How do I return a User-Defined Table Type from a Table-Valued Function?

一个人想着一个人 submitted on 2019-12-17 06:49:15
Question: Here's my user-defined table type...

    CREATE TYPE [dbo].[FooType] AS TABLE(
        [Bar] [INT]
    )

This is what I've had to do in my table-valued function to return the type:

    CREATE FUNCTION [dbo].[GetFoos] ()
    RETURNS @FooTypes TABLE ([Bar] [INT])
    AS
    BEGIN
        INSERT INTO @FooTypes VALUES (1)
        RETURN
    END

Basically, I'm having to re-declare my type definition in the RETURNS clause of the function. Isn't there a way I can simply declare the type in the RETURNS clause? I would have thought this would work:

    CREATE FUNCTION [dbo].
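The question is cut off above, but the short answer is well established: SQL Server does not allow RETURNS to reference a user-defined table type in a multi-statement table-valued function, so the table shape has to be restated inline. Table types can, however, be used for parameters. A minimal sketch:

    -- Not supported: CREATE FUNCTION ... RETURNS [dbo].[FooType]
    -- Table types CAN be used as (readonly) parameters:
    CREATE FUNCTION [dbo].[SumFoos] (@foos [dbo].[FooType] READONLY)
    RETURNS INT
    AS
    BEGIN
        RETURN (SELECT SUM([Bar]) FROM @foos)
    END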

Derive multiple columns from a single column in a Spark DataFrame

天大地大妈咪最大 submitted on 2019-12-17 05:38:41
Question: I have a DF with huge parseable metadata as a single string column in a Dataframe; let's call it DFA, with ColmnA. I would like to break this column, ColmnA, into multiple columns through a function, ClassXYZ = Func1(ColmnA). This function returns a class ClassXYZ with multiple variables, and each of these variables now has to be mapped to a new column, such as ColmnA1, ColmnA2, etc. How would I do such a transformation from one Dataframe to another with these additional columns by calling this Func1
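The question is cut off above. A minimal sketch of one common approach, assuming Func1 can be modeled as a UDF that returns a case class (all names below are illustrative): Spark represents the case-class result as a struct column, whose fields can then be selected out individually.

    import org.apache.spark.sql.functions.{udf, col}

    case class ClassXYZ(colmnA1: String, colmnA2: Int)

    // Stand-in for Func1: parse the metadata string into a case class.
    val func1 = udf((s: String) => ClassXYZ(s.take(3), s.length))

    val dfB = dfA
      .withColumn("parsed", func1(col("ColmnA")))
      .select(col("*"),
              col("parsed.colmnA1").alias("ColmnA1"),
              col("parsed.colmnA2").alias("ColmnA2"))
      .drop("parsed")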

Defining a UDF that accepts an Array of objects in a Spark DataFrame?

断了今生、忘了曾经 submitted on 2019-12-17 05:04:45
Question: When working with Spark's DataFrames, user-defined functions (UDFs) are required for mapping data in columns. UDFs require that argument types be explicitly specified. In my case, I need to manipulate a column that is made up of arrays of objects, and I do not know what type to use. Here's an example:

    import sqlContext.implicits._

    // Start with some data. Each row (here, there's only one row)
    // is a topic and a bunch of subjects
    val data = sqlContext.read.json(sc.parallelize(Seq(
      """
      |{
      |
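The example JSON is cut off above, but the generally useful part can still be sketched: an array of structs reaches a Scala UDF as Seq[Row], and fields are read with getAs. The column and field names below are assumptions:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.{udf, col}

    // Each element of the array column arrives as a Row; pull out the
    // assumed "name" field of every subject struct.
    val subjectNames = udf { subjects: Seq[Row] =>
      subjects.map(_.getAs[String]("name"))
    }

    val res = data.withColumn("names", subjectNames(col("subjects")))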

Passing a data frame column and external list to udf under withColumn

僤鯓⒐⒋嵵緔 submitted on 2019-12-17 05:03:13
Question: I have a Spark dataframe with the following structure. bodyText_token has the tokens (a processed set of words), and I have a nested list of defined keywords:

    root
     |-- id: string (nullable = true)
     |-- body: string (nullable = true)
     |-- bodyText_token: array (nullable = true)

    keyword_list = [['union', 'workers', 'strike', 'pay', 'rally', 'free', 'immigration'],
                    ['farmer', 'plants', 'fruits', 'workers'],
                    ['outside', 'field', 'party', 'clothes', 'fashions']]

I needed to check how many tokens fall under each
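The question is cut off above. A minimal sketch of the usual way to pass an external list into a UDF, by capturing it in a closure (this counts, per keyword group, how many tokens match, assuming the schema shown above):

    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import ArrayType, IntegerType

    def make_keyword_counter(groups):
        # Capture the external list in a closure; sets make lookups fast.
        sets = [set(g) for g in groups]

        def count_matches(tokens):
            tokens = tokens or []
            return [sum(1 for t in tokens if t in s) for s in sets]

        return udf(count_matches, ArrayType(IntegerType()))

    count_udf = make_keyword_counter(keyword_list)
    df = df.withColumn("keyword_counts", count_udf(col("bodyText_token")))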