user-defined-functions

How to solve pyspark `org.apache.arrow.vector.util.OversizedAllocationException` error by increasing spark's memory?

Submitted by ◇◆丶佛笑我妖孽 on 2021-02-04 18:58:09
Question: I'm running a job in PySpark where at one point I use a grouped aggregate Pandas UDF. This results in the following (abbreviated) error: org.apache.arrow.vector.util.OversizedAllocationException: Unable to expand the buffer. I'm fairly sure this is because one of the groups the Pandas UDF receives is huge, and if I reduce the dataset and remove enough rows I can run my UDF with no problems. However, I want to run with my original dataset, and even if I run this Spark job on a machine with …
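
The excerpt cuts off before the full question, but a minimal sketch of the kind of memory and Arrow tuning the title points at could look like the following; the specific values, and whether they help at all when a single group is oversized, are assumptions rather than a confirmed fix:

```python
from pyspark.sql import SparkSession

# Illustrative values only -- tune to the cluster. Note that
# spark.driver.memory generally has to be set before the driver JVM
# starts (spark-submit / spark-defaults.conf), not from a running session.
spark = (
    SparkSession.builder
    .appName("grouped-pandas-udf")
    .config("spark.executor.memory", "16g")
    .config("spark.driver.memory", "16g")
    .config("spark.sql.execution.arrow.enabled", "true")
    # Smaller Arrow batches shrink each allocation, but a grouped
    # aggregate Pandas UDF still materialises one whole group at a time,
    # so a single huge group can still overflow the Arrow buffer.
    .config("spark.sql.execution.arrow.maxRecordsPerBatch", "5000")
    .getOrCreate()
)
```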

How to use Pandas UDF Functionality in pyspark

Submitted by 别等时光非礼了梦想. on 2021-01-29 10:52:12
Question: I have a Spark data frame with two columns which looks like:

+-------------------------------------------------------------+------------------------------------+
|docId                                                        |id                                  |
+-------------------------------------------------------------+------------------------------------+
|DYSDG6-RTB-91d663dd-949e-45da-94dd-e604b6050cb5-1537142434000|91d663dd-949e-45da-94dd-e604b6050cb5|
|VAVLS7-RTB-8e2c1917-0d6b-419b-a59e-cd4acc255bb7-1537142445000|8e2c1917-0d6b-419b-a59e-cd4acc255bb7|
|VAVLS7-RTB-c818dcde …
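
The preview cuts off before the actual task, but since the id column appears to be the UUID embedded in docId, a minimal Series-to-Series pandas_udf that pulls it out might look like this (Spark 3.x type-hint style; the extraction rule and its use are assumptions drawn from the preview, not from the original question):

```python
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Assumption: docId has the shape <prefix>-RTB-<uuid>-<epoch-millis>,
# so the UUID is everything between "RTB-" and the trailing "-<digits>".
@F.pandas_udf(StringType())
def extract_id(doc_id: pd.Series) -> pd.Series:
    return doc_id.str.extract(r"RTB-(.+)-\d+$", expand=False)

# df = df.withColumn("extracted_id", extract_id(F.col("docId")))
```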

Variant Array Custom Function Google Sheets? VBA For Example

Submitted by 99封情书 on 2021-01-29 10:30:26
Question: The following function will put a "1" in every column across the Excel sheet. If I put =vbe(12) in A1, it will put "1" in cells A1:L1. How can I translate this VBA to JavaScript for Google Sheets?

Function vbe(Num As Long) As Variant
    Dim ary As Variant
    Dim i As Long
    ReDim ary(Num - 1)
    For i = 0 To Num - 1
        ary(i) = 1
    Next i
    vbe = ary
End Function

Answer 1: You can write a custom formula that creates an array of "1"s whose length is a specified parameter, e.g. function myFunction …

Scala — Conditional replace column value of a data frame

Submitted by 蹲街弑〆低调 on 2021-01-29 08:43:08
Question: DataFrame 1 is what I have now, and I want to write a Scala function that turns DataFrame 1 into DataFrame 2. Transfer is the broad category; e-Transfer and IMT are subcategories. The logic is: for the same ID (31898), if both Transfer and e-Transfer are tagged to it, it should only be e-Transfer; if Transfer, IMT, and e-Transfer are all tagged to the same ID (32614), it should be e-Transfer + IMT; if only Transfer is tagged to an ID (33987), it should be Other; if only e-Transfer or IMT is tagged to a …
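
The question asks for Scala, which this listing does not show; purely to illustrate the tagging rules described above, here is a sketch of the same logic in PySpark. The column names ID and category are assumptions, and the precedence follows the description as far as the excerpt goes:

```python
from pyspark.sql import functions as F

# Assumed input columns: ID, category (one row per tag).
# Collect the set of tags per ID and apply the precedence rules.
tags = df.groupBy("ID").agg(F.collect_set("category").alias("tags"))

result = tags.withColumn(
    "category",
    F.when(F.array_contains("tags", "e-Transfer") & F.array_contains("tags", "IMT"),
           F.lit("e-Transfer + IMT"))
     .when(F.array_contains("tags", "e-Transfer"), F.lit("e-Transfer"))
     .when(F.array_contains("tags", "IMT"), F.lit("IMT"))
     .otherwise(F.lit("Other"))   # only "Transfer" was tagged
).drop("tags")
```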

Adding a weight constraint to Max Sharpe Ratio function in python

Submitted by *爱你&永不变心* on 2021-01-29 07:33:10
Question: I have the following function to calculate the maximum Sharpe ratio for a given set of returns:

def msr(riskfree_rate, er, cov):
    """
    Returns the weights of the portfolio that gives you the maximum
    Sharpe ratio given the risk-free rate and expected returns and a
    covariance matrix
    """
    n = er.shape[0]
    init_guess = np.repeat(1/n, n)
    bounds = ((0.0, 1.0),) * n  # an N-tuple of 2-tuples!
    # construct the constraints
    weights_sum_to_1 = {'type': 'eq',
                        'fun': lambda weights: np.sum(weights) - 1}
    def neg_sharpe …
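
The title asks how to add a weight constraint to this optimiser. A minimal sketch of one way to do it with scipy.optimize.minimize follows; the per-asset cap max_weight and the completed neg_sharpe body are illustrative assumptions, not code from the original post:

```python
import numpy as np
from scipy.optimize import minimize

def msr(riskfree_rate, er, cov, max_weight=0.3):
    """Max-Sharpe weights with an illustrative per-asset cap."""
    n = er.shape[0]
    init_guess = np.repeat(1 / n, n)
    # A simple per-asset weight constraint can go straight into the bounds:
    bounds = ((0.0, max_weight),) * n
    weights_sum_to_1 = {'type': 'eq',
                        'fun': lambda w: np.sum(w) - 1}

    def neg_sharpe(w):
        ret = w @ er                       # portfolio expected return
        vol = np.sqrt(w @ cov @ w)         # portfolio volatility
        return -(ret - riskfree_rate) / vol

    result = minimize(neg_sharpe, init_guess, method='SLSQP',
                      bounds=bounds, constraints=(weights_sum_to_1,))
    return result.x
```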

Timezone conversion with pyspark from timestamp and country

Submitted by 爱⌒轻易说出口 on 2021-01-28 18:44:31
Question: I'm trying to convert a UTC date to a date in the local timezone (derived from the country) with PySpark. I have the country as a string and the date as a timestamp, so the input is:

date = Timestamp('2016-11-18 01:45:55')  # type is pandas._libs.tslibs.timestamps.Timestamp
country = "FR"  # type is string

import pytz
import pandas as pd

def convert_date_spark(date, country):
    timezone = pytz.country_timezones(country)[0]
    local_time = date.replace(tzinfo=pytz.utc).astimezone(timezone)
    date, time = local_time …
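
The excerpt stops before the Spark part. One common approach is to map the country code to a timezone name with a small UDF and hand that to from_utc_timestamp, which accepts a timezone column from Spark 2.4 on; a minimal sketch, assuming the DataFrame has columns date (UTC timestamp) and country:

```python
import pytz
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Map an ISO country code to its first pytz timezone, e.g. FR -> Europe/Paris.
@F.udf(StringType())
def country_to_tz(country):
    return pytz.country_timezones(country)[0] if country else None

local_df = df.withColumn(
    "local_date",
    F.from_utc_timestamp(F.col("date"), country_to_tz(F.col("country")))
)
```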

Use external library in pandas_udf in pyspark

Submitted by 依然范特西╮ on 2021-01-28 18:39:09
Question: Is it possible to use an external library like textdistance inside a pandas_udf? I have tried, and I get this error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). I have tried with Spark version 2.3.1. Answer 1: You can package textdistance together with your own code (use setup.py and bdist_egg to build an egg file) and specify the final package with the --py-files option when you run Spark. By the way, the error message doesn't seem to relate …
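
As the answer hints, that ValueError is usually not a packaging problem: a scalar pandas_udf receives whole pandas Series, so a function that expects two plain strings has to be applied element-wise. A minimal sketch of that pattern in the Spark 2.3 style, assuming two string columns a and b and textdistance's normalized Levenshtein similarity (all illustrative choices):

```python
import pandas as pd
import textdistance
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Spark 2.3-style scalar Pandas UDF: inputs arrive as pandas Series,
# so the pairwise similarity is computed element by element.
@F.pandas_udf(DoubleType(), F.PandasUDFType.SCALAR)
def levenshtein_sim(a, b):
    return pd.Series(
        [textdistance.levenshtein.normalized_similarity(x, y)
         for x, y in zip(a, b)]
    )

# df = df.withColumn("sim", levenshtein_sim(F.col("a"), F.col("b")))
# If textdistance is missing on the executors, ship it with
# --py-files (as in the answer) or spark.sparkContext.addPyFile(...).
```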

Checking whether a column has proper decimal number

Submitted by ♀尐吖头ヾ on 2021-01-28 08:55:14
Question: I have a dataframe (input_dataframe) which looks like this:

id  test_column
1   0.25
2   1.1
3   12
4   test
5   1.3334
6   .11

I want to add a column result that holds 1 if test_column has a decimal value and 0 if test_column has any other value. The data type of test_column is string. Below is the expected output:

id  test_column  result
1   0.25         1
2   1.1          1
3   12           0
4   test         0
5   1.3334       1
6   .11          1

Can we achieve this using PySpark code? Answer 1: You can parse the decimal token with decimal.Decimal(). Here we are …
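
The answer is cut off, but a minimal sketch of the decimal.Decimal() idea it points at could be a small UDF (the helper names are illustrative): a value counts as a decimal only if it parses and actually contains a decimal point, which matches the expected output above:

```python
from decimal import Decimal, InvalidOperation
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

def is_decimal(value):
    """Return 1 for strings like '0.25' or '.11'; 0 for '12', 'test', None."""
    try:
        Decimal(value)
    except (InvalidOperation, TypeError):
        return 0
    return 1 if "." in value else 0

is_decimal_udf = F.udf(is_decimal, IntegerType())

output = input_dataframe.withColumn("result", is_decimal_udf(F.col("test_column")))
```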

Implicit schema for pandas_udf in PySpark?

Submitted by 南楼画角 on 2021-01-27 18:01:32
Question: This answer nicely explains how to use PySpark's groupBy and pandas_udf to do custom aggregations. However, I cannot possibly declare my schema manually as shown in this part of the example:

from pyspark.sql.types import *

schema = StructType([
    StructField("key", StringType()),
    StructField("avg_min", DoubleType())
])

since I will be returning 100+ columns with names that are automatically generated. Is there any way to tell PySpark to just implicitly use the schema returned by my function and …
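
pandas_udf requires the result schema up front, so a common workaround is to build the StructType programmatically from a small sample of the function's output rather than writing 100+ StructFields by hand. A minimal sketch, where agg_func and the grouped-map UDF keyed on "key" are illustrative assumptions about the asker's setup:

```python
from pyspark.sql.functions import pandas_udf, PandasUDFType

# Run the aggregation once on a small pandas sample and let Spark infer
# the StructType from the result, instead of declaring every field by hand.
sample_pdf = agg_func(df.limit(100).toPandas())
schema = spark.createDataFrame(sample_pdf).schema

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def wrapped(pdf):
    return agg_func(pdf)

result = df.groupBy("key").apply(wrapped)
```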