pyspark

How to create a udf in PySpark which returns an array of strings?

不打扰是莪最后的温柔 · Submitted on 2020-07-17 07:24:18

Question: I have a UDF which returns a list of strings. This should not be too hard. I pass in the data type when executing the UDF, since it returns an array of strings: ArrayType(StringType). Now, somehow this is not working. The DataFrame I'm operating on is df_subsets_concat and looks like this:

df_subsets_concat.show(3, False)

+----------------------+
|col1                  |
+----------------------+
|oculunt               |
|predistposed          |
|incredulous           |
+----------------------+
only showing top 3 rows

and the code is from
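For reference, a minimal sketch of how such a UDF is typically declared (the function body and the output column name below are illustrative, not the asker's code; note that the return type must be an ArrayType(StringType()) instance, with parentheses):

from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, StringType

# Illustrative UDF: return the input string and an upper-cased copy as a list of strings.
def to_variants(word):
    return [word, word.upper()]

# The element type must be an instance, i.e. StringType(), not the bare class.
variants_udf = udf(to_variants, ArrayType(StringType()))

result = df_subsets_concat.withColumn("variants", variants_udf(col("col1")))
result.show(3, False)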

Spark: Prevent shuffle/exchange when joining two identically partitioned dataframes

陌路散爱 · Submitted on 2020-07-17 05:50:10

Question: I have two DataFrames, df1 and df2, and I want to join these tables many times on a high-cardinality field called visitor_id. I would like to perform only one initial shuffle and have all the joins take place without shuffling/exchanging data between Spark executors. To do so, I have created another column called visitor_partition that consistently assigns each visitor_id a random value in [0, 1000). I have used a custom partitioner to ensure that df1 and df2 are exactly partitioned
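One common way to get this effect (a sketch, not necessarily the asker's custom-partitioner approach): repartition both DataFrames once on the join key with the same number of partitions and cache them, so that subsequent joins can reuse that partitioning without an extra Exchange; bucketed tables (DataFrameWriter.bucketBy) are the usual alternative when the data is written out first.

from pyspark.sql import functions as F

NUM_PARTITIONS = 1000  # assumed value, matching the [0, 1000) range mentioned in the question

# Hash-partition both sides on the join key once and keep them in memory.
# Later joins on visitor_id can then reuse this partitioning instead of
# shuffling again (Spark may still add sorts for a sort-merge join).
df1_part = df1.repartition(NUM_PARTITIONS, F.col("visitor_id")).cache()
df2_part = df2.repartition(NUM_PARTITIONS, F.col("visitor_id")).cache()

joined = df1_part.join(df2_part, on="visitor_id", how="inner")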

Using Scala classes as UDF with pyspark

試著忘記壹切 · Submitted on 2020-07-16 00:45:13

Question: I'm trying to offload some computations from Python to Scala when using Apache Spark. I would like to use the class interface from Java to be able to use a persistent variable, like so (this is a nonsensical MWE based on my more complex use case):

package mwe

import org.apache.spark.sql.api.java.UDF1

class SomeFun extends UDF1[Int, Int] {
  private var prop: Int = 0
  override def call(input: Int): Int = {
    if (prop == 0) {
      prop = input
    }
    prop + input
  }
}

Now I'm attempting to use this class from
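On the PySpark side this is usually done by registering the compiled Java/Scala class as a SQL function; a sketch, assuming the jar containing mwe.SomeFun is already on the classpath (for example via --jars):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

# Register the Scala UDF1 implementation under a SQL-callable name.
spark.udf.registerJavaFunction("some_fun", "mwe.SomeFun", IntegerType())

df = spark.createDataFrame([(1,), (2,), (3,)], ["x"])
df.withColumn("y", F.expr("some_fun(x)")).show()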

split content of column into lines in pyspark

有些话、适合烂在心里 · Submitted on 2020-07-15 09:47:48

Question: I have a DataFrame df:

+------+----------+--------------------+
|SiteID| LastRecID|        Col_to_split|
+------+----------+--------------------+
|     2|1056962584|[214, 207, 206, 205]|
|     2|1056967423|          [213, 208]|
|     2|1056870114|     [213, 202, 199]|
|     2|1056876861|[203, 213, 212, 1...|

I want to split the column into lines like this:

+----------+-------------+-------------+
|     RecID|        index|        Value|
+----------+-------------+-------------+
|1056962584|            0|          214|
|1056962584|            1|          207|
|1056962584|            2|          206|
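A minimal sketch of one way to do this with posexplode, which emits each array element together with its position (column names chosen to match the desired output above):

from pyspark.sql import functions as F

# posexplode turns each element of Col_to_split into its own row,
# along with the element's position inside the array.
df_lines = df.select(
    F.col("LastRecID").alias("RecID"),
    F.posexplode("Col_to_split").alias("index", "Value"),
)
df_lines.show()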

pyspark dataframe withColumn command not working

杀马特。学长 韩版系。学妹 · Submitted on 2020-07-15 09:22:32

Question: I have an input DataFrame, df_input (updated df_input):

|comment|inp_col|inp_val|
|11     |a      |a1     |
|12     |a      |a2     |
|15     |b      |b3     |
|16     |b      |b4     |
|17     |c      |&b     |
|17     |c      |c5     |
|17     |d      |&c     |
|17     |d      |d6     |
|17     |e      |&d     |
|17     |e      |e7     |

I want to replace the variable in the inp_val column with its value. I have tried the code below to create a new column, taking the list of values which start with '&':

df_new = df_inp.select('inp_val').where(df_inp.inp_val.substr(0, 1) == '&')

Now I'm iterating over the list to replace the '
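A hedged sketch of one way to substitute a single level of '&' references, assuming &b means "the values found under inp_col = b" (chained references such as &d pointing to &c would need this step repeated or wrapped in a loop):

from pyspark.sql import functions as F

# Rows whose inp_val starts with '&' reference another inp_col value.
refs = (
    df_input
    .where(F.col("inp_val").startswith("&"))
    .withColumn("ref_col", F.expr("substring(inp_val, 2)"))
)

# Join each reference back to the frame on the referenced column and
# replace inp_val with that column's values (one level of substitution only).
resolved = (
    refs.join(
        df_input.select(
            F.col("inp_col").alias("ref_col"),
            F.col("inp_val").alias("resolved_val"),
        ),
        on="ref_col",
    )
    .select("comment", "inp_col", F.col("resolved_val").alias("inp_val"))
)
resolved.show()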

Explode list of dictionaries into additional columns in Spark

此生再无相见时 · Submitted on 2020-07-10 10:30:35

Question: I currently have a UDF that takes a column of XML strings and parses it into lists of dictionaries. I then want to explode that list-of-dictionaries column out into additional columns based on the key-value pairs. The input looks like this:

   id  type  length  parsed
0  1   A     144     [{'key1':'value1'},{'key1':'value2', 'key2':'value3'},...]
1  1   B     20      [{'key1':'value4'},{'key2':'value5'},...]
2  4   A     54      [{'key3':'value6'},...]

And I want the output to look like this:

   id  type  length  key1  key2  key3
0  1   A     144
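A sketch of one way to get there, assuming the parsed column is an array<map<string,string>>: explode the array, explode each map into key/value rows, then pivot the keys back into columns (F.first simply picks one value when a key repeats within a row, as key1 does in the first row):

from pyspark.sql import functions as F

# One key/value pair per row: first explode the array of maps,
# then explode each map into (key, value) columns.
kv = (
    df.withColumn("kv_map", F.explode("parsed"))
      .select("id", "type", "length",
              F.explode("kv_map").alias("key", "value"))
)

# Pivot the keys into columns; F.first picks one value when a key repeats.
wide = (
    kv.groupBy("id", "type", "length")
      .pivot("key")
      .agg(F.first("value"))
)
wide.show()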
