pyspark

How to create a udf in PySpark which returns an array of strings?

不打扰是莪最后的温柔 · Submitted on 2020-07-17 07:24:18

Question: I have a UDF which returns a list of strings. This should not be too hard. I pass in the data type when executing the UDF, since it returns an array of strings: ArrayType(StringType). Now, somehow this is not working. The DataFrame I'm operating on is df_subsets_concat and looks like this:

df_subsets_concat.show(3, False)

+----------------------+
|col1                  |
+----------------------+
|oculunt               |
|predistposed          |
|incredulous           |
+----------------------+
only showing top 3 rows

and the code is from
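For reference, a minimal sketch of how such a UDF is typically declared (the function body and the output column name below are illustrative, not the asker's code; note that the return type must be an ArrayType(StringType()) instance, with parentheses):

from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, StringType

# Illustrative UDF: return the input string and an upper-cased copy as a list of strings.
def to_variants(word):
    return [word, word.upper()]

# The element type must be an instance, i.e. StringType(), not the bare class.
variants_udf = udf(to_variants, ArrayType(StringType()))

result = df_subsets_concat.withColumn("variants", variants_udf(col("col1")))
result.show(3, False)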

Spark: Prevent shuffle/exchange when joining two identically partitioned dataframes

陌路散爱 · Submitted on 2020-07-17 05:50:10

Question: I have two DataFrames, df1 and df2, and I want to join these tables many times on a high-cardinality field called visitor_id. I would like to perform only one initial shuffle and have all the joins take place without shuffling/exchanging data between Spark executors. To do so, I have created another column called visitor_partition that consistently assigns each visitor_id a random value in [0, 1000). I have used a custom partitioner to ensure that df1 and df2 are exactly partitioned
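One common way to get this effect (a sketch, not necessarily the asker's custom-partitioner approach): repartition both DataFrames once on the join key with the same number of partitions and cache them, so that subsequent joins can reuse that partitioning without an extra Exchange; bucketed tables (DataFrameWriter.bucketBy) are the usual alternative when the data is written out first.

from pyspark.sql import functions as F

NUM_PARTITIONS = 1000  # assumed value, matching the [0, 1000) range mentioned in the question

# Hash-partition both sides on the join key once and keep them in memory.
# Later joins on visitor_id can then reuse this partitioning instead of
# shuffling again (Spark may still add sorts for a sort-merge join).
df1_part = df1.repartition(NUM_PARTITIONS, F.col("visitor_id")).cache()
df2_part = df2.repartition(NUM_PARTITIONS, F.col("visitor_id")).cache()

joined = df1_part.join(df2_part, on="visitor_id", how="inner")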

Using Scala classes as UDF with pyspark

試著忘記壹切 · Submitted on 2020-07-16 00:45:13

Question: I'm trying to offload some computations from Python to Scala when using Apache Spark. I would like to use the class interface from Java to be able to use a persistent variable, like so (this is a nonsensical MWE based on my more complex use case):

package mwe

import org.apache.spark.sql.api.java.UDF1

class SomeFun extends UDF1[Int, Int] {
  private var prop: Int = 0
  override def call(input: Int): Int = {
    if (prop == 0) {
      prop = input
    }
    prop + input
  }
}

Now I'm attempting to use this class from
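On the PySpark side this is usually done by registering the compiled Java/Scala class as a SQL function; a sketch, assuming the jar containing mwe.SomeFun is already on the classpath (for example via --jars):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

# Register the Scala UDF1 implementation under a SQL-callable name.
spark.udf.registerJavaFunction("some_fun", "mwe.SomeFun", IntegerType())

df = spark.createDataFrame([(1,), (2,), (3,)], ["x"])
df.withColumn("y", F.expr("some_fun(x)")).show()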

split content of column into lines in pyspark

有些话、适合烂在心里 · Submitted on 2020-07-15 09:47:48

Question: I have a DataFrame df:

+------+----------+--------------------+
|SiteID| LastRecID|        Col_to_split|
+------+----------+--------------------+
|     2|1056962584|[214, 207, 206, 205]|
|     2|1056967423|          [213, 208]|
|     2|1056870114|     [213, 202, 199]|
|     2|1056876861|[203, 213, 212, 1...|

I want to split the column into lines like this:

+----------+-------------+-------------+
|     RecID|        index|        Value|
+----------+-------------+-------------+
|1056962584|            0|          214|
|1056962584|            1|          207|
|1056962584|            2|          206|
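A minimal sketch of one way to do this with posexplode, which emits each array element together with its position (column names chosen to match the desired output above):

from pyspark.sql import functions as F

# posexplode turns each element of Col_to_split into its own row,
# along with the element's position inside the array.
df_lines = df.select(
    F.col("LastRecID").alias("RecID"),
    F.posexplode("Col_to_split").alias("index", "Value"),
)
df_lines.show()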

pyspark dataframe withColumn command not working

杀马特。学长 韩版系。学妹 · Submitted on 2020-07-15 09:22:32

Question: I have an input DataFrame, df_input (updated df_input):

|comment|inp_col|inp_val|
|11     |a      |a1     |
|12     |a      |a2     |
|15     |b      |b3     |
|16     |b      |b4     |
|17     |c      |&b     |
|17     |c      |c5     |
|17     |d      |&c     |
|17     |d      |d6     |
|17     |e      |&d     |
|17     |e      |e7     |

I want to replace the variable in the inp_val column with its value. I have tried the code below to create a new column, taking the list of values which start with '&':

df_new = df_inp.select('inp_val').where(df_inp.inp_val.substr(0, 1) == '&')

Now I'm iterating over the list to replace the '
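A hedged sketch of one way to substitute a single level of '&' references, assuming &b means "the values found under inp_col = b" (chained references such as &d pointing to &c would need this step repeated or wrapped in a loop):

from pyspark.sql import functions as F

# Rows whose inp_val starts with '&' reference another inp_col value.
refs = (
    df_input
    .where(F.col("inp_val").startswith("&"))
    .withColumn("ref_col", F.expr("substring(inp_val, 2)"))
)

# Join each reference back to the frame on the referenced column and
# replace inp_val with that column's values (one level of substitution only).
resolved = (
    refs.join(
        df_input.select(
            F.col("inp_col").alias("ref_col"),
            F.col("inp_val").alias("resolved_val"),
        ),
        on="ref_col",
    )
    .select("comment", "inp_col", F.col("resolved_val").alias("inp_val"))
)
resolved.show()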

Explode list of dictionaries into additional columns in Spark

此生再无相见时 · Submitted on 2020-07-10 10:30:35

Question: I currently have a UDF that takes a column of XML strings and parses it into lists of dictionaries. I then want to explode that list-of-dictionaries column out into additional columns based on the key-value pairs. The input looks like this:

   id  type  length  parsed
0  1   A     144     [{'key1':'value1'},{'key1':'value2', 'key2':'value3'},...]
1  1   B     20      [{'key1':'value4'},{'key2':'value5'},...]
2  4   A     54      [{'key3':'value6'},...]

And I want the output to look like this:

   id  type  length  key1  key2  key3
0  1   A     144
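A sketch of one way to get there, assuming the parsed column is an array<map<string,string>>: explode the array, explode each map into key/value rows, then pivot the keys back into columns (F.first simply picks one value when a key repeats within a row, as key1 does in the first row):

from pyspark.sql import functions as F

# One key/value pair per row: first explode the array of maps,
# then explode each map into (key, value) columns.
kv = (
    df.withColumn("kv_map", F.explode("parsed"))
      .select("id", "type", "length",
              F.explode("kv_map").alias("key", "value"))
)

# Pivot the keys into columns; F.first picks one value when a key repeats.
wide = (
    kv.groupBy("id", "type", "length")
      .pivot("key")
      .agg(F.first("value"))
)
wide.show()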
