pyspark

Pyspark retain only distinct (drop all duplicates)

Submitted by 穿精又带淫゛_ on 2020-04-18 08:40:11
Question: After joining two dataframes (each of which has its own IDs), I have some duplicates (repeated IDs from both sources). I want to drop all rows that are duplicates on either ID, i.e. not retain even a single occurrence of a duplicate. I could group by the first ID, do a count and filter for count == 1, then repeat that for the second ID, then inner join these outputs back to the original joined dataframe, but this feels a bit long. Is there a simpler method, like dropDuplicates(), but where none of the duplicated rows are kept?
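
A minimal sketch of the window-count approach the asker describes, collapsed into a single pass (the key names id_a and id_b and the name joined_df are assumptions):

    from pyspark.sql import functions as F, Window

    # Count how often each ID occurs in the joined result, then keep only rows
    # whose IDs appear exactly once on both sides.
    w_a = Window.partitionBy("id_a")
    w_b = Window.partitionBy("id_b")

    deduped = (joined_df
               .withColumn("cnt_a", F.count("*").over(w_a))
               .withColumn("cnt_b", F.count("*").over(w_b))
               .filter((F.col("cnt_a") == 1) & (F.col("cnt_b") == 1))
               .drop("cnt_a", "cnt_b"))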

cumulative product in pySpark data frame

Submitted by 萝らか妹 on 2020-04-18 07:32:51
Question: I have the following Spark DataFrame:

    +---+---+
    |  a|  b|
    +---+---+
    |  1|  1|
    |  1|  2|
    |  1|  3|
    |  1|  4|
    +---+---+

I want to make another column named "c" which contains the cumulative product of "b" over "a". The resulting DataFrame should look like:

    +---+---+---+
    |  a|  b|  c|
    +---+---+---+
    |  1|  1|  1|
    |  1|  2|  2|
    |  1|  3|  6|
    |  1|  4| 24|
    +---+---+---+

How can this be done?

Answer 1: You have to set an order column. In your case I used column 'b'.

    from pyspark.sql import functions as F, Window, types
    from …
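
A minimal sketch of how the answer's window approach can be completed, computing the cumulative product as exp(sum(log(b))) over a running window (assumes 'b' is positive and defines the row order, as in the answer; on Spark 3.2+ F.product can be used instead):

    from pyspark.sql import functions as F, Window

    w = Window.partitionBy("a").orderBy("b").rowsBetween(Window.unboundedPreceding, 0)

    # exp of the running sum of logs is the running product; round before casting
    # to absorb floating-point error.
    df = df.withColumn("c", F.round(F.exp(F.sum(F.log(F.col("b"))).over(w))).cast("long"))
    df.show()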

How to fix pyspark NLTK Error with OSError: [WinError 123]?

Submitted by 旧城冷巷雨未停 on 2020-04-18 06:12:15
Question: I got an unexpected error when transforming an RDD to a DataFrame:

    import nltk
    from nltk import pos_tag

    my_rdd_of_lists = df_removed.select("removed").rdd.map(lambda x: nltk.pos_tag(x))
    my_df = spark.createDataFrame(my_rdd_of_lists)

The error appears whenever I call an nltk function on the RDD; if I use any numpy method in that line instead, it does not fail. Error code:

    Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
    : org.apache.spark.SparkException …
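
A hedged workaround worth trying (not necessarily the accepted fix): import NLTK and set its data path inside each partition, so the tagger and its data are resolved on the executors rather than through a Windows path serialized from the driver. The data directory below is an assumption:

    def tag_partition(rows):
        # Configure NLTK on the executor itself.
        import nltk
        nltk.data.path.append("C:/nltk_data")  # assumed location of the NLTK data
        from nltk import pos_tag
        for row in rows:
            yield pos_tag(row[0])  # row[0] is the token list in the "removed" column

    my_rdd_of_lists = df_removed.select("removed").rdd.mapPartitions(tag_partition)
    my_df = spark.createDataFrame(my_rdd_of_lists)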

xgboost in pysparkling water throws an error: XGBoost is not available on all nodes

Submitted by 那年仲夏 on 2020-04-18 05:46:08
Question: I am trying to run XGBoost from the H2O package in a Spark cluster. I am using H2O on an on-prem cluster running Red Hat Enterprise Linux Server, kernel version '3.10.0-1062.9.1.el7.x86_64'. I start the H2O cluster inside the Spark environment:

    .appName('APP1')\
    .config('spark.executor.memory', '15g')\
    .config('spark.executor.cores', '8')\
    .config('spark.executor.instances', '5')\
    .config('spark.yarn.queue', "DS")\
    .config('spark.yarn.executor.memoryOverhead', '1096')\
    .enableHiveSupport()\
    .getOrCreate()

    from …
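
A hedged first check for this error, using H2O's own availability probe (assumes a SparkSession already exists and that Sparkling Water's pysparkling package is installed):

    from pysparkling import H2OContext
    from h2o.estimators.xgboost import H2OXGBoostEstimator

    hc = H2OContext.getOrCreate()  # older pysparkling releases take the SparkSession as an argument

    # False here means the native XGBoost libraries are missing on the H2O nodes,
    # which is what "XGBoost is not available on all nodes" points at.
    print(H2OXGBoostEstimator.available())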

concatenating two columns in a pyspark data frame in alphabetical order [duplicate]

Submitted by 萝らか妹 on 2020-04-17 22:54:55
Question: This question already has an answer here: how to sort value before concatenate text columns in pyspark (1 answer). Closed 11 days ago.

I have a pyspark data frame with 5M rows and I am going to apply fuzzy matching (Levenshtein and Soundex functions) to find duplicates on the first name and last name columns. Before that, I want to re-order the first name and last name values alphabetically so that I get the correct Levenshtein distance.

    df = df.withColumn('full_name', f.concat(f.col('first'), f…
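
A minimal sketch of the sorted concatenation on Spark 2.4+, sorting the two names with array_sort before joining them (the column names first and last follow the question; the space separator is an assumption):

    from pyspark.sql import functions as F

    # Put the two name values in alphabetical order, then concatenate them.
    df = df.withColumn(
        "full_name",
        F.concat_ws(" ", F.array_sort(F.array(F.col("first"), F.col("last"))))
    )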

Write a pyspark.sql.dataframe.DataFrame without losing information

Submitted by 给你一囗甜甜゛ on 2020-04-17 22:53:22
Question: I am trying to save a pyspark.sql.dataframe.DataFrame in CSV format (it could also be another format, as long as it is easily readable). So far I have found a couple of examples of saving the DataFrame, but it loses information every time I write it. Dataset example:

    # Create an example Pyspark DataFrame
    from pyspark.sql import Row

    Employee = Row("firstName", "lastName", "email", "salary")
    employee1 = Employee('A', 'AA', 'mail1', 100000)
    employee2 = Employee('B', 'BB', 'mail2', 120000)
    …
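
A hedged sketch of two common ways to keep schema information across a write (the output paths are assumptions): CSV with an explicit header and schema handling on read, or Parquet, which stores the schema with the data:

    # CSV: write the header, then re-infer (or better, explicitly supply) the schema on read.
    df.write.mode("overwrite").option("header", True).csv("employees_csv")
    df_back = spark.read.option("header", True).option("inferSchema", True).csv("employees_csv")

    # Parquet: column names and types survive the round trip unchanged.
    df.write.mode("overwrite").parquet("employees_parquet")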

ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions

Submitted by 淺唱寂寞╮ on 2020-04-17 22:41:43
Question: I got this error while using the code below to drop a nested column with pyspark. Why is this not working? I tried using a tilde instead of not/!= as the error suggests, but that doesn't work either. So what do you do in that case?

    def drop_col(df, struct_nm, delete_struct_child_col_nm):
        fields_to_keep = filter(lambda x: x != delete_struct_child_col_nm, df.select("{}.*".format(struct_nm)).columns)
        fields_to_keep = list(map(lambda x: "{}.{}".format(struct_nm, x), fields_to_keep))
        return df…
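
A hedged sketch of one way to finish this: rebuild the struct from the child fields to keep, so no Column expression ever ends up in Python boolean logic (which is what triggers this ValueError). The function signature follows the question:

    from pyspark.sql import functions as F

    def drop_col(df, struct_nm, delete_struct_child_col_nm):
        # Child field names are plain Python strings, so != here is ordinary string comparison.
        kept = [c for c in df.select("{}.*".format(struct_nm)).columns
                if c != delete_struct_child_col_nm]
        # Rebuild the struct with only the kept children and overwrite the original column.
        rebuilt = F.struct(*[F.col("{}.{}".format(struct_nm, c)).alias(c) for c in kept])
        return df.withColumn(struct_nm, rebuilt)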

get distinct count from the array in each row using pyspark

Submitted by 孤人 on 2020-04-16 03:31:34
Question: I am looking for the distinct count of the array in each row of a pyspark dataframe.

Input:

    col1
    [1,1,1]
    [3,4,5]
    [1,2,1,2]

Expected output:

    1
    3
    2

I used the code below, but it gives me the length of each array instead (output: 3, 3, 4). Please help me achieve this with a python pyspark dataframe.

    slen = udf(lambda s: len(s), IntegerType())
    count = Df.withColumn("Count", slen(df.col1))
    count.show()

Thanks in advance!

Answer 1: For Spark 2.4+ you can use array_distinct and then just take the size of that, to get …
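
A minimal sketch completing the answer's array_distinct approach (Spark 2.4+; the DataFrame name df is assumed in place of the question's mixed Df/df):

    from pyspark.sql import functions as F

    # size(array_distinct(col1)) counts the unique elements in each row's array.
    count = df.withColumn("Count", F.size(F.array_distinct(F.col("col1"))))
    count.show()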
