pyspark

How to drop multiple column names given in a list from Spark DataFrame?

耗尽温柔 submitted on 2020-05-25 12:15:50
Question: I have a dynamic list which is created based on the value of n:

n = 3
drop_lst = ['a' + str(i) for i in range(n)]
df.drop(drop_lst)

But the above does not work. Note: my use case requires a dynamic list. If I simply pass the names without a list, it works: df.drop('a0', 'a1', 'a2'). How do I make the drop function work with a list? Spark 2.2 doesn't seem to have this capability. Is there a way to make it work without using select()?

Answer 1: You can use the * operator to pass the contents of your list as arguments to drop().
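A minimal sketch of that approach, assuming drop_lst holds the generated column names and df is a DataFrame containing them:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2, 3, 4)], ["a0", "a1", "a2", "keep"])

n = 3
drop_lst = ['a' + str(i) for i in range(n)]

# Unpack the list so each name is passed as a separate positional argument.
df = df.drop(*drop_lst)
df.show()  # only the "keep" column remains
```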

How to convert date to the first day of month in a PySpark Dataframe column?

此生再无相见时 submitted on 2020-05-23 12:50:26
Question: I have the following DataFrame:

+----------+
|      date|
+----------+
|2017-01-25|
|2017-01-21|
|2017-01-12|
+----------+

Here is the code that creates the above DataFrame:

import pyspark.sql.functions as f
rdd = sc.parallelize([("2017/11/25",), ("2017/12/21",), ("2017/09/12",)])
df = sqlContext.createDataFrame(rdd, ["date"]).withColumn("date", f.to_date(f.col("date"), "yyyy/MM/dd"))
df.show()

I want a new column with the first date of the month for each row, i.e. just replace the day with "01" in all the dates.
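The answer is not included above; one way to do this is the trunc function, which truncates a date to the first day of its month. A short sketch assuming the df built in the question:

```python
import pyspark.sql.functions as f

# trunc("date", "month") maps e.g. 2017-11-25 to 2017-11-01.
df_first = df.withColumn("first_date", f.trunc("date", "month"))
df_first.show()
```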

pickle.PicklingError: args[0] from __newobj__ args has the wrong class with hadoop python

淺唱寂寞╮ submitted on 2020-05-23 09:04:45
Question: I am trying to delete stop words via Spark; the code is as follows:

from nltk.corpus import stopwords
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext('local')
spark = SparkSession(sc)

word_list = ["ourselves", "out", "over", "own", "same", "shan't", "she", "she'd", "what", "the", "fuck", "is", "this", "world", "too", "who", "who's", "whom", "yours", "yourself", "yourselves"]
wordlist = spark.createDataFrame([word_list]).rdd

def stopwords ...
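The pickling error typically comes from shipping non-serializable objects (such as NLTK state) to executors inside a closure. One way to sidestep that entirely (a sketch, not the poster's original code) is to keep the stop words as plain strings and use pyspark.ml.feature.StopWordsRemover on an array-of-words column:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import StopWordsRemover

spark = SparkSession.builder.master("local").getOrCreate()

# Hypothetical input: one array-of-words column per row.
df = spark.createDataFrame([(["who", "is", "in", "this", "world"],)], ["words"])

remover = StopWordsRemover(inputCol="words", outputCol="filtered",
                           stopWords=["who", "is", "this", "the"])
remover.transform(df).show(truncate=False)
```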

Differences between null and NaN in spark? How to deal with it?

不羁岁月 submitted on 2020-05-22 17:42:47
Question: In my DataFrame there are columns containing values of null and NaN respectively, such as:

df = spark.createDataFrame([(1, float('nan')), (None, 1.0)], ("a", "b"))
df.show()

+----+---+
|   a|  b|
+----+---+
|   1|NaN|
|null|1.0|
+----+---+

Is there any difference between those? How can they be dealt with?

Answer 1: null represents "no value" or "nothing"; it is not even an empty string or zero, and it can be used to indicate that nothing useful exists. NaN stands for "Not a Number" and is usually the result of an invalid numeric operation.
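A short sketch of how the two are typically handled, using the DataFrame above; isnan() only matches NaN, while isNull() only matches null:

```python
from pyspark.sql import functions as F

# Flag each kind separately.
df.select(
    F.isnan("b").alias("b_is_nan"),
    F.col("a").isNull().alias("a_is_null"),
).show()

# Replace both: na.fill handles null and NaN alike for numeric columns.
df.na.fill(0.0).show()

# Or drop rows containing either.
df.na.drop().show()
```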

Populating column in dataframe with pySpark

蹲街弑〆低调 submitted on 2020-05-22 10:02:25
Question: I am new to PySpark and I'm trying to fill a column based on conditions using a list. How can I do that?

Python logic:

if matchedPortfolios == 0:
    print("ALL")
else:
    print(Portfolios)

PySpark attempt (with error):

# Check matching column values in order to find common portfolio names
Portfolios = set(portfolio_DomainItemLookup) & set(portfolio_dataset_standardFalse)
Portfolios  # prints the list of matched names OR prints an empty list
matchedPortfolios = len(Portfolios)
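One way to express that branching on a DataFrame column (a sketch with a hypothetical PortfolioName column and output column name, since the post is truncated) is isin() together with when/otherwise:

```python
from pyspark.sql import functions as F

matched = list(Portfolios)  # the set computed above

if len(matched) == 0:
    # No common names: mark every row as "ALL".
    df = df.withColumn("portfolio_label", F.lit("ALL"))
else:
    # Label rows whose (hypothetical) PortfolioName matches one of the common names.
    df = df.withColumn(
        "portfolio_label",
        F.when(F.col("PortfolioName").isin(matched), F.col("PortfolioName"))
         .otherwise(F.lit("ALL")),
    )
```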

Spark Window Functions - rangeBetween dates

岁酱吖の submitted on 2020-05-21 01:56:08
Question: I have a Spark SQL DataFrame with data, and what I'm trying to get is all the rows preceding the current row within a given date range. So, for example, I want all the rows from 7 days back preceding the given row. I figured out that I need to use a window function like:

Window \
    .partitionBy('id') \
    .orderBy('start')

and here comes the problem. I want a rangeBetween of 7 days, but there is nothing in the Spark docs I could find on this. Does Spark even provide such an option? For now I'm just ...
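A common workaround (a sketch, not necessarily the poster's final solution): order the window by the date cast to a Unix timestamp in seconds, so the range can be expressed in seconds:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

days = lambda n: n * 86400  # seconds in n days

w = (Window
     .partitionBy('id')
     .orderBy(F.col('start').cast('timestamp').cast('long'))
     .rangeBetween(-days(7), 0))

# Example: aggregate a (hypothetical) "value" column over the trailing 7-day window.
df = df.withColumn('value_7d', F.sum('value').over(w))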

pySpark mapping multiple columns

▼魔方 西西 submitted on 2020-05-19 17:50:33
Question: I need to be able to compare two dataframes using multiple columns.

PySpark attempt: I decided to filter the reference dataframe by one level (reference_df.PrimaryLookupAttributeName compared to df1.LeaseStatus). How can I iterate over primaryLookupAttributeName_List and avoid hardcoding LeaseStatus? I want to get the PrimaryLookupAttributeValue values from the reference table into a dictionary to compare them to df1, and output a new df with the found/matched values. I decided to hard code FOUND because ...
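A sketch of one way to avoid hardcoding the attribute name: loop over the list and left-join the reference table once per attribute. The column names (PrimaryLookupAttributeName, PrimaryLookupAttributeValue) follow the question; the list contents and the status column are hypothetical:

```python
from pyspark.sql import functions as F

primaryLookupAttributeName_List = ["LeaseStatus", "LeaseType"]  # example contents

result = df1
for attr in primaryLookupAttributeName_List:
    lookup = (reference_df
              .filter(F.col("PrimaryLookupAttributeName") == attr)
              .select(F.col("PrimaryLookupAttributeValue").alias(attr),
                      F.lit("FOUND").alias(attr + "_status"))
              .dropDuplicates([attr]))
    # Left join on the attribute column; unmatched rows get a null status.
    result = (result.join(lookup, on=attr, how="left")
                    .withColumn(attr + "_status",
                                F.coalesce(F.col(attr + "_status"),
                                           F.lit("NOT FOUND"))))

result.show()
```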

How to store JSON dataframe with comma separated

心已入冬 submitted on 2020-05-17 08:46:46
Question: I need to write the records of a dataframe to a JSON file. If I write the dataframe to the file it is stored like {"a":1} {"b":2}, but I want it written like [{"a":1}, {"b":2}]. Can you please help me? Thanks in advance.

Answer 1: Use the to_json function to create an array of JSON objects, then use .saveAsTextFile to save the JSON. Example:

# sample dataframe
df = spark.createDataFrame([("a", 1), ("b", 2)], ["id", "name"])
from pyspark.sql.functions import *
df.groupBy(lit("1")). \
    agg(collect_list ...
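The answer is cut off above; a completed sketch of the same idea (to_json on each row, collect_list into one array, then saveAsTextFile; the output path is hypothetical):

```python
from pyspark.sql.functions import lit, struct, to_json, collect_list, concat_ws, concat

# sample dataframe
df = spark.createDataFrame([("a", 1), ("b", 2)], ["id", "name"])

(df.groupBy(lit("1"))
   .agg(concat(lit("["),
               concat_ws(",", collect_list(to_json(struct(*df.columns)))),
               lit("]")).alias("json"))
   .select("json")
   .rdd.map(lambda row: row.json)
   .saveAsTextFile("/tmp/json_array_output"))  # hypothetical path
```

This produces a single line like [{"id":"a","name":1},{"id":"b","name":2}].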
