pyspark

How to drop multiple column names given in a list from Spark DataFrame?

耗尽温柔 submitted on 2020-05-25 12:15:50
Question: I have a dynamic list which is created based on the value of n:

n = 3
drop_lst = ['a' + str(i) for i in range(n)]
df.drop(drop_lst)

But the above does not work. Note: my use case requires a dynamic list. If I simply pass the names without a list, it works: df.drop('a0', 'a1', 'a2'). How do I make the drop function work with a list? Spark 2.2 doesn't seem to have this capability. Is there a way to make it work without using select()?

Answer 1: You can use the * operator to pass the contents of your list as arguments to drop().
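A minimal sketch of that approach, assuming drop_lst holds the generated column names and df is a DataFrame containing them:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2, 3, 4)], ["a0", "a1", "a2", "keep"])

n = 3
drop_lst = ['a' + str(i) for i in range(n)]

# Unpack the list so each name is passed as a separate positional argument.
df = df.drop(*drop_lst)
df.show()  # only the "keep" column remains
```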

How to convert date to the first day of month in a PySpark Dataframe column?

此生再无相见时 submitted on 2020-05-23 12:50:26
Question: I have the following DataFrame:

+----------+
|      date|
+----------+
|2017-01-25|
|2017-01-21|
|2017-01-12|
+----------+

Here is the code that creates the above DataFrame:

import pyspark.sql.functions as f
rdd = sc.parallelize([("2017/11/25",), ("2017/12/21",), ("2017/09/12",)])
df = sqlContext.createDataFrame(rdd, ["date"]).withColumn("date", f.to_date(f.col("date"), "yyyy/MM/dd"))
df.show()

I want a new column with the first date of the month for each row, i.e. just replace the day with "01" in all the dates.
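The answer is not included above; one way to do this is the trunc function, which truncates a date to the first day of its month. A short sketch assuming the df built in the question:

```python
import pyspark.sql.functions as f

# trunc("date", "month") maps e.g. 2017-11-25 to 2017-11-01.
df_first = df.withColumn("first_date", f.trunc("date", "month"))
df_first.show()
```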

pickle.PicklingError: args[0] from __newobj__ args has the wrong class with hadoop python

淺唱寂寞╮ submitted on 2020-05-23 09:04:45
Question: I am trying to delete stop words via Spark; the code is as follows:

from nltk.corpus import stopwords
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext('local')
spark = SparkSession(sc)

word_list = ["ourselves", "out", "over", "own", "same", "shan't", "she", "she'd", "what", "the", "fuck", "is", "this", "world", "too", "who", "who's", "whom", "yours", "yourself", "yourselves"]
wordlist = spark.createDataFrame([word_list]).rdd

def stopwords ...
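The pickling error typically comes from shipping non-serializable objects (such as NLTK state) to executors inside a closure. One way to sidestep that entirely (a sketch, not the poster's original code) is to keep the stop words as plain strings and use pyspark.ml.feature.StopWordsRemover on an array-of-words column:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import StopWordsRemover

spark = SparkSession.builder.master("local").getOrCreate()

# Hypothetical input: one array-of-words column per row.
df = spark.createDataFrame([(["who", "is", "in", "this", "world"],)], ["words"])

remover = StopWordsRemover(inputCol="words", outputCol="filtered",
                           stopWords=["who", "is", "this", "the"])
remover.transform(df).show(truncate=False)
```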

Differences between null and NaN in spark? How to deal with it?

不羁岁月 submitted on 2020-05-22 17:42:47
Question: In my DataFrame there are columns containing values of null and NaN respectively, such as:

df = spark.createDataFrame([(1, float('nan')), (None, 1.0)], ("a", "b"))
df.show()

+----+---+
|   a|  b|
+----+---+
|   1|NaN|
|null|1.0|
+----+---+

Is there any difference between those? How can they be dealt with?

Answer 1: null represents "no value" or "nothing"; it is not even an empty string or zero, and it can be used to indicate that nothing useful exists. NaN stands for "Not a Number" and is usually the result of an invalid numeric operation.
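A short sketch of how the two are typically handled, using the DataFrame above; isnan() only matches NaN, while isNull() only matches null:

```python
from pyspark.sql import functions as F

# Flag each kind separately.
df.select(
    F.isnan("b").alias("b_is_nan"),
    F.col("a").isNull().alias("a_is_null"),
).show()

# Replace both: na.fill handles null and NaN alike for numeric columns.
df.na.fill(0.0).show()

# Or drop rows containing either.
df.na.drop().show()
```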

Populating column in dataframe with pySpark

蹲街弑〆低调 submitted on 2020-05-22 10:02:25
Question: I am new to PySpark and I'm trying to fill a column based on conditions using a list. How can I do that?

Python logic:

if matchedPortfolios == 0:
    print("ALL")
else:
    print(Portfolios)

PySpark attempt (with error):

# Check matching column values in order to find common portfolio names
Portfolios = set(portfolio_DomainItemLookup) & set(portfolio_dataset_standardFalse)
Portfolios  # prints the list of matched names OR prints an empty list
matchedPortfolios = len(Portfolios)
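One way to express that branching on a DataFrame column (a sketch with a hypothetical PortfolioName column and output column name, since the post is truncated) is isin() together with when/otherwise:

```python
from pyspark.sql import functions as F

matched = list(Portfolios)  # the set computed above

if len(matched) == 0:
    # No common names: mark every row as "ALL".
    df = df.withColumn("portfolio_label", F.lit("ALL"))
else:
    # Label rows whose (hypothetical) PortfolioName matches one of the common names.
    df = df.withColumn(
        "portfolio_label",
        F.when(F.col("PortfolioName").isin(matched), F.col("PortfolioName"))
         .otherwise(F.lit("ALL")),
    )
```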

Spark Window Functions - rangeBetween dates

岁酱吖の submitted on 2020-05-21 01:56:08
Question: I have a Spark SQL DataFrame with data, and what I'm trying to get is all the rows preceding the current row within a given date range. So, for example, I want all the rows from 7 days back preceding the given row. I figured out that I need to use a window function like:

Window \
    .partitionBy('id') \
    .orderBy('start')

and here comes the problem. I want a rangeBetween of 7 days, but there is nothing in the Spark docs I could find on this. Does Spark even provide such an option? For now I'm just ...
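A common workaround (a sketch, not necessarily the poster's final solution): order the window by the date cast to a Unix timestamp in seconds, so the range can be expressed in seconds:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

days = lambda n: n * 86400  # seconds in n days

w = (Window
     .partitionBy('id')
     .orderBy(F.col('start').cast('timestamp').cast('long'))
     .rangeBetween(-days(7), 0))

# Example: aggregate a (hypothetical) "value" column over the trailing 7-day window.
df = df.withColumn('value_7d', F.sum('value').over(w))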

pySpark mapping multiple columns

▼魔方 西西 submitted on 2020-05-19 17:50:33
Question: I need to be able to compare two dataframes using multiple columns.

PySpark attempt: I decided to filter the reference dataframe by one level (reference_df.PrimaryLookupAttributeName compared to df1.LeaseStatus). How can I iterate over primaryLookupAttributeName_List and avoid hardcoding LeaseStatus? I want to get the PrimaryLookupAttributeValue values from the reference table into a dictionary to compare them to df1, and output a new df with the found/matched values. I decided to hard code FOUND because ...
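A sketch of one way to avoid hardcoding the attribute name: loop over the list and left-join the reference table once per attribute. The column names (PrimaryLookupAttributeName, PrimaryLookupAttributeValue) follow the question; the list contents and the status column are hypothetical:

```python
from pyspark.sql import functions as F

primaryLookupAttributeName_List = ["LeaseStatus", "LeaseType"]  # example contents

result = df1
for attr in primaryLookupAttributeName_List:
    lookup = (reference_df
              .filter(F.col("PrimaryLookupAttributeName") == attr)
              .select(F.col("PrimaryLookupAttributeValue").alias(attr),
                      F.lit("FOUND").alias(attr + "_status"))
              .dropDuplicates([attr]))
    # Left join on the attribute column; unmatched rows get a null status.
    result = (result.join(lookup, on=attr, how="left")
                    .withColumn(attr + "_status",
                                F.coalesce(F.col(attr + "_status"),
                                           F.lit("NOT FOUND"))))

result.show()
```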

How to store JSON dataframe with comma separated

心已入冬 submitted on 2020-05-17 08:46:46
Question: I need to write the records of a dataframe to a JSON file. If I write the dataframe to the file it is stored like {"a":1} {"b":2}, but I want it written like [{"a":1}, {"b":2}]. Can you please help me? Thanks in advance.

Answer 1: Use the to_json function to create an array of JSON objects, then use .saveAsTextFile to save the JSON. Example:

# sample dataframe
df = spark.createDataFrame([("a", 1), ("b", 2)], ["id", "name"])
from pyspark.sql.functions import *
df.groupBy(lit("1")). \
    agg(collect_list ...
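The answer is cut off above; a completed sketch of the same idea (to_json on each row, collect_list into one array, then saveAsTextFile; the output path is hypothetical):

```python
from pyspark.sql.functions import lit, struct, to_json, collect_list, concat_ws, concat

# sample dataframe
df = spark.createDataFrame([("a", 1), ("b", 2)], ["id", "name"])

(df.groupBy(lit("1"))
   .agg(concat(lit("["),
               concat_ws(",", collect_list(to_json(struct(*df.columns)))),
               lit("]")).alias("json"))
   .select("json")
   .rdd.map(lambda row: row.json)
   .saveAsTextFile("/tmp/json_array_output"))  # hypothetical path
```

This produces a single line like [{"id":"a","name":1},{"id":"b","name":2}].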
