Question
Complete newbie here.
I would like to create a DataFrame using PySpark that lists the month and year, starting from the current date and continuing for x rows.
If I choose x=5,
the DataFrame should look like this:
Calendar_Entry
August 2019
September 2019
October 2019
November 2019
December 2019
Answer 1:
Spark is not a tool for generating rows in a distributed way, but rather for processing them in a distributed way.
Since your data is small anyway, the best solution is probably to create the data in pure Python and, if required, create a Spark DataFrame out of it.
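    import datetime
    from dateutil.relativedelta import relativedelta
    from pyspark.sql import SparkSession

    # Reuse the active session, or start one if none exists.
    spark = SparkSession.builder.getOrCreate()

    def create_months_df(n_months):
        # Start at the current month and step forward one month at a time.
        today = datetime.datetime.today()
        date_list = [today + relativedelta(months=i) for i in range(n_months)]
        dates_formatted = [(d.strftime("%B"), d.year) for d in date_list]
        return spark.createDataFrame(dates_formatted, ["month", "year"])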
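The function above returns separate month and year columns. If a single Calendar_Entry column like the one in the question is wanted, a minimal sketch could look like the following. The names month_year_labels and create_calendar_df are made up for illustration, the month arithmetic uses only the standard library instead of dateutil, and the month names assume the default (English) locale.

```python
import calendar
import datetime


def month_year_labels(start, n_months):
    """Return n_months labels such as 'August 2019', beginning with start's month."""
    labels = []
    for i in range(n_months):
        # Count months from year 0 so that December rolls over into January.
        year, month_index = divmod(start.year * 12 + (start.month - 1) + i, 12)
        # calendar.month_name is 1-based and follows the active locale.
        labels.append(f"{calendar.month_name[month_index + 1]} {year}")
    return labels


def create_calendar_df(spark, n_months, start=None):
    """Build a one-column Spark DataFrame from the labels (spark is a SparkSession)."""
    start = start or datetime.date.today()
    # createDataFrame expects one tuple per row, even for a single column.
    rows = [(label,) for label in month_year_labels(start, n_months)]
    return spark.createDataFrame(rows, ["Calendar_Entry"])
```

Calling create_calendar_df(spark, 5) with an existing SparkSession should give five rows beginning with the current month. The label logic is kept in its own function so it can be checked without starting Spark.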
Source: https://stackoverflow.com/questions/57426093/create-pyspark-dataframe-sequence-of-months-with-year