Create PySpark dataframe : sequence of months with year

Submitted by ☆樱花仙子☆ on 2020-06-17 13:02:06

Question


Complete newbie here.

I would like to create a dataframe with PySpark that lists month and year, starting from the current date and running for x rows.

If I choose x = 5, the dataframe should look like this:

Calendar_Entry

August 2019
September 2019
October 2019
November 2019
December 2019

Answer 1:


Spark is not a tool for generating rows in a distributed way, but rather for processing them in a distributed way.
Since your data is small anyway, the best solution is probably to create the data in pure Python and, if required, create a Spark dataframe out of it.

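import datetime
from dateutil.relativedelta import relativedelta
from pyspark.sql import SparkSession

# Reuse the active session if there is one, otherwise start one.
spark = SparkSession.builder.getOrCreate()


def create_months_df(n_months):
    # Start at the current date and step forward one month per row,
    # which matches the order shown in the question.
    today = datetime.datetime.today()
    date_list = [today + relativedelta(months=i) for i in range(n_months)]
    # One (month name, year) tuple per row, e.g. ("August", 2019).
    dates_formatted = [(d.strftime("%B"), d.year) for d in date_list]
    return spark.createDataFrame(dates_formatted, ["month", "year"])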
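
The function above returns separate month and year columns, while the question shows a single Calendar_Entry column. A minimal sketch of that variant follows, using only the standard library to build the strings. The names calendar_entries and create_calendar_df and the start parameter are hypothetical, chosen for illustration, and create_calendar_df assumes you pass in an existing SparkSession.

```python
import calendar
import datetime


def calendar_entries(n_months, start=None):
    """Return n_months strings such as 'August 2019', beginning at `start`.

    `start` defaults to today's date; it is a hypothetical parameter added
    here so the output can be reproduced for a fixed date.
    """
    start = start or datetime.date.today()
    entries = []
    for i in range(n_months):
        # Count months as one running total so December rolls into January.
        total = start.year * 12 + (start.month - 1) + i
        year, month_index = divmod(total, 12)
        # calendar.month_name is 1-based (index 0 is an empty string).
        entries.append(f"{calendar.month_name[month_index + 1]} {year}")
    return entries


def create_calendar_df(spark, n_months, start=None):
    """Wrap the strings in a one-column Spark dataframe.

    Assumes `spark` is an existing SparkSession supplied by the caller.
    """
    rows = [(entry,) for entry in calendar_entries(n_months, start)]
    return spark.createDataFrame(rows, ["Calendar_Entry"])
```

Calling calendar_entries(5, datetime.date(2019, 8, 9)) gives the five entries from August 2019 to December 2019 listed in the question. Month names come from the calendar module, so they follow the active locale, as strftime("%B") does.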


Source: https://stackoverflow.com/questions/57426093/create-pyspark-dataframe-sequence-of-months-with-year
