If dataframes in Spark are immutable, why are we able to modify it with operations such as withColumn()?

Submitted by 陌路散爱 on 2020-05-15 10:23:47

Question


This is probably a stupid question originating from my ignorance. I have been working on PySpark for a few weeks now and do not have much programming experience to start with.

My understanding is that in Spark, RDDs, Dataframes, and Datasets are all immutable - which, again I understand, means you cannot change the data. If so, why are we able to edit a Dataframe's existing column using withColumn()?


Answer 1:


Per the Spark architecture, a DataFrame is built on top of RDDs, which are immutable by design; hence DataFrames are immutable as well.

Regarding withColumn, or any other transformation for that matter: when you apply such an operation to a DataFrame, it generates a new DataFrame instead of updating the existing one.

However, in Python a variable is just a reference to an object. When you execute the statement below, you are not mutating the DataFrame; you are rebinding the name:

df = df.withColumn("new_col", some_expression)

This generates another DataFrame and assigns it to the name "df"; the original DataFrame object itself is unchanged.

To verify this, you can use the id() method of the underlying RDD to get a unique identifier for your DataFrame:

df.rdd.id()

will give you a unique identifier for your DataFrame; the DataFrame returned by withColumn will have a different one.

I hope the above explanation helps.

Regards,

Neeraj




Answer 2:


You aren't modifying it; the documentation for withColumn explicitly says:

Returns a new Dataset by adding a column or replacing the existing column that has the same name.

If you keep a variable referring to the DataFrame you called withColumn on, it won't have the new column.



Source: https://stackoverflow.com/questions/53374140/if-dataframes-in-spark-are-immutable-why-are-we-able-to-modify-it-with-operatio
