PySpark DataFrame Column Reference: df.col vs. df['col'] vs. F.col('col')?


Question


I have a concept I hope you can help clarify:

What's the difference between the following three ways of referring to a column in a PySpark DataFrame? I know different situations call for different forms, but I'm not sure why.

  1. df.col: e.g. F.count(df.col)
  2. df['col']: e.g. df['col'] == 0
  3. F.col('col'): e.g. df.filter(F.col('col').isNull())

Thanks a lot!


Answer 1:


In most practical applications, there is almost no difference. However, the three forms are implemented by calls to different underlying functions (source) and thus are not exactly the same.

We can illustrate with a small example:

df = spark.createDataFrame(
    [(1, 'a', 0), (2, 'b', None), (None, 'c', 3)],
    ['col', '2col', 'third col']
)

df.show()
#+----+----+---------+
#| col|2col|third col|
#+----+----+---------+
#|   1|   a|        0|
#|   2|   b|     null|
#|null|   c|        3|
#+----+----+---------+
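
For a simple name like "col", all three forms resolve to the same pyspark.sql.Column, so in that case they are interchangeable (the exact repr varies by Spark version):

from pyspark.sql import functions as F

print(df.col)        #Column<col>
print(df['col'])     #Column<col>
print(F.col('col'))  #Column<col>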

1. df.col

This is the least flexible. You can only reference columns whose names are valid Python identifiers, i.e. names that can be accessed with the . operator. This rules out column names containing spaces or special characters, and column names that start with a digit.

This syntax makes a call to df.__getattr__("col").

print(df.__getattr__.__doc__)
#Returns the :class:`Column` denoted by ``name``.
#
#        >>> df.select(df.age).collect()
#        [Row(age=2), Row(age=5)]
#
#        .. versionadded:: 1.3

Using the . syntax, you can only access the first column of this example DataFrame.

>>> df.2col
  File "<ipython-input-39-8e82c2dd5b7c>", line 1
    df.2col
       ^
SyntaxError: invalid syntax
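
The first column, whose name is a valid identifier, works fine:

df.col
#Column<col>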

Under the hood, it checks to see if the column name is contained in df.columns and then returns the pyspark.sql.Column specified.
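
In simplified form, that check inside __getattr__ looks roughly like this (paraphrased from the PySpark source, with error-message details trimmed):

def __getattr__(self, name):
    # only names present in df.columns are resolved as columns
    if name not in self.columns:
        raise AttributeError("no column named %s" % name)
    # fetch the corresponding column from the underlying JVM DataFrame
    return Column(self._jdf.apply(name))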

2. df["col"]

This makes a call to df.__getitem__. You have some more flexibility in that you can do everything that __getattr__ can do, plus you can specify any column name.

df["2col"]
#Column<2col> 

Once again, under the hood some conditionals are checked and in this case the pyspark.sql.Column specified by the input string is returned.

In addition, you can pass in multiple columns (as a list or tuple) or column expressions.

from pyspark.sql.functions import expr
df[['col', expr('`third col` IS NULL')]].show()
#+----+-------------------+
#| col|(third col IS NULL)|
#+----+-------------------+
#|   1|              false|
#|   2|               true|
#|null|              false|
#+----+-------------------+

Note that in the case of multiple columns, __getitem__ is just making a call to pyspark.sql.DataFrame.select.
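
In other words, these two calls produce the same result:

df[['col', '2col']].show()  # equivalent to df.select('col', '2col').show()
#+----+----+
#| col|2col|
#+----+----+
#|   1|   a|
#|   2|   b|
#|null|   c|
#+----+----+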

Finally, you can also access columns by index:

df[2]
#Column<third col>
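
Putting the cases together, the dispatch inside __getitem__ looks roughly like this (simplified from the PySpark source):

def __getitem__(self, item):
    if isinstance(item, str):              # df["col"] -> Column
        return Column(self._jdf.apply(item))
    elif isinstance(item, Column):         # df[col_expr] -> filtered DataFrame
        return self.filter(item)
    elif isinstance(item, (list, tuple)):  # df[["a", "b"]] -> select
        return self.select(*item)
    elif isinstance(item, int):            # df[2] -> Column by position
        return Column(self._jdf.apply(self.columns[item]))
    else:
        raise TypeError("unexpected item type: %s" % type(item))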

3. pyspark.sql.functions.col

This returns a Column based on the given name. This is useful shorthand when you need to specify that you want a column and not a string literal.

For example, suppose we wanted to make a new column that would take on either the value from "col" or "third col", depending on the value of "2col":

from pyspark.sql.functions import when

df.withColumn(
    'new', 
    when(df['2col'].isin(['a', 'c']), 'third col').otherwise('col')
).show()
#+----+----+---------+---------+
#| col|2col|third col|      new|
#+----+----+---------+---------+
#|   1|   a|        0|third col|
#|   2|   b|     null|      col|
#|null|   c|        3|third col|
#+----+----+---------+---------+

Oops, that's not what I meant. Spark thought I wanted the literal strings "col" and "third col". Instead, what I should have written is:

from pyspark.sql.functions import col
df.withColumn(
    'new', 
    when(df['2col'].isin(['a', 'c']), col('third col')).otherwise(col('col'))
).show()
#+----+----+---------+---+
#| col|2col|third col|new|
#+----+----+---------+---+
#|   1|   a|        0|  0|
#|   2|   b|     null|  2|
#|null|   c|        3|  3|
#+----+----+---------+---+
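
Since F.col builds the Column from a name alone, the expression does not even need a DataFrame reference in scope at the point where it is written. A small sketch:

from pyspark.sql import functions as F

# define the expression first, attach it to a DataFrame later
has_third = F.col('third col').isNotNull()

df.filter(has_third).show()
#+----+----+---------+
#| col|2col|third col|
#+----+----+---------+
#|   1|   a|        0|
#|null|   c|        3|
#+----+----+---------+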


Source: https://stackoverflow.com/questions/55105363/pyspark-dataframe-column-reference-df-col-vs-dfcol-vs-f-colcol
