How to get name of dataframe column in pyspark?

Submitted by 耗尽温柔 on 2019-12-09 04:19:54

Question


In pandas, this can be done with column.name.

But how can the same be done when it is a column of a Spark DataFrame?

For example, the calling program has a Spark DataFrame, spark_df:

>>> spark_df.columns
['admit', 'gre', 'gpa', 'rank']

This program calls my function as my_function(spark_df['rank']). Inside my_function, I need the name of the column, i.e. 'rank'.

If it were a pandas DataFrame, we could use this inside my_function:

>>> pandas_df['rank'].name
'rank'

Answer 1:


You can get the names from the schema with:

spark_df.schema.names

Printing the schema can also be useful for visualizing it:

spark_df.printSchema()



Answer 2:


The only way is to go down a level, to the underlying JVM object.

df.col._jc.toString().encode('utf8')  # on Python 3, omit .encode('utf8') to get a str instead of bytes

This is also how it is converted to a str in the pyspark code itself.

From pyspark/sql/column.py:

def __repr__(self):
    return 'Column<%s>' % self._jc.toString().encode('utf8')
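
As a minimal sketch of how my_function from the question could use this, the helper below reads the name through the same private `_jc` handle. The name `column_name` is made up for illustration, and because `_jc` is an internal attribute this may not hold across PySpark versions. The stand-in object is only there so the sketch runs without a Spark session.

```python
from types import SimpleNamespace


def column_name(col):
    """Return the string form of a PySpark Column via its JVM handle.

    Relies on the private attribute `_jc`, as in the answer above.
    For a plain column reference such as spark_df['rank'] this is the
    column name; for expressions it is the expression text instead.
    """
    return col._jc.toString()


# With a real DataFrame the call would be: column_name(spark_df['rank'])
# Hypothetical stand-in object that mimics only `_jc.toString()`.
fake_rank = SimpleNamespace(_jc=SimpleNamespace(toString=lambda: "rank"))
print(column_name(fake_rank))  # prints: rank
```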



Answer 3:


If you want the column names of your DataFrame, you can get them through the pyspark.sql DataFrame API. I'm not sure whether the SDK supports explicitly indexing a DF by column name; when I tried it on the column list, I received this traceback:

>>> df.columns['High']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: list indices must be integers, not str

However, reading the columns attribute of your DataFrame, which you have done, will return a list of column names:

df.columns will return ['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close']

If you want the column data types, you can read the dtypes attribute:

df.dtypes will return [('Date', 'timestamp'), ('Open', 'double'), ('High', 'double'), ('Low', 'double'), ('Close', 'double'), ('Volume', 'int'), ('Adj Close', 'double')]

If you want a particular column name, you'll need to access it by integer index:

df.columns[2] will return 'High'
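
Since df.columns is a plain Python list and df.dtypes is a list of (name, type) tuples, ordinary list and dict operations cover lookups by name. The sketch below uses the literal values shown above in place of a live DataFrame; the variable names are illustrative only.

```python
# Values copied from the answer above; with a real DataFrame these would be
# df.columns and df.dtypes.
columns = ['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close']
dtypes = [('Date', 'timestamp'), ('Open', 'double'), ('High', 'double'),
          ('Low', 'double'), ('Close', 'double'), ('Volume', 'int'),
          ('Adj Close', 'double')]

# Position of a column name in the list (raises ValueError if absent).
position = columns.index('High')

# Turn the (name, type) pairs into a dict for lookup by column name.
type_by_name = dict(dtypes)

print(position)              # prints: 2
print(type_by_name['High'])  # prints: double
```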




Answer 4:


I found that the answer is very simple:

// This is Java, but it should be the same in PySpark
Column col = ds.col("colName"); // the column object
String theNameOftheCol = col.toString();

The variable theNameOftheCol is then "colName".
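
One caveat on the PySpark side: str() on a Python Column goes through the __repr__ quoted in Answer 2, so the result is wrapped as Column<...> rather than being the bare name. A minimal sketch of stripping that wrapper follows. The function name `name_from_repr` is hypothetical, and the quoted form `Column<'rank'>` is an assumption about how newer PySpark versions print columns.

```python
import re

# Matches the wrapper produced by Column.__repr__, e.g. "Column<rank>".
_COLUMN_REPR = re.compile(r"^Column<(.*)>$")
# Matches an optional bytes prefix plus surrounding quotes, e.g. b'rank'.
_QUOTED = re.compile(r"^b?(['\"])(.*)\1$")


def name_from_repr(text):
    """Extract the inner text from a string such as str(spark_df['rank'])."""
    match = _COLUMN_REPR.match(text)
    if match is None:
        raise ValueError("not a Column repr: %r" % text)
    inner = match.group(1)
    quoted = _QUOTED.match(inner)
    # Drop the b prefix and quotes when present; otherwise keep as-is.
    return quoted.group(2) if quoted else inner


# With a real DataFrame the call would be: name_from_repr(str(spark_df['rank']))
for sample in ("Column<rank>", "Column<b'rank'>", "Column<'rank'>"):
    print(name_from_repr(sample))  # prints: rank (three times)
```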



Source: https://stackoverflow.com/questions/39746752/how-to-get-name-of-dataframe-column-in-pyspark
