PySpark 2.0: The size or shape of a DataFrame


Question


I am trying to find out the size/shape of a DataFrame in PySpark. I do not see a single function that can do this.

In Python (pandas) I can do

data.shape

Is there a similar function in PySpark? This is my current solution, but I am looking for a more elegant one:

row_number = data.count()          # number of rows (triggers a Spark job)
column_number = len(data.dtypes)   # number of columns

The computation of the number of columns is not ideal...


Answer 1:


print((df.count(), len(df.columns)))



Answer 2:


Use df.count() to get the number of rows.
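For example (a minimal sketch, assuming a SparkSession named spark is already available; spark.range, count, and columns are standard PySpark calls):

df = spark.range(100).toDF("id")   # small example DataFrame
rows = df.count()                  # number of rows -- triggers a Spark job
cols = len(df.columns)             # number of columns -- metadata only, no job
print((rows, cols))                # (100, 1)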




Answer 3:


Add this to your code:

from pyspark.sql import DataFrame

def spark_shape(self):
    # (row count, column count), like pandas' .shape
    return (self.count(), len(self.columns))

DataFrame.shape = spark_shape  # monkey-patch so df.shape() works

Then you can do

>>> df.shape()
(10000, 10)

But keep in mind that .count() can be very slow for very large datasets.
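If an exact count is too expensive, one possible workaround (a sketch, not part of the original answer, assuming an approximate figure is acceptable) is the RDD-level countApprox, which returns the best estimate available within a timeout:

# Approximate row count: returns the estimate available after the
# timeout (in milliseconds); confidence defaults to 0.95.
approx_rows = df.rdd.countApprox(timeout=1000, confidence=0.95)
print((approx_rows, len(df.columns)))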




Answer 4:


print((df.count(), len(df.columns)))

is easier for smaller datasets.

However, if the dataset is huge, an alternative approach is to use Arrow to convert the DataFrame to a pandas DataFrame and then call .shape:

spark.conf.set("spark.sql.execution.arrow.enabled", "true")
spark.conf.set("spark.sql.crossJoin.enabled", "true")
print(df.toPandas().shape)



Answer 5:


I don't think there is a function similar to data.shape in Spark, but I would use len(data.columns) rather than len(data.dtypes).
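To illustrate the difference (a small sketch, assuming a DataFrame named data with an id and a name column):

data.columns        # ['id', 'name'] -- list of column names
data.dtypes         # [('id', 'bigint'), ('name', 'string')] -- (name, type) pairs
len(data.columns)   # 2, same result as len(data.dtypes), but simpler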



Source: https://stackoverflow.com/questions/39652767/pyspark-2-0-the-size-or-shape-of-a-dataframe
