How should I get the shape of a dask dataframe?

若如初见. Submitted on 2019-12-01 15:46:47

You can get the number of columns directly:

len(df.columns)  # this is fast

You can also call len on the dataframe itself, though beware that this will trigger a computation.

len(df)  # this requires a full scan of the data

Dask.dataframe doesn't know how many records are in your data without first reading through all of it.

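To get the shape, you can also try this:

 dask_dataframe.describe().compute()

The "count" row of the result gives the number of non-null values in each numeric column, which equals the number of rows only when the column has no missing values.

 len(dask_dataframe.columns)

This gives the number of columns in the dataframe.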

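With shape, you can do the following:

a = df.shape
a[0].compute(), a[1]

This will show the shape just as pandas does. The row count is lazy, which is why a[0] needs .compute(), while the column count a[1] is already a plain integer.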

Well, I know this is quite an old question, but I had the same issue and found an out-of-the-box solution that I want to record here.

Looking at your data, I suspect it was originally saved in a CSV-like file; in my situation, I just count the lines of that file (minus one for the header line). Inspired by this answer here, this is the solution I'm using:

   import dask.dataframe as dd
   from itertools import takewhile, repeat

   def rawincount(filename):
       # Count newline bytes by reading the file in 1 MiB chunks
       with open(filename, 'rb') as f:
           bufgen = takewhile(lambda x: x, (f.raw.read(1024*1024) for _ in repeat(None)))
           return sum(buf.count(b'\n') for buf in bufgen)

   filename = 'myHugeDataframe.csv'
   df = dd.read_csv(filename)
   # Subtract one for the header line; assumes the file ends with a newline
   df_shape = (rawincount(filename) - 1, len(df.columns))
   print(f"Shape: {df_shape}")

Hope this could help someone else as well.
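
One caveat with counting newline bytes: the result is one too low when the last row has no trailing newline, and it is too high when quoted fields contain embedded line breaks. The sketch below handles only the first case. It uses the standard library alone, and count_data_rows is a hypothetical helper name, not part of Dask:

```python
import os
import tempfile

def count_data_rows(path, chunk_size=1024 * 1024):
    """Count CSV data rows (header excluded) by counting newline bytes."""
    newlines = 0
    last_byte = b""
    with open(path, "rb") as f:
        while True:
            buf = f.read(chunk_size)
            if not buf:
                break
            newlines += buf.count(b"\n")
            last_byte = buf[-1:]
    # A final line without a trailing newline still counts as a line
    lines = newlines + (1 if last_byte and last_byte != b"\n" else 0)
    # Subtract the header line; an empty file has zero rows
    return max(lines - 1, 0)

# Hypothetical sample file: header plus three rows, no final newline
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"a,b\n1,2\n3,4\n5,6")
    path = tmp.name

n_rows = count_data_rows(path)
os.remove(path)
print(n_rows)
```

If the CSV may contain multi-line quoted fields, this shortcut is not reliable and a real parse (for example len(df) in Dask) is the safer choice.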
