How should I get the shape of a dask dataframe?

六眼飞鱼酱① submitted on 2019-12-01 14:27:58

Question


Calling .shape on my Dask dataframe gives me the following error.

AttributeError: 'DataFrame' object has no attribute 'shape'

How should I get the shape instead?


Answer 1:


You can get the number of columns directly

len(df.columns)  # this is fast

You can also call len on the dataframe itself, though beware that this will trigger a computation.

len(df)  # this requires a full scan of the data

Dask.dataframe doesn't know how many records are in your data without first reading through all of it.




Answer 2:


To get the shape, you can try this:

 dask_dataframe.describe().compute()

The "count" row of the result gives the number of non-null values in each numeric column. This equals the number of rows only for columns that have no missing values.

 len(dask_dataframe.columns)

This gives the number of columns in the dataframe.




Answer 3:


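If your Dask version provides shape, you can do the following:

a = df.shape
a[0].compute(), a[1]

The row count in a[0] is lazy and has to be computed, while a[1] is already a plain integer. This will show the shape just as pandas does.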




Answer 4:


Well, I know this is quite an old question, but I had the same issue and found an unconventional solution that I want to record here.

I assume your data was originally saved in a CSV-like file, so in my case I just count the lines of that file (minus one for the header line). This relies on every record occupying exactly one line. Inspired by another answer, this is the solution I'm using:

   import dask.dataframe as dd
   from itertools import takewhile, repeat

   def rawincount(filename):
       # Count newline bytes by reading the file in 1 MiB chunks.
       # The with block closes the file once counting is done.
       with open(filename, 'rb') as f:
           bufgen = takewhile(lambda x: x, (f.raw.read(1024*1024) for _ in repeat(None)))
           return sum(buf.count(b'\n') for buf in bufgen)

   filename = 'myHugeDataframe.csv'
   df = dd.read_csv(filename)
   # Rows = newline count minus the header line; columns come from metadata.
   df_shape = (rawincount(filename) - 1, len(df.columns))
   print(f"Shape: {df_shape}")

Hope this could help someone else as well.
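One caveat with counting raw newlines is that a quoted field containing a line break is counted as an extra row. A possible alternative, sketched below with only the standard library, is to count records with the csv module instead. The function name count_csv_records, the has_header parameter and the demo file are all made up for illustration, and this is slower than the raw byte count because every record is parsed.

```python
import csv
import os
import tempfile

def count_csv_records(filename, has_header=True):
    """Count data records in a CSV file, respecting quoted line breaks."""
    with open(filename, newline="", encoding="utf-8") as f:
        n = sum(1 for _ in csv.reader(f))
    # Subtract the header row if there is one.
    return n - 1 if has_header and n > 0 else n

# Demo on a hypothetical file whose first record contains an embedded newline.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        f.write('id,comment\n1,"line one\nline two"\n2,plain\n')

    n_records = count_csv_records(path)
    with open(path, "rb") as f:
        n_newlines = f.read().count(b"\n")

print(n_records)       # records as parsed by the csv module
print(n_newlines - 1)  # estimate from raw newline counting
```

Here the two numbers differ because the raw count treats the embedded line break as a row boundary.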



Source: https://stackoverflow.com/questions/50355598/how-should-i-get-the-shape-of-a-dask-dataframe
