How to estimate how much memory a Pandas' DataFrame will need?

谎友^ 2020-11-30 18:49

I have been wondering... If I am reading, say, a 400MB csv file into a pandas dataframe (using read_csv or read_table), is there any way to guesstimate how much memory this will need?

7 Answers
  • 2020-11-30 19:17

    Here's a comparison of the different methods - sys.getsizeof(df) is simplest.

    For this example, df is a dataframe with 814 rows, 11 columns (2 ints, 9 objects) - read from a 427kb shapefile

    sys.getsizeof(df)

    >>> import sys
    >>> sys.getsizeof(df)
    (gives results in bytes)
    462456
    

    df.memory_usage()

    >>> df.memory_usage()
    ...
    (lists each column at 8 bytes/row)
    
    >>> df.memory_usage().sum()
    71712
    (roughly rows * cols * 8 bytes)
    
    >>> df.memory_usage(deep=True)
    (lists each column's full memory usage)
    
    >>> df.memory_usage(deep=True).sum()
    (gives results in bytes)
    462432
    
    

    df.info()

    Prints dataframe info to stdout. Although the output is labelled "KB", these are technically kibibytes (KiB), not kilobytes - as the docstring says, "Memory usage is shown in human-readable units (base-2 representation)." So to get bytes, multiply by 1024, e.g. 451.6 KiB ≈ 462,438 bytes.

    >>> df.info()
    ...
    memory usage: 70.0+ KB
    
    >>> df.info(memory_usage='deep')
    ...
    memory usage: 451.6 KB
    
  • 2020-11-30 19:19

    If you know the dtypes of your array then you can directly compute the number of bytes that it will take to store your data + some for the Python objects themselves. A useful attribute of numpy arrays is nbytes. You can get the number of bytes from the arrays in a pandas DataFrame by doing

    nbytes = sum(block.values.nbytes for block in df.blocks.values())
    

    object dtype arrays store 8 bytes per element (a pointer to an opaque PyObject), so if you have strings in your csv you need to take into account that read_csv will turn those into object dtype arrays and adjust your calculations accordingly.

    EDIT:

    See the numpy scalar types page for more details on the object dtype. Since only a reference is stored you need to take into account the size of the object in the array as well. As that page says, object arrays are somewhat similar to Python list objects.
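
    On recent pandas versions the blocks attribute used above may no longer be available, so here is a hedged per-column sketch of the same idea (the example frame and column names are made up for illustration):

    import sys
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"a": np.arange(1000), "b": ["some text"] * 1000})

    # Raw array bytes per column; for object columns this counts only the
    # pointers (8 bytes each on a 64-bit build), not the objects they reference.
    array_bytes = sum(df[col].to_numpy().nbytes for col in df.columns)

    # Rough extra cost of the Python objects referenced by object columns.
    object_bytes = sum(
        sys.getsizeof(v)
        for col in df.select_dtypes(include="object").columns
        for v in df[col]
    )

    print(array_bytes + object_bytes)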

  • 2020-11-30 19:28

    I thought I would bring some more data to the discussion.

    I ran a series of tests on this issue.

    By using the python resource package I got the memory usage of my process.

    And by writing the csv into a StringIO buffer, I could easily measure the size of it in bytes.
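
    A minimal sketch of this measurement setup, assuming a Unix system (the resource module is not available on Windows) and a made-up frame size:

    import io
    import resource
    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.rand(100_000, 10))  # 100,000 rows, 10 float columns

    # Peak memory of the current process as reported by the OS
    # (ru_maxrss is in kilobytes on Linux, bytes on macOS).
    peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

    # Size of the equivalent csv, written into an in-memory buffer.
    buf = io.StringIO()
    df.to_csv(buf)
    csv_bytes = len(buf.getvalue())

    print(peak_rss, csv_bytes)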

    I ran two experiments, each one creating 20 dataframes of increasing sizes between 10,000 lines and 1,000,000 lines. Both having 10 columns.

    In the first experiment I used only floats in my dataset.

    This is how the memory increased in comparison to the csv file as a function of the number of lines. (Size in Megabytes)

    [Figure: memory and csv size in megabytes as a function of the number of rows, with float entries]

    The second experiment I had the same approach, but the data in the dataset consisted of only short strings.

    [Figure: memory and csv size in megabytes as a function of the number of rows, with string entries]

    It seems that the ratio between the size of the csv and the size of the dataframe can vary quite a lot, but the size in memory was always bigger than the csv by a factor of 2-3 (for the frame sizes in this experiment).

    I would love to complete this answer with more experiments, please comment if you want me to try something special.

  • 2020-11-30 19:31

    Yes there is. Pandas stores your data in 2-dimensional numpy ndarray structures, grouping them by dtype. An ndarray is basically a raw C array of data with a small header, so you can estimate its size simply by multiplying the itemsize of the dtype it contains by the dimensions of the array.

    For example: if you have 1000 rows with 2 np.int32 and 5 np.float64 columns, your DataFrame will have one 2x1000 np.int32 array and one 5x1000 np.float64 array which is:

    4 bytes * 2 * 1000 + 8 bytes * 5 * 1000 = 48000 bytes
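
    A quick sketch to check this estimate against pandas' own accounting (the frame below is made up to match the example):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        **{f"int_{k}": np.zeros(1000, dtype=np.int32) for k in range(2)},
        **{f"flt_{k}": np.zeros(1000, dtype=np.float64) for k in range(5)},
    })

    # itemsize * number of rows, summed over all columns
    estimate = sum(df[col].dtype.itemsize * len(df) for col in df.columns)

    print(estimate)                            # 48000
    print(df.memory_usage(index=False).sum())  # should agree for numeric dtypes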

  • 2020-11-30 19:35

    df.memory_usage() will return how many bytes each column occupies:

    >>> df.memory_usage()
    
    Row_ID            20906600
    Household_ID      20906600
    Vehicle           20906600
    Calendar_Year     20906600
    Model_Year        20906600
    ...
    

    To include the index, pass index=True (in current pandas this is already the default).

    So to get overall memory consumption:

    >>> df.memory_usage(index=True).sum()
    731731000
    

    Also, passing deep=True enables a more accurate memory usage report that accounts for the full usage of the contained objects. With deep=False (the default), the report does not include memory consumed by elements that are not components of the array itself, such as the string objects referenced by an object column.
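
    A short sketch of the shallow vs. deep difference, using a made-up frame with an object (string) column:

    import pandas as pd

    df = pd.DataFrame({"name": ["a fairly long string"] * 1000, "value": range(1000)})

    # Shallow: object columns are counted as 8 bytes per element (just the pointers).
    print(df.memory_usage(index=True).sum())

    # Deep: also counts the memory of the string objects themselves.
    print(df.memory_usage(index=True, deep=True).sum())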

  • 2020-11-30 19:35

    I believe this gives the in-memory size of any object in Python. The internals would need to be checked with regard to pandas and numpy.

    >>> import sys
    #assuming the dataframe to be df 
    >>> sys.getsizeof(df) 
    59542497
    