Best data types for binary variables in Pandas CSV import to decrease memory usage

Submitted by 余生颓废 on 2021-01-28 05:33:07

Question


My original training file is 25 GB, and my machine has 64 GB of RAM. Importing the data with default options always ends in a "Memory Error", so after reading some posts I found that the best option is to define all the data types explicitly.

For the purpose of this question I use a 100.7 MB CSV file (the MNIST data set pulled from https://pjreddie.com/media/files/mnist_train.csv).

When I import it with default options in pandas:

import pandas as pd

keys = ['pix{}'.format(x) for x in range(1, 785)]
data = pd.read_csv('C:/Users/UI378020/Desktop/mnist_train.csv', header=None, names = ['target'] + keys)
# you can also use directly the data from the internet
#data = pd.read_csv('https://pjreddie.com/media/files/mnist_train.csv',
#                    header=None, names = ['target'] + keys)

The default dtypes that pandas assigns are:

data.dtypes

How big is it in memory, in megabytes?

import sys
sys.getsizeof(data)/1000000

376.800104
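The figure above is consistent with 8 bytes per value: the MNIST training file has 60,000 rows and the frame has 785 columns, and 60,000 × 785 × 8 = 376,800,000 bytes. A minimal sketch of the same check using `DataFrame.memory_usage`, on a small synthetic frame (the names `frame`, `n_rows` and `n_cols` are made up for this illustration and do not come from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: 1,000 rows x 785 columns of int64.
n_rows, n_cols = 1000, 785
frame = pd.DataFrame(np.zeros((n_rows, n_cols), dtype=np.int64))

# Per-column byte counts; index=False leaves the row index out of the total.
column_bytes = frame.memory_usage(index=False, deep=True)
total_bytes = int(column_bytes.sum())

# int64 takes 8 bytes per value, so the total should be rows * columns * 8.
expected_bytes = n_rows * n_cols * 8
print(total_bytes, expected_bytes)
```

`memory_usage` returns one entry per column, which can help spot the columns that dominate the total.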

If I change the dtypes to np.int8:

import numpy as np

values = [np.int8 for x in range(1, 785)]

data = pd.read_csv('C:/Users/UI378020/Desktop/mnist_train.csv', header=None, names = ['target'] + keys, 
                   dtype = dict(zip(keys, values)))

My memory usage decreases to:

47.520104
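The drop matches 1 byte per pixel value, with the target column still at 8 bytes: 60,000 × 784 × 1 + 60,000 × 8 = 47,520,000 bytes. One caveat worth noting: int8 holds values from -128 to 127, while MNIST pixel values go up to 255, so `np.uint8` (0 to 255) is the matching 1-byte type for that file. For truly binary columns either type is enough. A small sketch of the same `dtype` mapping, assuming a made-up in-memory CSV (`csv_text` and `feature_names` are hypothetical):

```python
import io

import numpy as np
import pandas as pd

# Hypothetical three-row CSV: one target column followed by four binary features.
csv_text = "5,0,1,1,0\n3,1,0,0,1\n7,1,1,0,0\n"
feature_names = ['f{}'.format(i) for i in range(1, 5)]

small = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=['target'] + feature_names,
    # Map every feature column to a 1-byte unsigned integer.
    dtype={name: np.uint8 for name in feature_names},
)

print(small.dtypes)
# 3 rows * 8 bytes for the target, plus 4 columns * 3 rows * 1 byte.
print(small.memory_usage(index=False).sum())
```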

My question is: what would be an even better data type for binary variables, to reduce the size even further?


Answer 1:


According to the NumPy documentation, the smallest integer type you can allocate for items in an array is the "int8" dtype, which corresponds to "int8_t" in C.

For binary lists or list-like objects, the "uint8", "int8", "byte" and "bool" types all give the same allocation per item: 1 byte.
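The per-item sizes can be checked directly through the `itemsize` attribute of each NumPy dtype. The snippet below is only a quick verification sketch; the variable names `sizes` and `flags` are made up here:

```python
import numpy as np

# Bytes used per element for each of the dtypes named in the answer.
sizes = {name: np.dtype(name).itemsize
         for name in ('int8', 'uint8', 'byte', 'bool')}
print(sizes)

# A boolean array of five flags therefore occupies five bytes of data.
flags = np.array([0, 1, 1, 0, 1], dtype=np.bool_)
print(flags.nbytes)
```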



Source: https://stackoverflow.com/questions/57374843/best-data-types-for-binary-variables-in-pandas-csv-import-to-decrease-memory-usa
