Best data types for binary variables in Pandas CSV import to decrease memory usage

Question

My original training file is 25 GB and my machine has 64 GB of RAM. Importing the data with default options always ends in a "Memory Error", so after reading some posts I found that the best option is to define all the data types explicitly.

For the purpose of this question I use a 100.7 MB CSV file (the MNIST data set, pulled from https://pjreddie.com/media/files/mnist_train.csv).

When I import it with default options in pandas:

import pandas as pd

# one column name per pixel: pix1 ... pix784
keys = ['pix{}'.format(x) for x in range(1, 785)]
data = pd.read_csv('C:/Users/UI378020/Desktop/mnist_train.csv', header=None, names=['target'] + keys)
# you can also read the data directly from the internet
#data = pd.read_csv('https://pjreddie.com/media/files/mnist_train.csv',
#                    header=None, names=['target'] + keys)

The default dtypes in pandas are:

data.dtypes
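
The output of data.dtypes is not shown here. As a small illustration, assuming a made-up three-column CSV held in memory (the names csv_text and small are hypothetical, and this is not the real file), integer columns come back as int64 when no dtype is given:

```python
import io

import pandas as pd

# Hypothetical miniature of the MNIST layout: one target column, two pixel columns.
csv_text = "5,0,255\n0,12,0\n"
names = ["target", "pix1", "pix2"]

small = pd.read_csv(io.StringIO(csv_text), header=None, names=names)

# With no dtype argument, pandas infers a 64-bit integer for every column here.
print(small.dtypes)
```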

How big is it in memory?

import sys
sys.getsizeof(data)/1000000

376.800104
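
As a side note, pandas can also report the size itself through DataFrame.memory_usage, which gives a per-column breakdown. A minimal sketch, assuming a synthetic frame (df is a hypothetical stand-in, not the MNIST data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: 1000 rows and 10 columns of 64-bit integers.
df = pd.DataFrame(np.zeros((1000, 10), dtype=np.int64))

# Bytes used by each column, excluding the index.
per_column = df.memory_usage(index=False, deep=True)
total_bytes = int(per_column.sum())

print(total_bytes / 1e6, "MB")
```

deep=True only matters for object columns (such as strings); for purely numeric frames it gives the same figure as the default.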

If I change the dtypes to np.int8:

import numpy as np

# one np.int8 entry per pixel column ('target' keeps its default dtype)
values = [np.int8 for x in range(1, 785)]

data = pd.read_csv('C:/Users/UI378020/Desktop/mnist_train.csv', header=None, names=['target'] + keys,
                   dtype=dict(zip(keys, values)))

My memory usage decreases to:

47.520104
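
Both figures can be reproduced with plain arithmetic, which shows where the bytes go. The sketch below assumes the standard MNIST training set size of 60,000 rows, which is not stated in the question; the variable names are only for illustration:

```python
rows = 60_000        # assumption: standard MNIST training set size
pixel_cols = 784     # pix1 ... pix784
target_cols = 1      # the 'target' column

# Default import: every column is a 64-bit (8-byte) integer.
default_bytes = rows * (pixel_cols + target_cols) * 8

# Second import: pixel columns take 1 byte each, while 'target' is not in
# the dtype dict and therefore stays at 8 bytes.
int8_bytes = rows * (pixel_cols * 1 + target_cols * 8)

print(default_bytes / 1e6)
print(int8_bytes / 1e6)
```

These come to 376.8 and 47.52, matching the reported values up to roughly a hundred bytes of object and index overhead. Since each pixel column already costs one byte per value, the remaining per-column saving would come from 'target', which still uses eight.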

My question is: what would be an even better data type for binary variables, to decrease the size even more?

Answer 1:

According to the NumPy documentation, the smallest integer type available for array items is NumPy's "int8" dtype, which corresponds to "int8_t" in C.

For binary lists or list-like objects, the "uint8", "int8", "byte" and "bool" types all take the same space per item: 1 byte.
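
The one-byte-per-item claim can be checked with itemsize. Going below one byte per value is not something the answer proposes; one possibility outside pandas is NumPy's packbits, which stores eight 0/1 values in each byte. A rough sketch, assuming a made-up binary matrix (bits, packed and restored are hypothetical names):

```python
import numpy as np

# Each of these dtypes stores one element per byte.
for dt in (np.int8, np.uint8, np.byte, np.bool_):
    print(np.dtype(dt).name, np.dtype(dt).itemsize)

# Hypothetical binary feature matrix: 4 rows, 16 columns of zeros and ones.
bits = np.array([[0, 1] * 8] * 4, dtype=np.uint8)

packed = np.packbits(bits, axis=1)        # 8 columns squeezed into each byte
restored = np.unpackbits(packed, axis=1)  # back to one value per byte

print(bits.nbytes, packed.nbytes)
```

The packed array has to be unpacked before it can be used as ordinary columns, so this trades convenience for memory. Note also that int8 covers -128 to 127 while uint8 covers 0 to 255, so for values such as the 0 to 255 pixel intensities in MNIST, uint8 is the one-byte type that fits.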

Source: https://stackoverflow.com/questions/57374843/best-data-types-for-binary-variables-in-pandas-csv-import-to-decrease-memory-usa

Tags

python

python-3.x

pandas

csv