Question
My original training file is 25 GB and my machine has 64 GB of RAM. Importing the data with default options always ends in a "Memory Error", so after reading some posts I found that the best option is to define all the data types explicitly.
For the purpose of this question I use a 100.7 MB CSV file (the MNIST training set pulled from https://pjreddie.com/media/files/mnist_train.csv).
When I import it with the default options in pandas:
import numpy as np
import pandas as pd

# column names: one 'target' label column followed by 784 pixel columns
keys = ['pix{}'.format(x) for x in range(1, 785)]
data = pd.read_csv('C:/Users/UI378020/Desktop/mnist_train.csv',
                   header=None, names=['target'] + keys)
# you can also read the data directly from the internet
# data = pd.read_csv('https://pjreddie.com/media/files/mnist_train.csv',
#                    header=None, names=['target'] + keys)
The dtypes pandas infers by default are:
data.dtypes
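With no dtype specified, pandas infers 64-bit integers for every column here, so the output looks roughly like this (abbreviated):

target    int64
pix1      int64
...
pix784    int64
dtype: object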
How big is it in memory?
import sys
sys.getsizeof(data)/1000000
376.800104
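That 376.8 MB lines up with the default layout: 60,000 rows × 785 int64 columns × 8 bytes ≈ 376.8 MB. sys.getsizeof is a reasonable estimate here because the frame is all plain numeric blocks; a more direct, pandas-native check (a small sketch using pandas' own memory accounting) is:

# per-column byte counts; deep=True also measures Python objects in object columns
data.memory_usage(deep=True).sum() / 1e6
# or as a summary report:
data.info(memory_usage='deep')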
If I change the dtypes to np.int8:
# map every pixel column to int8 (the 'target' column keeps its default dtype)
values = [np.int8] * 784
data = pd.read_csv('C:/Users/UI378020/Desktop/mnist_train.csv',
                   header=None, names=['target'] + keys,
                   dtype=dict(zip(keys, values)))
My memory usage decreases to:
47.520104
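That figure is consistent with the new layout: the 784 pixel columns now take 60,000 × 784 × 1 byte ≈ 47.04 MB, and the target column, left at its default int64, adds 60,000 × 8 bytes ≈ 0.48 MB, giving roughly 47.52 MB plus a small amount of object overhead. Reading target as a small integer dtype as well would only shave off most of that remaining 0.48 MB.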
My question is: what would be an even better data type for binary variables, to decrease the size even further?
Answer 1:
Referring to the NumPy documentation, the smallest dtype NumPy can allocate per item in an array is int8, which corresponds to int8_t in C.
For binary lists / list-like objects, the uint8, int8, byte, and bool dtypes all allocate the same amount per item, namely 1 byte, so none of them will make the frame smaller than int8 already does.
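A quick way to verify that 1-byte floor (a minimal sketch using NumPy's dtype itemsize) is:

import numpy as np

# each of these dtypes stores one value per byte; NumPy has no sub-byte integer dtype
for dt in (np.uint8, np.int8, np.byte, np.bool_):
    print(np.dtype(dt).name, np.dtype(dt).itemsize)  # every itemsize printed is 1

Note also that raw MNIST pixel values run from 0 to 255, so unless the data have already been binarized, uint8 is the safe choice of the four; int8 only covers -128 to 127.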
Source: https://stackoverflow.com/questions/57374843/best-data-types-for-binary-variables-in-pandas-csv-import-to-decrease-memory-usa