pickle

Pickling a pandas DataFrame multiplies the file size by 5

做~自己de王妃 submitted on 2020-01-06 02:47:06
Question: I am reading an 800 MB CSV file with pandas.read_csv, and then use plain Python pickle.dump(dataframe) to save it. The result is a 4 GB .pkl file, so the CSV size is multiplied by 5. I expected pickle to compress the data rather than expand it, especially since gzipping the CSV file compresses it to 200 MB, dividing its size by 4. I want to speed up the loading time of my program and thought that pickling would help, but considering disk access is the main bottleneck I am…
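The excerpt cuts off here. As a minimal sketch (assuming Python 3 and a placeholder file name data.csv standing in for the 800 MB input), writing the pickle with the highest binary protocol plus a gzip wrapper usually avoids this blow-up, since protocol 0 (the old Python 2 default) is an ASCII format that can easily be several times larger than the source CSV:

    import gzip
    import pickle
    import pandas as pd

    df = pd.read_csv("data.csv")  # hypothetical path standing in for the 800 MB CSV

    # Binary protocol plus gzip: protocol 0 stores everything as ASCII text
    # and is typically much larger than the CSV it came from.
    with gzip.open("data.pkl.gz", "wb") as f:
        pickle.dump(df, f, protocol=pickle.HIGHEST_PROTOCOL)

    # Reading it back:
    with gzip.open("data.pkl.gz", "rb") as f:
        df2 = pickle.load(f)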

Python pickle: pickled objects are not equal to source objects

浪子不回头ぞ submitted on 2020-01-05 10:16:31
Question: I think this is expected behaviour, but I want to check and maybe find out why, as the research I have done has come up blank. I have a function that pulls data, creates a new instance of my custom class, and then appends it to a list. The class just contains variables. I then pickle that list to a file using protocol 2 as binary. Later I re-run the script, re-pull the data from my source, and have a new list of my custom class instances; for testing I keep the source data the same. Reload…
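The usual explanation is that user-defined classes compare by identity unless they define __eq__, so unpickled copies can never equal the originals. A minimal sketch under that assumption (the Record class and its fields are made up for illustration):

    import pickle

    class Record(object):
        def __init__(self, name, value):
            self.name = name
            self.value = value

        def __eq__(self, other):
            # Compare by contents; without this, == falls back to object identity.
            return isinstance(other, Record) and self.__dict__ == other.__dict__

    original = [Record("a", 1), Record("b", 2)]
    restored = pickle.loads(pickle.dumps(original, protocol=2))

    print(original == restored)  # True with __eq__ defined; False without it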

How to get a Python function's dependencies for pickling?

筅森魡賤 submitted on 2020-01-05 08:53:51
Question: As a follow-up to this question: How to pickle a python function with its dependencies? What is a good approach for determining a method's dependencies? For instance, similar to the above post, if I have a function f that uses methods g and y, is there an easy way to get a reference to g and y dynamically? Further, I guess you would want this method to recurse down the entire function graph, so that if y depended on z you could also bundle up z. I see that disco uses the following module for…
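One common, deliberately incomplete approach is to walk the names a function's bytecode refers to and look them up in its globals, recursing into any functions found. A minimal sketch assuming plain module-level functions (it ignores closures, default arguments, and attribute lookups; the example functions f, g, y, z are made up):

    import types

    def referenced_functions(func, seen=None):
        """Collect functions reachable from *func* via the global names it uses."""
        if seen is None:
            seen = {}
        for name in func.__code__.co_names:
            obj = func.__globals__.get(name)
            if isinstance(obj, types.FunctionType) and name not in seen:
                seen[name] = obj
                referenced_functions(obj, seen)  # recurse down the call graph
        return seen

    def z(): return 3
    def y(): return z() + 1
    def g(): return 2
    def f(): return g() + y()

    print(sorted(referenced_functions(f)))  # ['g', 'y', 'z']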

How to create a persistent class using pickle in Python

给你一囗甜甜゛ submitted on 2020-01-05 08:52:17
Question: New to Python... I have the following class Key, which extends dict: class Key( dict ): def __init__( self ): self = { some dictionary stuff... } def __getstate__(self): state = self.__dict__.copy() return state def __setstate__(self, state): self.__dict__.update( state ). I want to save an instance of the class with its data using pickle.dump and then retrieve the data using pickle.load. I understand that I am supposed to somehow change __getstate__ and __setstate__, however I am not entirely…
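A likely culprit in the snippet is the line self = {...}: it only rebinds the local name and leaves the actual instance empty, so there is nothing for pickle to save. A minimal sketch (the dictionary contents are made up) that fills the instance with update instead; dict subclasses already carry their items through a protocol 2 round trip, so no custom __getstate__/__setstate__ is needed for this simple case:

    import pickle

    class Key(dict):
        def __init__(self):
            super(Key, self).__init__()
            # Fill the instance in place; "self = {...}" would only rebind the
            # local name and leave the object empty.
            self.update({"host": "localhost", "port": 5432})

    k = Key()
    with open("key.pkl", "wb") as f:
        pickle.dump(k, f, protocol=2)

    with open("key.pkl", "rb") as f:
        restored = pickle.load(f)

    print(restored == k)  # True: the dict items survive the round trip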

Shove failing because of pretty basic ld vs. optimize issue in stuf.util

纵饮孤独 submitted on 2020-01-05 05:56:39
Question: I ran into another issue with shove (see Shove knowing about an object but unable to retrieve it), but this time I've got a pretty simple repro showing why the dump/load doesn't work. Looking at the definition in C:\Python27\lib\site-packages\shove-0.5.0-py2.7.egg\shove\base.py for loads/dumps, it refers to ld, optimize in stuf.utils. How come the below does not work? >>> from stuf.utils import ld,optimize; d=[{'A':1},{'A':1}]; ld(optimize(d)) [{'A': 1}, {'A': 1}] >>> from stuf.utils import ld…
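For comparison, here is the same round trip using only the standard library, on the assumption that shove's optimize/ld helpers are essentially dump-then-load wrappers (this is an assumption about their role, not shove's actual implementation); pickletools.optimize stands in for the opcode-stripping step:

    import pickle
    import pickletools

    d = [{'A': 1}, {'A': 1}]

    # Serialise, strip unused PUT opcodes, then load the result back.
    raw = pickle.dumps(d, protocol=pickle.HIGHEST_PROTOCOL)
    optimized = pickletools.optimize(raw)
    print(pickle.loads(optimized))  # [{'A': 1}, {'A': 1}]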

Can I store a file (HDF5 file) in another file with serialization?

↘锁芯ラ submitted on 2020-01-05 04:41:10
Question: I have an HDF5 file and a list of objects that I need to store for saving functionality. For simplicity I want to create only one save file. Can I store the H5 file in my save file, which I create with serialization (pickle), without opening the H5 file? Answer 1: You can put several files in one by using zipfile or tarfile. For zipfile you would write the database files and writestr your pickle.dumps'ed data; for tarfile you would add the database file and gettarinfo / addfile your pickle.dump'ed data from a…
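A minimal sketch of the zipfile route described in the answer (the file names and the objects dictionary are placeholders): the existing HDF5 file is copied into the archive byte-for-byte with write, so it is never opened by h5py, and the pickled objects go in via writestr:

    import pickle
    import zipfile

    objects = {"threshold": 0.5, "labels": ["a", "b"]}  # hypothetical state to save

    # Bundle the untouched HDF5 file and the pickled objects into one archive.
    with zipfile.ZipFile("save.zip", "w") as zf:
        zf.write("data.h5")                                   # copied as raw bytes
        zf.writestr("objects.pkl", pickle.dumps(objects, 2))  # serialised in memory

    # Restoring later:
    with zipfile.ZipFile("save.zip", "r") as zf:
        zf.extract("data.h5", path="restored")
        restored_objects = pickle.loads(zf.read("objects.pkl"))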

Why do pickle + gzip outperform h5py on repetitive datasets?

故事扮演 submitted on 2020-01-05 03:04:43
Question: I am saving a numpy array which contains repetitive data: import numpy as np import gzip import cPickle as pkl import h5py a = np.random.randn(100000, 10) b = np.hstack( [a[cnt:a.shape[0]-10+cnt+1] for cnt in range(10)] ) f_pkl_gz = gzip.open('noise.pkl.gz', 'w') pkl.dump(b, f_pkl_gz, protocol = pkl.HIGHEST_PROTOCOL) f_pkl_gz.close() f_pkl = open('noise.pkl', 'w') pkl.dump(b, f_pkl, protocol = pkl.HIGHEST_PROTOCOL) f_pkl.close() f_hdf5 = h5py.File('noise.hdf5', 'w') f_hdf5.create_dataset('b',…
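The snippet stops before the create_dataset call finishes, but the usual observation for this question is that the HDF5 dataset above is written uncompressed, while the .pkl.gz variant benefits from gzip's deflate on highly repetitive data. A minimal sketch (not necessarily the accepted answer) that levels the playing field by asking h5py for gzip compression on the dataset itself:

    import numpy as np
    import h5py

    a = np.random.randn(100000, 10)
    b = np.hstack([a[cnt:a.shape[0] - 10 + cnt + 1] for cnt in range(10)])

    # Store the array with gzip (deflate) compression inside the HDF5 file.
    with h5py.File("noise_compressed.hdf5", "w") as f:
        f.create_dataset("b", data=b, compression="gzip", compression_opts=9)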

How to unpickle a file hosted at a web URL in Python

烈酒焚心 submitted on 2020-01-05 02:03:19
Question: The normal way to pickle and unpickle an object is as follows: Pickle an object: import cloudpickle as cp cp.dump(objects, open("picklefile.pkl", 'wb')) Unpickle an object (load the pickled file): loaded_pickle_object = cp.load(open("picklefile.pkl", 'rb')) Now, what if the pickled object is hosted on a server, for example on Google Drive? I am not able to unpickle the object if I directly provide the URL of that object in the path. The following is not working: I get an IOError. Unpickle an…
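The built-in open only understands local paths, hence the IOError. A minimal sketch (the URL is a placeholder) that fetches the bytes over HTTP first and then unpickles them from an in-memory buffer; a cloudpickle-produced file is an ordinary pickle stream, so the standard pickle module can load it:

    import io
    import pickle
    from urllib.request import urlopen  # Python 3

    url = "https://example.com/picklefile.pkl"  # placeholder URL

    # Download the raw bytes, then unpickle from memory instead of a path.
    with urlopen(url) as response:
        payload = response.read()

    loaded_pickle_object = pickle.load(io.BytesIO(payload))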

Similar errors in multiprocessing: mismatched number of arguments to function

ε祈祈猫儿з submitted on 2020-01-04 05:49:44
Question: I couldn't find a better way to describe the error I'm facing, but this error seems to come up every time I try to apply multiprocessing to a loop call. I've used both sklearn.externals.joblib and multiprocessing.Process, but the errors are similar though different. Original loop to which I want to apply multiprocessing, where one iteration is executed in a single thread/process: for dd in final_col_dates: idx1 = final_col_dates.tolist().index(dd) dataObj = GetPrevDataByDate(d1, a, dd, self…
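A common fix for this family of errors is to move the loop body into a top-level (and therefore picklable) worker that takes exactly one argument, then map the iterable over a Pool. A minimal sketch under that assumption; process_date, the sample dates, and the pool size are all made up, and the real GetPrevDataByDate call would go inside the worker:

    from multiprocessing import Pool

    def process_date(args):
        """Top-level worker: picklable, and takes a single tuple of arguments."""
        idx, date = args
        # ... stand-in for the real work, e.g. GetPrevDataByDate(d1, a, date, ...)
        return idx, date

    if __name__ == "__main__":
        final_col_dates = ["2020-01-01", "2020-01-02", "2020-01-03"]  # hypothetical
        tasks = list(enumerate(final_col_dates))

        pool = Pool(processes=4)
        results = pool.map(process_date, tasks)  # one argument per call avoids the mismatch
        pool.close()
        pool.join()
        print(results)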