pickle

Pickling a pandas DataFrame multiplies the file size by 5

做~自己de王妃 submitted on 2020-01-06 02:47:06
Question: I am reading an 800 MB CSV file with pandas.read_csv, and then use plain Python pickle.dump(dataframe) to save it. The result is a 4 GB .pkl file, so the CSV size is multiplied by 5. I expected pickle to compress the data rather than expand it, especially since gzipping the CSV file compresses it to 200 MB, dividing its size by 4. I want to speed up the loading time of my program and thought that pickling would help, but considering disk access is the main bottleneck I am…
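The excerpt cuts off here. As a minimal sketch (assuming Python 3 and a placeholder file name data.csv standing in for the 800 MB input), writing the pickle with the highest binary protocol plus a gzip wrapper usually avoids this blow-up, since protocol 0 (the old Python 2 default) is an ASCII format that can easily be several times larger than the source CSV:

    import gzip
    import pickle
    import pandas as pd

    df = pd.read_csv("data.csv")  # hypothetical path standing in for the 800 MB CSV

    # Binary protocol plus gzip: protocol 0 stores everything as ASCII text
    # and is typically much larger than the CSV it came from.
    with gzip.open("data.pkl.gz", "wb") as f:
        pickle.dump(df, f, protocol=pickle.HIGHEST_PROTOCOL)

    # Reading it back:
    with gzip.open("data.pkl.gz", "rb") as f:
        df2 = pickle.load(f)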

Python pickle: pickled objects are not equal to source objects

浪子不回头ぞ submitted on 2020-01-05 10:16:31
Question: I think this is expected behaviour, but I want to check and maybe find out why, as the research I have done has come up blank. I have a function that pulls data, creates a new instance of my custom class, and then appends it to a list. The class just contains variables. I then pickle that list to a file using protocol 2 as binary. Later I re-run the script, re-pull the data from my source, and have a new list of my custom class instances; for testing I keep the source data the same. Reload…
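The usual explanation is that user-defined classes compare by identity unless they define __eq__, so unpickled copies can never equal the originals. A minimal sketch under that assumption (the Record class and its fields are made up for illustration):

    import pickle

    class Record(object):
        def __init__(self, name, value):
            self.name = name
            self.value = value

        def __eq__(self, other):
            # Compare by contents; without this, == falls back to object identity.
            return isinstance(other, Record) and self.__dict__ == other.__dict__

    original = [Record("a", 1), Record("b", 2)]
    restored = pickle.loads(pickle.dumps(original, protocol=2))

    print(original == restored)  # True with __eq__ defined; False without it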

How to get a Python function's dependencies for pickling?

筅森魡賤 submitted on 2020-01-05 08:53:51
Question: As a follow-up to this question: How to pickle a python function with its dependencies? What is a good approach for determining a method's dependencies? For instance, similar to the above post, if I have a function f that uses methods g and y, is there an easy way to get a reference to g and y dynamically? Further, I guess you would want this method to recurse down the entire function graph, so that if y depended on z you could also bundle up z. I see that disco uses the following module for…
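One common, deliberately incomplete approach is to walk the names a function's bytecode refers to and look them up in its globals, recursing into any functions found. A minimal sketch assuming plain module-level functions (it ignores closures, default arguments, and attribute lookups; the example functions f, g, y, z are made up):

    import types

    def referenced_functions(func, seen=None):
        """Collect functions reachable from *func* via the global names it uses."""
        if seen is None:
            seen = {}
        for name in func.__code__.co_names:
            obj = func.__globals__.get(name)
            if isinstance(obj, types.FunctionType) and name not in seen:
                seen[name] = obj
                referenced_functions(obj, seen)  # recurse down the call graph
        return seen

    def z(): return 3
    def y(): return z() + 1
    def g(): return 2
    def f(): return g() + y()

    print(sorted(referenced_functions(f)))  # ['g', 'y', 'z']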

How to create a persistent class using pickle in Python

给你一囗甜甜゛ submitted on 2020-01-05 08:52:17
Question: New to Python... I have the following class Key, which extends dict: class Key( dict ): def __init__( self ): self = { some dictionary stuff... } def __getstate__(self): state = self.__dict__.copy() return state def __setstate__(self, state): self.__dict__.update( state ). I want to save an instance of the class with its data using pickle.dump and then retrieve the data using pickle.load. I understand that I am supposed to somehow change __getstate__ and __setstate__, however I am not entirely…
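A likely culprit in the snippet is the line self = {...}: it only rebinds the local name and leaves the actual instance empty, so there is nothing for pickle to save. A minimal sketch (the dictionary contents are made up) that fills the instance with update instead; dict subclasses already carry their items through a protocol 2 round trip, so no custom __getstate__/__setstate__ is needed for this simple case:

    import pickle

    class Key(dict):
        def __init__(self):
            super(Key, self).__init__()
            # Fill the instance in place; "self = {...}" would only rebind the
            # local name and leave the object empty.
            self.update({"host": "localhost", "port": 5432})

    k = Key()
    with open("key.pkl", "wb") as f:
        pickle.dump(k, f, protocol=2)

    with open("key.pkl", "rb") as f:
        restored = pickle.load(f)

    print(restored == k)  # True: the dict items survive the round trip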

Shove failing because of pretty basic ld vs. optimize issue in stuf.util

纵饮孤独 submitted on 2020-01-05 05:56:39
Question: I ran into another issue with shove (see Shove knowing about an object but unable to retrieve it), but this time I've got a pretty simple repro showing why the dump/load doesn't work. Looking at the definition in C:\Python27\lib\site-packages\shove-0.5.0-py2.7.egg\shove\base.py for loads/dumps, it refers to ld, optimize in stuf.utils. How come the below does not work? >>> from stuf.utils import ld,optimize; d=[{'A':1},{'A':1}]; ld(optimize(d)) [{'A': 1}, {'A': 1}] >>> from stuf.utils import ld…
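For comparison, here is the same round trip using only the standard library, on the assumption that shove's optimize/ld helpers are essentially dump-then-load wrappers (this is an assumption about their role, not shove's actual implementation); pickletools.optimize stands in for the opcode-stripping step:

    import pickle
    import pickletools

    d = [{'A': 1}, {'A': 1}]

    # Serialise, strip unused PUT opcodes, then load the result back.
    raw = pickle.dumps(d, protocol=pickle.HIGHEST_PROTOCOL)
    optimized = pickletools.optimize(raw)
    print(pickle.loads(optimized))  # [{'A': 1}, {'A': 1}]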

Can I store a file (HDF5 file) in another file with serialization?

↘锁芯ラ submitted on 2020-01-05 04:41:10
Question: I have an HDF5 file and a list of objects that I need to store for saving functionality. For simplicity I want to create only one save file. Can I store the H5 file in my save file, which I create with serialization (pickle), without opening the H5 file? Answer 1: You can put several files in one by using zipfile or tarfile. For zipfile you would write the database files and writestr your pickle.dumps'ed data; for tarfile you would add the database file and gettarinfo / addfile your pickle.dump'ed data from a…
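A minimal sketch of the zipfile route described in the answer (the file names and the objects dictionary are placeholders): the existing HDF5 file is copied into the archive byte-for-byte with write, so it is never opened by h5py, and the pickled objects go in via writestr:

    import pickle
    import zipfile

    objects = {"threshold": 0.5, "labels": ["a", "b"]}  # hypothetical state to save

    # Bundle the untouched HDF5 file and the pickled objects into one archive.
    with zipfile.ZipFile("save.zip", "w") as zf:
        zf.write("data.h5")                                   # copied as raw bytes
        zf.writestr("objects.pkl", pickle.dumps(objects, 2))  # serialised in memory

    # Restoring later:
    with zipfile.ZipFile("save.zip", "r") as zf:
        zf.extract("data.h5", path="restored")
        restored_objects = pickle.loads(zf.read("objects.pkl"))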

Why do pickle + gzip outperform h5py on repetitive datasets?

故事扮演 submitted on 2020-01-05 03:04:43
Question: I am saving a numpy array which contains repetitive data: import numpy as np import gzip import cPickle as pkl import h5py a = np.random.randn(100000, 10) b = np.hstack( [a[cnt:a.shape[0]-10+cnt+1] for cnt in range(10)] ) f_pkl_gz = gzip.open('noise.pkl.gz', 'w') pkl.dump(b, f_pkl_gz, protocol = pkl.HIGHEST_PROTOCOL) f_pkl_gz.close() f_pkl = open('noise.pkl', 'w') pkl.dump(b, f_pkl, protocol = pkl.HIGHEST_PROTOCOL) f_pkl.close() f_hdf5 = h5py.File('noise.hdf5', 'w') f_hdf5.create_dataset('b',…
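The snippet stops before the create_dataset call finishes, but the usual observation for this question is that the HDF5 dataset above is written uncompressed, while the .pkl.gz variant benefits from gzip's deflate on highly repetitive data. A minimal sketch (not necessarily the accepted answer) that levels the playing field by asking h5py for gzip compression on the dataset itself:

    import numpy as np
    import h5py

    a = np.random.randn(100000, 10)
    b = np.hstack([a[cnt:a.shape[0] - 10 + cnt + 1] for cnt in range(10)])

    # Store the array with gzip (deflate) compression inside the HDF5 file.
    with h5py.File("noise_compressed.hdf5", "w") as f:
        f.create_dataset("b", data=b, compression="gzip", compression_opts=9)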

How to unpickle a file hosted at a web URL in Python

烈酒焚心 submitted on 2020-01-05 02:03:19
Question: The normal way to pickle and unpickle an object is as follows: Pickle an object: import cloudpickle as cp cp.dump(objects, open("picklefile.pkl", 'wb')) Unpickle an object (load the pickled file): loaded_pickle_object = cp.load(open("picklefile.pkl", 'rb')) Now, what if the pickled object is hosted on a server, for example on Google Drive? I am not able to unpickle the object if I directly provide the URL of that object in the path. The following is not working: I get an IOError. Unpickle an…
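The built-in open only understands local paths, hence the IOError. A minimal sketch (the URL is a placeholder) that fetches the bytes over HTTP first and then unpickles them from an in-memory buffer; a cloudpickle-produced file is an ordinary pickle stream, so the standard pickle module can load it:

    import io
    import pickle
    from urllib.request import urlopen  # Python 3

    url = "https://example.com/picklefile.pkl"  # placeholder URL

    # Download the raw bytes, then unpickle from memory instead of a path.
    with urlopen(url) as response:
        payload = response.read()

    loaded_pickle_object = pickle.load(io.BytesIO(payload))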

Similar errors in multiprocessing: mismatched number of arguments to function

ε祈祈猫儿з submitted on 2020-01-04 05:49:44
Question: I couldn't find a better way to describe the error I'm facing, but this error seems to come up every time I try to apply multiprocessing to a loop call. I've used both sklearn.externals.joblib and multiprocessing.Process, but the errors are similar though different. Original loop to which I want to apply multiprocessing, where one iteration is executed in a single thread/process: for dd in final_col_dates: idx1 = final_col_dates.tolist().index(dd) dataObj = GetPrevDataByDate(d1, a, dd, self…
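A common fix for this family of errors is to move the loop body into a top-level (and therefore picklable) worker that takes exactly one argument, then map the iterable over a Pool. A minimal sketch under that assumption; process_date, the sample dates, and the pool size are all made up, and the real GetPrevDataByDate call would go inside the worker:

    from multiprocessing import Pool

    def process_date(args):
        """Top-level worker: picklable, and takes a single tuple of arguments."""
        idx, date = args
        # ... stand-in for the real work, e.g. GetPrevDataByDate(d1, a, date, ...)
        return idx, date

    if __name__ == "__main__":
        final_col_dates = ["2020-01-01", "2020-01-02", "2020-01-03"]  # hypothetical
        tasks = list(enumerate(final_col_dates))

        pool = Pool(processes=4)
        results = pool.map(process_date, tasks)  # one argument per call avoids the mismatch
        pool.close()
        pool.join()
        print(results)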