Python 3 - Can pickle handle byte objects larger than 4GB?

Asked by 我寻月下人不归, 2020-12-01 03:54

Based on this comment and the referenced documentation, Pickle 4.0+ from Python 3.4+ should be able to pickle byte objects larger than 4 GB.

However, using Python 3, attempting to pickle a byte object of that size still fails.
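
The failing code from the original post is not preserved here, but a minimal attempt along these lines reproduces the problem on macOS (the object size and file name are illustrative, not from the original post):

    import pickle

    data = bytearray(8 * 1000 * 1000 * 1000)  # an ~8 GB byte object
    with open("x.dat", "wb") as fp:
        # Protocol 4 is required for objects larger than 4 GB, yet on older
        # Python builds on macOS this write can still fail with an OSError
        # (issue 24658, referenced in the answers below).
        pickle.dump(data, fp, protocol=4)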

7 Answers
  • 2020-12-01 04:37

    Had the same issue and fixed it by upgrading to Python 3.6.8.

    This seems to be the PR that did it: https://github.com/python/cpython/pull/9937

  • 2020-12-01 04:38

    Reading a file in 2 GB chunks takes twice as much memory as needed if the chunks are joined by bytes concatenation; my approach to loading pickles is instead based on a preallocated bytearray:

    class MacOSFile(object):
        """File wrapper that reads in chunks of less than 2 GB (works around the macOS large-read issue)."""

        def __init__(self, f):
            self.f = f

        def __getattr__(self, item):
            return getattr(self.f, item)

        def read(self, n):
            if n >= (1 << 31):
                # Preallocate the full buffer once and fill it chunk by chunk,
                # so no second copy of the data is created by concatenation.
                buffer = bytearray(n)
                pos = 0
                while pos < n:
                    size = min(n - pos, (1 << 31) - 1)  # read just under 2 GB at a time
                    chunk = self.f.read(size)
                    buffer[pos:pos + size] = chunk
                    pos += size
                return buffer
            return self.f.read(n)
    

    Usage:

    with open("/path", "rb") as fin:
        obj = pickle.load(MacOSFile(fin))
    
  • 2020-12-01 04:38

    I also ran into this issue. To solve it, I split the work into several iterations. Say I have 50,000 documents for which I have to compute tf-idf and do kNN classification. When I run it over all 50,000 at once it gives me "that error", so I process the documents in chunks:

    tokenized_documents = self.load_tokenized_preprocessing_documents()
    idf = self.load_idf_41227()
    doc_length = len(documents)
    chunk_size = 5000  # e.g. 5,000 documents per chunk, so each saved pickle stays small
    for iteration in range(0, doc_length // chunk_size):
        tfidf_documents = []
        for index in range(iteration * chunk_size, (iteration + 1) * chunk_size):
            doc_tfidf = []
            for term in idf.keys():
                tf = self.term_frequency(term, tokenized_documents[index])
                doc_tfidf.append(tf * idf[term])
            doc = documents[index]
            tfidf = [doc_tfidf, doc[0], doc[1]]
            tfidf_documents.append(tfidf)
            print("{} from {} document {}".format(index, doc_length, doc[0]))

        self.save_tfidf_41227(tfidf_documents, iteration)
    
  • 2020-12-01 04:41

    Here is a simple workaround for issue 24658. Use pickle.dumps and pickle.loads, and break the bytes object into chunks of size 2**31 - 1 to get it into or out of the file.

    import pickle
    import os.path
    
    file_path = "pkl.pkl"
    n_bytes = 2**31
    max_bytes = 2**31 - 1
    data = bytearray(n_bytes)
    
    ## write
    bytes_out = pickle.dumps(data)
    with open(file_path, 'wb') as f_out:
        for idx in range(0, len(bytes_out), max_bytes):
            f_out.write(bytes_out[idx:idx+max_bytes])
    
    ## read
    bytes_in = bytearray(0)
    input_size = os.path.getsize(file_path)
    with open(file_path, 'rb') as f_in:
        for _ in range(0, input_size, max_bytes):
            bytes_in += f_in.read(max_bytes)
    data2 = pickle.loads(bytes_in)
    
    assert(data == data2)
    
  • 2020-12-01 04:48

    You can specify the protocol for the dump. If you do pickle.dump(obj, file, protocol=4), it should work.
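
    A minimal sketch of that approach (the object and file path here are placeholders, not from the answer):

    import pickle

    big = bytearray(2 ** 32)  # placeholder object just over 4 GB
    with open("big.pkl", "wb") as f:
        pickle.dump(big, f, protocol=4)  # protocol 4 (Python 3.4+) supports objects > 4 GB

    with open("big.pkl", "rb") as f:
        restored = pickle.load(f)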

  • 2020-12-01 04:53

    Here is the full workaround, though it seems pickle.load no longer tries to read a huge file in a single call anymore (I am on Python 3.5.2), so strictly speaking only pickle.dump needs this to work properly.

    import pickle
    
    class MacOSFile(object):
    
        def __init__(self, f):
            self.f = f
    
        def __getattr__(self, item):
            return getattr(self.f, item)
    
        def read(self, n):
            # print("reading total_bytes=%s" % n, flush=True)
            if n >= (1 << 31):
                buffer = bytearray(n)
                idx = 0
                while idx < n:
                    batch_size = min(n - idx, (1 << 31) - 1)  # chunks just under 2 GB
                    # print("reading bytes [%s,%s)..." % (idx, idx + batch_size), end="", flush=True)
                    buffer[idx:idx + batch_size] = self.f.read(batch_size)
                    # print("done.", flush=True)
                    idx += batch_size
                return buffer
            return self.f.read(n)
    
        def write(self, buffer):
            n = len(buffer)
            print("writing total_bytes=%s..." % n, flush=True)
            idx = 0
            while idx < n:
                batch_size = min(n - idx, (1 << 31) - 1)  # chunks just under 2 GB
                print("writing bytes [%s, %s)... " % (idx, idx + batch_size), end="", flush=True)
                self.f.write(buffer[idx:idx + batch_size])
                print("done.", flush=True)
                idx += batch_size
    
    
    def pickle_dump(obj, file_path):
        with open(file_path, "wb") as f:
            return pickle.dump(obj, MacOSFile(f), protocol=pickle.HIGHEST_PROTOCOL)
    
    
    def pickle_load(file_path):
        with open(file_path, "rb") as f:
            return pickle.load(MacOSFile(f))
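
    Usage mirrors the standard pickle calls; the object and path below are placeholders:

    obj = {"blob": bytearray(5 * 10 ** 9)}  # any object whose pickle exceeds 4 GB
    pickle_dump(obj, "/tmp/big.pkl")
    restored = pickle_load("/tmp/big.pkl")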
    