Django loaddata - Out of Memory


Question


I made a dump of my db using dumpdata and it created a 500mb json file

now I am trying to use loaddata to restore the db, but seems like Django tries to load the entire file into memory before applying it and i get an out of memory error and the process is killed.

Isn't there a way to bypass this problem?


Answer 1:


loaddata is generally used for fixtures, i.e. a small number of database objects to get your system started and for tests, rather than for large chunks of data. If you're hitting memory limits then you're probably not using it for the right purpose.

If you still have the original database, you should use something more suited to the purpose, like PostgreSQL's pg_dump or MySQL's mysqldump.




Answer 2:


As Joe pointed out, PostgreSQL's pg_dump or MySQL's mysqldump is better suited to your case.

In case you have lost your original database, there are two ways you could try to get your data back:

One: Find another machine that has more memory and can access your database. Build your project on that machine and run the loaddata command there.

I know it sounds silly, but it is the quickest way if you can run Django on your laptop and connect to the db remotely.
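
For illustration, a minimal sketch of option One: point a local project's settings at the remote database and run loaddata there. The engine, host, and credentials below are placeholders, not values from the question.

# settings.py -- a minimal sketch; all values are placeholders
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'mydb',
        'USER': 'myuser',
        'PASSWORD': 'secret',
        'HOST': 'db.example.com',  # the remote database server
        'PORT': '5432',
    }
}

With settings like these, ./manage.py loaddata runs on the machine with more memory while the rows land in the original database.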

Two: Hack the Django source code.

Check the code in django.core.serializers.json:

def Deserializer(stream_or_string, **options):
    """
    Deserialize a stream or string of JSON data.
    """
    if not isinstance(stream_or_string, (bytes, six.string_types)):
        stream_or_string = stream_or_string.read()
    if isinstance(stream_or_string, bytes):
        stream_or_string = stream_or_string.decode('utf-8')
    try:
        objects = json.loads(stream_or_string)
        for obj in PythonDeserializer(objects, **options):
            yield obj
    except GeneratorExit:
        raise
    except Exception as e:
        # Map to deserializer error
        six.reraise(DeserializationError, DeserializationError(e), sys.exc_info()[2])

The two lines below are the problem: the json module in the stdlib only accepts a string and cannot handle a stream lazily, so Django loads the entire content of the JSON file into memory.

stream_or_string = stream_or_string.read()
objects = json.loads(stream_or_string)

You could optimize that code with py-yajl, which provides an alternative to the built-in json.loads and json.dumps using yajl.
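
As a minimal sketch of that swap, assuming py-yajl is installed (it imports as yajl): the function below mirrors the json.loads call in the deserializer above. It parses faster, but as the next answer notes, it still builds the whole object list in memory.

import yajl  # from the py-yajl package

def parse_dump(json_text):
    # yajl.loads mirrors json.loads, so it can stand in for the stdlib call
    # above; it is faster, but still materialises the full list of objects.
    return yajl.loads(json_text)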




Answer 3:


I'd like to add that I was quite successful in a similar use-case with ijson: https://github.com/isagalaev/ijson

In order to get an iterator over the objects in a json file from django dumpdata, I modified the json Deserializer like this (imports elided):

Serializer = django.core.serializers.json.Serializer


def Deserializer(stream_or_string, **options):

    if isinstance(stream_or_string, six.string_types):
        stream_or_string = six.BytesIO(stream_or_string.encode('utf-8'))
    try:
        objects = ijson.items(stream_or_string, 'item')
        for obj in PythonDeserializer(objects, **options):
            yield obj
    except GeneratorExit:
        raise
    except Exception as e:
        # Map to deserializer error
        six.reraise(DeserializationError, DeserializationError(e), sys.exc_info()[2])

The problem with using py-yajl as-is is that you still get all the objects in one large array, which uses a lot of memory. This loop only uses as much memory as a single serialized Django object. Also ijson can still use yajl as a backend.
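
As a standalone illustration of the same idea, here is a minimal sketch that streams records straight out of a dumpdata file with ijson; the filename is a placeholder.

import ijson

with open('big_fixture.json', 'rb') as f:  # placeholder filename
    # 'item' yields each element of the top-level JSON array lazily,
    # so memory use stays close to the size of a single record.
    for record in ijson.items(f, 'item'):
        print(record['model'], record.get('pk'))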




Answer 4:


I ran into this problem migrating data from Microsoft SQL Server to PostgreSQL, so pg_dump-style SQL dumps weren't an option for me. I split my JSON fixtures into chunks that would fit in memory (about 1M rows for a wide table, given 64 GB of RAM).

from django.core import serializers


def dump_json(model, batch_len=1000000):
    """Dump database records to JSON files in Django fixture format, one file per batch of 1M records."""
    JSONSerializer = serializers.get_serializer("json")
    jser = JSONSerializer()
    # `util` is a helper module from the pug package mentioned below;
    # generate_slices() yields the queryset in slices of batch_len records.
    for i, partial_qs in enumerate(util.generate_slices(model.objects.all(), batch_len=batch_len)):
        with open(model._meta.app_label + '--' + model._meta.object_name + '--%04d.json' % i, 'w') as fpout:
            jser.serialize(partial_qs, indent=1, stream=fpout)
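
For example, a hypothetical call for a model BigTable in an app myapp (placeholder names):

from myapp.models import BigTable  # placeholder app and model

dump_json(BigTable, batch_len=1000000)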

You can then load them with manage.py loaddata <app_name>--<model_name>*.json. But in my case I had to first sed the files to change the model and app names so they'd load into the right database. I also nulled the pk, because I'd changed the pk to an AutoField (a best practice for Django).

sed -e 's/^\ \"pk\"\:\ \".*\"\,/"pk": null,/g' -i *.json
sed -e 's/^\ \"model\"\:\ \"old_app_name\.old_model_name\"\,/\ \"model\"\:\ "new_app_name\.new_model_name\"\,/g' -i *.json

You might find pug useful. It's a FOSS Python package of similarly hacky tools for handling large migration and data mining tasks in Django.




Answer 5:


You can use the XML format for serialization/deserialization. It's implemented internally via file streams and doesn't require a lot of memory compared with JSON. Unfortunately, Django's JSON deserialization doesn't use streams.

So just try:

./manage.py dumpdata --format=xml > file.xml

and then

./manage.py loaddata file.xml



Answer 6:


Had a problem as well with pg_dump/pg_restore due to constraints applied on some fields.

In my case I'm running Django on AWS Lambda via Zappa and wanted to migrate to Aurora Serverless (Postgres). I had generated the dumpdata file on a bastion t2.micro instance, but when I tried loaddata, the micro instance didn't have enough memory and the process got killed by the OS.

So I needed to chunk the data so that it could be handled in memory on the instance, and because of the field constraints the records had to be loaded in a certain order (otherwise I got errors that a linked record didn't exist).

So here's my script to split the dumpdata output into chunks, ordered so they can be loaded successfully without constraint-related errors:

NOTE: This is prepared on a machine with enough memory to hold all the data. The resulting chunk files were then transferred to the t2.micro instance, where loaddata was run on them in the order produced by the script.

import json
from typing import List
from collections import Counter, defaultdict
from pathlib import Path

working_directory = Path.home() / 'migratedb' 
dumpdata_filepath = working_directory / 'db_backup_20190830.json'

def chunk_dumpdata_json(dumpdata_filepath: Path, app_model_order: List):
    file_creation_order = []
    max_chunk_records = 25000    
    with dumpdata_filepath.open('r') as data_in:
        all_data = json.loads(data_in.read())        
        print(f'record count: {len(all_data)}')
        model_records = defaultdict(list)    
        for total_count, record in enumerate(all_data):                        
            app_model_name = record['model']
            assert app_model_name in app_model_order, f'{app_model_name} not in: {app_model_order}'
            model_records[app_model_name].append(record)


        # chunk by model order
        total_record_count = 0
        chunks = defaultdict(list)
        for app_model in app_model_order:            
            for record in model_records[app_model]:                
                record_chunk = total_record_count - (total_record_count % max_chunk_records)            
                chunks[record_chunk].append(record)
                total_record_count += 1

        for chunk, records in chunks.items():            
            chunk_filename = f'dumpdata_v1_chunk{chunk}.json'
            chunk_filepath = working_directory / chunk_filename
            print(chunk_filepath.name)
            file_creation_order.append(chunk_filepath.name)
            with chunk_filepath.open('w', encoding='utf8') as out:
                out.write(json.dumps(records))        
    return file_creation_order

app_model_order = (
    'app.model1',
    'app.model2',    
)                

result_file_creation_order = chunk_dumpdata_json(dumpdata_filepath, app_model_order)      

Then I took the output of the script below, saved it to loaddata.sh and ran it:

for item in result_file_creation_order:
    if item:
        print(f'echo "Loading {item} ..."')
        print(f'python manage.py loaddata ~/migrationdata/{item}')


Source: https://stackoverflow.com/questions/23047766/django-loaddata-out-of-memory
