Question
I want to load a massive amount of data into PostgreSQL. Do you know any other "tricks" apart from the ones mentioned in the PostgreSQL documentation?
What have I done up to now?
1) set the following parameters in postgresql.conf (for 64 GB of RAM):
shared_buffers = 26GB
work_mem = 40GB
maintenance_work_mem = 10GB # min 1MB default: 16 MB
effective_cache_size = 48GB
max_wal_senders = 0 # max number of walsender processes
wal_level = minimal # minimal, archive, or hot_standby
synchronous_commit = off # use only when your system does nothing but load data (if there are other updates from clients it can result in data loss!)
archive_mode = off # allows archiving to be done
autovacuum = off # Enable autovacuum subprocess? 'on'
checkpoint_segments = 256 # in logfile segments, min 1, 16MB each; default = 3; 256 = write every 4 GB
checkpoint_timeout = 30min # range 30s-1h, default = 5min
checkpoint_completion_target = 0.9 # checkpoint target duration, 0.0 - 1.0
checkpoint_warning = 0 # 0 disables, default = 30s
2) transactions (autocommit disabled) + set the isolation level (repeatable read); I create a new table and load the data into it in the same transaction.
3) set up the COPY commands to run in a single transaction (supposedly it is the fastest way to COPY data); see the sketch after this list.
5) disabled autovacuum (so statistics are not regenerated after every 50 new rows added)
6) COPY FREEZE: it does not speed up the import itself, but it makes operations after the import faster.
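Putting items 2, 3 and 6 together, a minimal sketch of the load transaction might look like the lines below; the table name, columns and file path are hypothetical, and COPY ... FREEZE only works when the table was created (or truncated) in the same transaction (PostgreSQL 9.3+), with the server needing read access to the file for a server-side COPY.
BEGIN;
CREATE TABLE staging_data (id bigint, payload text);  -- hypothetical target table
-- FREEZE is allowed here only because the table was created in this transaction
COPY staging_data FROM '/path/to/data.csv' WITH (FORMAT csv, FREEZE);
COMMIT;
Because the CREATE TABLE and the COPY run inside one explicit BEGIN/COMMIT, this also covers the "single transaction" point from item 3.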
Do you have any other recommendations or maybe you do not agree with the aforementioned settings?
Answer 1:
Do NOT use indexes except for a unique single numeric key.
That doesn't fit with the DB theory we were all taught, but testing with heavy data loads demonstrates it. Here is the result of loading 100M rows at a time to reach 2 billion rows in a table, running a bunch of various queries on the resulting table after each load. The first graph is with a 10-gigabit NAS (150 MB/s), the second with 4 SSDs in RAID 0 (R/W @ 2 GB/s).
If you have more than 200 million rows in a table on regular disks, it's faster if you forget indexes. On SSDs, the limit is around 1 billion.
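A common way to follow that advice (not spelled out in the answer, so the names here are hypothetical) is to keep only the single numeric key during the bulk load and rebuild any secondary indexes afterwards:
-- before the load: drop secondary indexes, keep only the numeric primary key
DROP INDEX IF EXISTS idx_big_table_category;  -- hypothetical secondary index
COPY big_table FROM '/data/batch_042.csv' WITH (FORMAT csv);
-- after the load: recreate the secondary indexes in one pass
CREATE INDEX idx_big_table_category ON big_table (category);
One bulk CREATE INDEX at the end is generally much cheaper than maintaining the index row by row during the load.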
I've also done it with partitions for better results, but with PG 9.2 it's difficult to benefit from them if you use stored procedures. You also have to take care to write/read only one partition at a time. However, partitions are the way to go to keep your tables below the 1-billion-row wall. They also help a lot with multiprocessing your loads. With an SSD, a single process lets me insert (COPY) 18,000 rows/s (with some processing work included). With multiprocessing on 6 CPUs, that grows to 80,000 rows/s.
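A rough sketch of the inheritance-style partitioning that PG 9.2 offered (declarative PARTITION BY only arrived in PG 10); the table, columns, range boundaries and file path are all hypothetical:
-- parent table plus two range "partitions" implemented via inheritance
CREATE TABLE events (id bigint, payload text);
CREATE TABLE events_p0 (CHECK (id >= 0 AND id < 1000000000)) INHERITS (events);
CREATE TABLE events_p1 (CHECK (id >= 1000000000 AND id < 2000000000)) INHERITS (events);
-- each loader process writes to exactly one child table, so several
-- of these COPY commands can run in parallel from separate connections
COPY events_p0 FROM '/data/chunk_p0.csv' WITH (FORMAT csv);
Running one COPY per partition from separate connections is one way to get the multi-process speedup described above.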
Watch your CPU and I/O usage while testing to optimize both.
Source: https://stackoverflow.com/questions/30184431/what-is-the-best-way-to-load-a-massive-amount-of-data-into-postgresql