Question
I want to load a massive amount of data into PostgreSQL. Do you know any other "tricks" apart from the ones mentioned in the PostgreSQL documentation?
What have I done up to now?
1) set the following parameters in postgresql.conf (for 64 GB of RAM):
shared_buffers = 26GB
work_mem = 40GB
maintenance_work_mem = 10GB # min 1MB default: 16 MB
effective_cache_size = 48GB
max_wal_senders = 0 # max number of walsender processes
wal_level = minimal # minimal, archive, or hot_standby
synchronous_commit = off # use only when your system does nothing but load data (if there are other updates from clients it can result in data loss!)
archive_mode = off # allows archiving to be done
autovacuum = off # Enable autovacuum subprocess? 'on'
checkpoint_segments = 256 # in logfile segments, min 1, 16MB each; default = 3; 256 = write every 4 GB
checkpoint_timeout = 30min # range 30s-1h, default = 5min
checkpoint_completion_target = 0.9 # checkpoint target duration, 0.0 - 1.0
checkpoint_warning = 0 # 0 disables, default = 30s
2) transactions (autocommit disabled) + set the isolation level (repeatable read); I create a new table and load the data into it in the same transaction.
3) set up the COPY commands to run in a single transaction (supposedly it is the fastest way to COPY data); see the sketch after this list.
5) disabled autovacuum (so statistics are not regenerated after every 50 new rows added)
6) COPY FREEZE: it does not speed up the import itself, but it makes operations after the import faster.
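Putting items 2, 3 and 6 together, a minimal sketch of the load transaction might look like the lines below; the table name, columns and file path are hypothetical, and COPY ... FREEZE only works when the table was created (or truncated) in the same transaction (PostgreSQL 9.3+), with the server needing read access to the file for a server-side COPY.
BEGIN;
CREATE TABLE staging_data (id bigint, payload text);  -- hypothetical target table
-- FREEZE is allowed here only because the table was created in this transaction
COPY staging_data FROM '/path/to/data.csv' WITH (FORMAT csv, FREEZE);
COMMIT;
Because the CREATE TABLE and the COPY run inside one explicit BEGIN/COMMIT, this also covers the "single transaction" point from item 3.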
Do you have any other recommendations or maybe you do not agree with the aforementioned settings?
Answer 1:
Do NOT use indexes except for a unique single numeric key.
That doesn't fit with the DB theory we were all taught, but testing with heavy data loads demonstrates it. Here is the result of loading 100M rows at a time to reach 2 billion rows in a table, running a bunch of various queries on the resulting table after each load. The first graph is with a 10-gigabit NAS (150 MB/s), the second with 4 SSDs in RAID 0 (R/W @ 2 GB/s).
If you have more than 200 million rows in a table on regular disks, it's faster if you forget indexes. On SSDs, the limit is around 1 billion.
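A common way to follow that advice (not spelled out in the answer, so the names here are hypothetical) is to keep only the single numeric key during the bulk load and rebuild any secondary indexes afterwards:
-- before the load: drop secondary indexes, keep only the numeric primary key
DROP INDEX IF EXISTS idx_big_table_category;  -- hypothetical secondary index
COPY big_table FROM '/data/batch_042.csv' WITH (FORMAT csv);
-- after the load: recreate the secondary indexes in one pass
CREATE INDEX idx_big_table_category ON big_table (category);
One bulk CREATE INDEX at the end is generally much cheaper than maintaining the index row by row during the load.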
I've also done it with partitions for better results, but with PG 9.2 it's difficult to benefit from them if you use stored procedures. You also have to take care to write/read only one partition at a time. However, partitions are the way to go to keep your tables below the 1-billion-row wall. They also help a lot with multiprocessing your loads. With an SSD, a single process lets me insert (COPY) 18,000 rows/s (with some processing work included). With multiprocessing on 6 CPUs, that grows to 80,000 rows/s.
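A rough sketch of the inheritance-style partitioning that PG 9.2 offered (declarative PARTITION BY only arrived in PG 10); the table, columns, range boundaries and file path are all hypothetical:
-- parent table plus two range "partitions" implemented via inheritance
CREATE TABLE events (id bigint, payload text);
CREATE TABLE events_p0 (CHECK (id >= 0 AND id < 1000000000)) INHERITS (events);
CREATE TABLE events_p1 (CHECK (id >= 1000000000 AND id < 2000000000)) INHERITS (events);
-- each loader process writes to exactly one child table, so several
-- of these COPY commands can run in parallel from separate connections
COPY events_p0 FROM '/data/chunk_p0.csv' WITH (FORMAT csv);
Running one COPY per partition from separate connections is one way to get the multi-process speedup described above.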
Watch your CPU and I/O usage while testing to optimize both.
Source: https://stackoverflow.com/questions/30184431/what-is-the-best-way-to-load-a-massive-amount-of-data-into-postgresql