The query:
SELECT \"replays_game\".*
FROM \"replays_game\"
INNER JOIN
\"replays_playeringame\" ON \"replays_game\".\"id\" = \"replays_playeringame\".\"game_
I ran sayap's testbed-code (Thanks!) , with the following modifications:
After this run, I did the same run, but scaled up tenfold: with 1M5 records (30K hard-hitters)
Currently, I am running the same test with a hundred-fold scale-up, but the initialisation is rather slow...
Results The entries in the cells are the total time in msec plus a string that denotes the chosen queryplan. (only a handfull of plans occur)
Original 3K / 150K work_mem=16M
rpc | 3K | 1K5 | 750 | 375
--------+---------------+---------------+---------------+------------
8* | 50.8 H.BBi.HS| 44.3 H.BBi.HS| 38.5 H.BBi.HS| 41.0 H.BBi.HS
4 | 43.6 H.BBi.HS| 48.6 H.BBi.HS| 4.34 NBBi | 1.33 NBBi
2 | 6.92 NBBi | 3.51 NBBi | 4.61 NBBi | 1.24 NBBi
1 | 6.43 NII | 3.49 NII | 4.19 NII | 1.18 NII
Original 3K / 150K work_mem=64K
rpc | 3K | 1K5 | 750 | 375
--------+---------------+---------------+---------------+------------
8* | 74.2 H.BBi.HS| 69.6 NBBi | 62.4 H.BBi.HS| 66.9 H.BBi.HS
4 | 6.67 NBBi | 8.53 NBBi | 1.91 NBBi | 2.32 NBBi
2 | 6.66 NBBi | 3.6 NBBi | 1.77 NBBi | 0.93 NBBi
1 | 7.81 NII | 3.26 NII | 1.67 NII | 0.86 NII
Scaled 10*: 30K / 1M5 work_mem=16M
rpc | 30K | 15K | 7k5 | 3k75
--------+---------------+---------------+---------------+------------
8* | 623 H.BBi.HS| 556 H.BBi.HS| 531 H.BBi.HS| 14.9 NBBi
4 | 56.4 M.I.sBBi| 54.3 NBBi | 27.1 NBBi | 19.1 NBBi
2 | 71.0 NBBi | 18.9 NBBi | 9.7 NBBi | 9.7 NBBi
1 | 79.0 NII | 35.7 NII | 17.7 NII | 9.3 NII
Scaled 10*: 30K / 1M5 work_mem=64K
rpc | 30K | 15K | 7k5 | 3k75
--------+---------------+---------------+---------------+------------
8* | 729 H.BBi.HS| 722 H.BBi.HS| 723 H.BBi.HS| 19.6 NBBi
4 | 55.5 M.I.sBBi| 41.5 NBBi | 19.3 NBBi | 13.3 NBBi
2 | 70.5 NBBi | 41.0 NBBi | 26.3 NBBi | 10.7 NBBi
1 | 69.7 NII | 38.5 NII | 20.0 NII | 9.0 NII
Scaled 100*: 300K / 15M work_mem=16M
rpc | 300k | 150K | 75k | 37k5
--------+---------------+---------------+---------------+---------------
8* |7314 H.BBi.HS|9422 H.BBi.HS|6175 H.BBi.HS| 122 N.BBi.I
4 | 569 M.I.sBBi| 199 M.I.sBBi| 142 M.I.sBBi| 105 N.BBi.I
2 | 527 M.I.sBBi| 372 N.BBi.I | 198 N.BBi.I | 110 N.BBi.I
1 | 694 NII | 362 NII | 190 NII | 107 NII
Scaled 100*: 300K / 15M work_mem=64K
rpc | 300k | 150k | 75k | 37k5
--------+---------------+---------------+---------------+------------
8* |22800 H.BBi.HS |21920 H.BBi.HS | 20630 N.BBi.I |19669 H.BBi.HS
4 |22095 H.BBi.HS | 284 M.I.msBBi| 205 B.BBi.I | 116 N.BBi.I
2 | 528 M.I.msBBi| 399 N.BBi.I | 211 N.BBi.I | 110 N.BBi.I
1 | 718 NII | 364 NII | 200 NII | 105 NII
[8*] Note: the RandomPageCost=8 runs were only intended as a prerun to prime the disk buffer cache; the results should be ignored.
Legend for node types:
N := Nested loop
M := Merge join
H := Hash (or Hash join)
B := Bitmap heap scan
Bi := Bitmap index scan
S := Seq scan
s := sort
m := materialise
Preliminary conclusion:
"the working set" for the original query is too small: all of it fits in core, resulting in the cost of page fetches to be grossly overestimated. Setting RPC to 2 (or 1) "solves" this problem, but once the query is scaled-up, the page-costs become dominant, and RPC=4 becomes comparable or even better.
Setting work_mem to a lower value is another way to make the optimiser shift to index-scans (instead of hash+bitmap-scans). The differences I found are smaller than what Sayap reported. Maybe I have more effective_cache_size, or he forgot to prime the cache?