Is it better to filter a resultset using a WHERE clause or using application code?

面向向阳花 2021-01-02 06:52

OK, here is a simple abstraction of the problem:

Two variables (male_users and female_users) store two groups of users, i.e. male and female.

  1. 1 way is to us
3 answers
  •  Happy的楠姐
    2021-01-02 07:34

    I'd argue that there's really no reason to make your DB do the extra work of evaluating the WHERE clause. Given that you actually want all the records, you will have to do the work of fetching them. If you do a single SELECT from the table, it will retrieve them all in table-order and you can partition them yourself. If you SELECT WHERE male and SELECT WHERE female, you'll have to hit an index for each operation, and you'll lose some data locality.

    For example, if your records on disk are alternating male-female and you have a dataset much larger than memory, you'll likely have to read the entire database twice if you do two separate queries, whereas a single SELECT for both will be a single table scan.
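    The two approaches compared here can be sketched as follows. This is only an illustration, assuming Python's built-in sqlite3 as a stand-in for the real database and a tiny made-up dataset; the helper names fetch_partitioned, fetch_filtered and demo are hypothetical, and the original question was about PHP.

```python
import sqlite3

def fetch_partitioned(conn):
    """One full scan; split the rows by gender in application code."""
    male_users, female_users = [], []
    for row in conn.execute("SELECT some_data, gender FROM gender_test"):
        # row[1] is the gender column
        (male_users if row[1] == "male" else female_users).append(row)
    return male_users, female_users

def fetch_filtered(conn):
    """Two queries; the database evaluates the WHERE clause each time."""
    sql = "SELECT some_data, gender FROM gender_test WHERE gender = ?"
    male_users = conn.execute(sql, ("male",)).fetchall()
    female_users = conn.execute(sql, ("female",)).fetchall()
    return male_users, female_users

def demo():
    # In-memory stand-in table with 10 illustrative rows (assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE gender_test (some_data REAL, gender TEXT)")
    rows = [(float(i), "male" if i % 3 == 0 else "female") for i in range(10)]
    conn.executemany("INSERT INTO gender_test VALUES (?, ?)", rows)
    return fetch_partitioned(conn), fetch_filtered(conn)
```

    Both functions return the same two groups; the only difference is whether the split happens in the database or in the application.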

    EDIT: Since I'm getting downmodded into oblivion, I decided to actually run the test. I've generated a table

    CREATE TEMPORARY TABLE gender_test (some_data DOUBLE PRECISION, gender CHARACTER VARYING(20));

    I generated some random data,

    select gender, count(*) from gender_test group by gender;
     gender |  count
    --------+----------
     female | 12603133
     male   | 10465539
    (2 rows)

    First, let's run these tests without indices, in which case I'm quite sure I'm right...

    test=> EXPLAIN ANALYSE SELECT * FROM gender_test WHERE gender='male';
    QUERY PLAN


    Seq Scan on gender_test (cost=0.00..468402.00 rows=96519 width=66) (actual time=0.030..4595.367 rows=10465539 loops=1)
    Filter: ((gender)::text = 'male'::text)
    Total runtime: 5150.263 ms

    test=> EXPLAIN ANALYSE SELECT * FROM gender_test WHERE gender='female';
    QUERY PLAN


    Seq Scan on gender_test (cost=0.00..468402.00 rows=96519 width=66) (actual time=0.029..4751.219 rows=12603133 loops=1)
    Filter: ((gender)::text = 'female'::text)
    Total runtime: 5418.891 ms

    test=> EXPLAIN ANALYSE SELECT * FROM gender_test;
    QUERY PLAN


    Seq Scan on gender_test (cost=0.00..420142.40 rows=19303840 width=66) (actual time=0.021..3326.164 rows=23068672 loops=1)
    Total runtime: 4543.393 ms
    (2 rows)

    Funny, looks like fetching the data in a single table scan without the filter is indeed faster than the two filtered queries combined! In fact, more than twice as fast! (5150 + 5418 = 10568 ms vs 4543 ms) Much like I predicted! :-p

    Now, let's make an index and see if it changes the results...

    CREATE INDEX test_index ON gender_test(gender);

    Now to rerun the same queries...

    test=> EXPLAIN ANALYSE SELECT * FROM gender_test WHERE gender='male';
    QUERY PLAN


    Bitmap Heap Scan on gender_test (cost=2164.69..195922.27 rows=115343 width=66) (actual time=2008.877..4388.348 rows=10465539 loops=1)
    Recheck Cond: ((gender)::text = 'male'::text)
    -> Bitmap Index Scan on test_index (cost=0.00..2135.85 rows=115343 width=0) (actual time=2006.047..2006.047 rows=10465539 loops=1)
    Index Cond: ((gender)::text = 'male'::text)
    Total runtime: 4941.64 ms

    test=> EXPLAIN ANALYSE SELECT * FROM gender_test WHERE gender='female';
    QUERY PLAN


    Bitmap Heap Scan on gender_test (cost=2164.69..195922.27 rows=115343 width=66) (actual time=1915.385..4269.933 rows=12603133 loops=1)
    Recheck Cond: ((gender)::text = 'female'::text)
    -> Bitmap Index Scan on test_index (cost=0.00..2135.85 rows=115343 width=0) (actual time=1912.587..1912.587 rows=12603133 loops=1)
    Index Cond: ((gender)::text = 'female'::text)
    Total runtime: 4931.555 ms
    (5 rows)

    test=> EXPLAIN ANALYSE SELECT * FROM gender_test;
    QUERY PLAN


    Seq Scan on gender_test (cost=0.00..457790.72 rows=23068672 width=66) (actual time=0.021..3304.836 rows=23068672 loops=1)
    Total runtime: 4523.754 ms

    Funny.... scanning the entire table in one go is still twice as fast! (4941 + 4931 vs 4523)
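    For reference, the ratios behind the "twice as fast" claims can be recomputed from the runtimes reported above. A small sketch; the dictionary and function names are made up for illustration, and the numbers are copied from the EXPLAIN ANALYSE output in this answer.

```python
# Total runtimes in ms, copied from the EXPLAIN ANALYSE output in the answer.
NO_INDEX = {"male": 5150.263, "female": 5418.891, "all": 4543.393}
WITH_INDEX = {"male": 4941.64, "female": 4931.555, "all": 4523.754}

def speedup(timings):
    """Ratio of the two filtered queries combined to the single full scan."""
    return (timings["male"] + timings["female"]) / timings["all"]

if __name__ == "__main__":
    print(f"no index:   {speedup(NO_INDEX):.2f}x")
    print(f"with index: {speedup(WITH_INDEX):.2f}x")
```

    This works out to roughly 2.3x without the index and 2.2x with it, not counting the time the application spends partitioning the rows.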

    NOTE: There are all sorts of ways this is unscientific. I'm running with 16GB of RAM, so the entire dataset fits into memory. Postgres isn't configured to use nearly that much, but the disk cache still helps... I'd hypothesize (but can't be assed to actually try) that the effects only get worse once you hit disk. I only tried the default Postgres B-tree index. I'm assuming the PHP partitioning takes no time, which isn't true but is probably a pretty reasonable approximation.
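    The partitioning cost that the note sets to zero could be measured directly. Below is a rough sketch in Python rather than PHP, with made-up in-memory rows standing in for a fetched result set; time_partition and its parameters are hypothetical, and the elapsed time will depend entirely on the machine and language runtime.

```python
import random
import time

def partition(rows):
    """Split (some_data, gender) tuples into male and female lists."""
    male, female = [], []
    for row in rows:
        (male if row[1] == "male" else female).append(row)
    return male, female

def time_partition(n=1_000_000, seed=0):
    """Return (male count, female count, elapsed ms) for n random rows."""
    rng = random.Random(seed)
    rows = [(rng.random(), rng.choice(("male", "female"))) for _ in range(n)]
    start = time.perf_counter()
    male, female = partition(rows)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return len(male), len(female), elapsed_ms
```

    Adding the measured time to the single-scan runtime gives a fairer comparison against the two filtered queries.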

    All tests were run on a Mac Pro: 8-way 2.66 GHz Xeon, 16 GB RAM, RAID-0 on 7200 rpm disks.

    Also, this dataset is about 23 million rows (23,068,672 by the counts above), which is probably a bit larger than most people care about...

    Obviously, raw speed isn't the only thing you care about. In many (most?) applications, you'd care more about the logical "correctness" of fetching them separately. But, when it comes down to your boss saying "we need this to go faster" this will (apparently) give you a 2x speedup. The OP explicitly asked about efficiency. Happy?
