Is it possible to answer queries on a view before fully materializing the view?

Question

In short: DISTINCT, MIN, and MAX on the left-hand side of a LEFT JOIN should be answerable without doing the join.

I’m using a SQL array type (on Postgres 9.3) to condense several rows of data into a single row, and a view to return the unnested, normalized rows. I do this to save on index costs and to get Postgres to compress the data in the array.
Things work pretty well, but some queries that could be answered without unnesting are quite expensive, because they are evaluated only after the view has been exploded into rows. Is there any way to solve this?

Here is the basic table:

CREATE TABLE mt_count_by_day
(
  run_id integer NOT NULL,
  type character varying(64) NOT NULL,
  start_day date NOT NULL,
  end_day date NOT NULL,
  counts bigint[] NOT NULL,
  CONSTRAINT mt_count_by_day_pkey PRIMARY KEY (run_id, type)
);

An index on ‘type’ just for good measure:

CREATE INDEX runinfo_mt_count_by_day_type_idx on runinfo.mt_count_by_day (type);

Here is the view, which uses generate_series and unnest:

CREATE OR REPLACE VIEW runinfo.v_mt_count_by_day AS
 SELECT mt_count_by_day.run_id,
    mt_count_by_day.type,
    mt_count_by_day.brand,
    generate_series(mt_count_by_day.start_day::timestamp without time zone, mt_count_by_day.end_day - '1 day'::interval, '1 day'::interval) AS row_date,
    unnest(mt_count_by_day.counts) AS row_count
   FROM runinfo.mt_count_by_day;

What if I want to do DISTINCT on the 'type' column?

explain analyze select distinct(type) from mt_count_by_day;

"HashAggregate  (cost=9566.81..9577.28 rows=1047 width=19) (actual time=171.653..172.019 rows=1221 loops=1)"
"  ->  Seq Scan on mt_count_by_day  (cost=0.00..9318.25 rows=99425 width=19) (actual time=0.089..99.110 rows=99425 loops=1)"
"Total runtime: 172.338 ms"

Now what happens if I do the same on the view?

explain analyze select distinct(type) from v_mt_count_by_day;

"HashAggregate  (cost=1749752.88..1749763.34 rows=1047 width=19) (actual time=58586.934..58587.191 rows=1221 loops=1)"
"  ->  Subquery Scan on v_mt_count_by_day  (cost=0.00..1501190.38 rows=99425000 width=19) (actual time=0.114..37134.349 rows=68299959 loops=1)"
"        ->  Seq Scan on mt_count_by_day  (cost=0.00..506940.38 rows=99425000 width=597) (actual time=0.113..24907.147 rows=68299959 loops=1)"
"Total runtime: 58587.474 ms"

Is there a way to get Postgres to recognize that it can answer this without first exploding the view?

For comparison, here we count the rows matching a filter, first in the table and then in the view. Everything works as expected: Postgres filters the rows down before exploding the view. It is not quite the same case, but this property is what makes our data manageable.

explain analyze select count(*) from mt_count_by_day where type = 'SOCIAL_GOOGLE'
"Aggregate  (cost=157.01..157.02 rows=1 width=0) (actual time=0.538..0.538 rows=1 loops=1)"
"  ->  Bitmap Heap Scan on mt_count_by_day  (cost=4.73..156.91 rows=40 width=0) (actual time=0.139..0.509 rows=122 loops=1)"
"        Recheck Cond: ((type)::text = 'SOCIAL_GOOGLE'::text)"
"        ->  Bitmap Index Scan on runinfo_mt_count_by_day_type_idx  (cost=0.00..4.72 rows=40 width=0) (actual time=0.098..0.098 rows=122 loops=1)"
"              Index Cond: ((type)::text = 'SOCIAL_GOOGLE'::text)"
"Total runtime: 0.625 ms"

explain analyze select count(*) from v_mt_count_by_day where type = 'SOCIAL_GOOGLE'
"Aggregate  (cost=857.11..857.12 rows=1 width=0) (actual time=6.827..6.827 rows=1 loops=1)"
"  ->  Bitmap Heap Scan on mt_count_by_day  (cost=4.73..357.11 rows=40000 width=597) (actual time=0.124..5.294 rows=15916 loops=1)"
"        Recheck Cond: ((type)::text = 'SOCIAL_GOOGLE'::text)"
"        ->  Bitmap Index Scan on runinfo_mt_count_by_day_type_idx  (cost=0.00..4.72 rows=40 width=0) (actual time=0.082..0.082 rows=122 loops=1)"
"              Index Cond: ((type)::text = 'SOCIAL_GOOGLE'::text)"
"Total runtime: 6.885 ms"

Here is the code required to reproduce this:

CREATE TABLE base_table
(
  run_id integer NOT NULL,
  type integer NOT NULL,
  start_day date NOT NULL,
  end_day date NOT NULL,
  counts bigint[] NOT NULL,
  CONSTRAINT match_check CHECK (end_day > start_day  AND (end_day - start_day) = array_length(counts, 1)),
  CONSTRAINT base_table_pkey PRIMARY KEY (run_id, type)
);

--Just because...
CREATE INDEX base_type_idx on base_table (type);

CREATE OR REPLACE VIEW v_foo AS
SELECT m.run_id,
       m.type,
       t.row_date::date,
       t.row_count
FROM   base_table m
LEFT   JOIN LATERAL ROWS FROM (
          unnest(m.counts),
          generate_series(m.start_day, m.end_day-1, interval '1d')
       ) t(row_count, row_date) ON true;



insert into base_table
select a.run_id, a.type, '20120101'::date as start_day, '20120401'::date as end_day, b.counts  from (SELECT N AS run_id, L as type
FROM
    generate_series(1, 10000) N
CROSS JOIN
    generate_series(1, 7) L
ORDER BY N, L) a,  (SELECT array_agg(generate_series)::bigint[] as counts FROM generate_series(1, 91) ) b;

And the results on 9.4.1:

explain analyze select distinct type from base_table;

"HashAggregate  (cost=6750.00..6750.03 rows=3 width=4) (actual time=51.939..51.940 rows=3 loops=1)"
"  Group Key: type"
"  ->  Seq Scan on base_table  (cost=0.00..6600.00 rows=60000 width=4) (actual time=0.030..33.655 rows=60000 loops=1)"
"Planning time: 0.086 ms"
"Execution time: 51.975 ms"

explain analyze select distinct type from v_foo;

"HashAggregate  (cost=1356600.01..1356600.04 rows=3 width=4) (actual time=9215.630..9215.630 rows=3 loops=1)"
"  Group Key: m.type"
"  ->  Nested Loop Left Join  (cost=0.01..1206600.01 rows=60000000 width=4) (actual time=0.112..7834.094 rows=5460000 loops=1)"
"        ->  Seq Scan on base_table m  (cost=0.00..6600.00 rows=60000 width=764) (actual time=0.009..42.694 rows=60000 loops=1)"
"        ->  Function Scan on t  (cost=0.01..10.01 rows=1000 width=0) (actual time=0.091..0.111 rows=91 loops=60000)"
"Planning time: 0.132 ms"
"Execution time: 9215.686 ms"

Answer 1:

Generally, the Postgres query planner does "inline" views to optimize the whole query. Per documentation:

One application of the rewrite system is in the realization of views. Whenever a query against a view (i.e., a virtual table) is made, the rewrite system rewrites the user's query to a query that accesses the base tables given in the view definition instead.

But I don't think Postgres is smart enough to conclude that it can reach the same result from the base table without exploding rows.

You can try this alternative query with a LATERAL join. It's cleaner:

CREATE OR REPLACE VIEW runinfo.v_mt_count_by_day AS
SELECT m.run_id, m.type, m.brand
     , m.start_day + c.rn::int - 1 AS row_date  -- rn is bigint; date + integer needs an explicit cast
     , c.row_count
FROM   runinfo.mt_count_by_day m
LEFT   JOIN LATERAL unnest(m.counts) WITH ORDINALITY c(row_count, rn) ON true;

It also makes clear that one of (end_day, start_day) is redundant.
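To illustrate the redundancy: assuming end_day is exclusive, as the CHECK constraint in the question's reproduction script implies, it can be derived from start_day and the array length. A minimal sketch with made-up literal values:

```sql
-- Illustrative values only; not data from the question.
-- date + integer yields a date, and array_length() returns an integer.
SELECT v.start_day
     , v.start_day + array_length(v.counts, 1) AS derived_end_day
FROM  (VALUES (date '2012-01-01', '{1,2,3}'::bigint[])) AS v(start_day, counts);
-- derived_end_day is 2012-01-04 (three days after start_day)
```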

I use LEFT JOIN because that might allow the query planner to skip the join for a query like:

   SELECT DISTINCT type FROM v_mt_count_by_day;

Else (with a CROSS JOIN or INNER JOIN) it must evaluate the join to see whether rows from the first table are eliminated.
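Whether the planner really skips the join can be checked with EXPLAIN. The sketch below assumes Postgres 9.4+ and uses hypothetical names (demo_counts, v_demo_counts) rather than the tables from the question:

```sql
-- Hypothetical stand-in for the base table in the question.
CREATE TEMP TABLE demo_counts (
  run_id    integer  NOT NULL
, type      integer  NOT NULL
, start_day date     NOT NULL
, counts    bigint[] NOT NULL
, PRIMARY KEY (run_id, type)
);

-- Same shape as the LATERAL view above; rn is bigint, hence the cast.
CREATE TEMP VIEW v_demo_counts AS
SELECT m.run_id, m.type
     , m.start_day + c.rn::int - 1 AS row_date
     , c.row_count
FROM   demo_counts m
LEFT   JOIN LATERAL unnest(m.counts) WITH ORDINALITY c(row_count, rn) ON true;

-- Inspect the plan without running the query.
EXPLAIN SELECT DISTINCT type FROM v_demo_counts;
```

If the plan still contains a Function Scan node under a join node, the join was not removed and every array is unnested before DISTINCT is applied.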

BTW, it's:

SELECT DISTINCT type ...

not:

SELECT DISTINCT(type) ...

Note that this returns a date instead of the timestamp in your original. Easier, and I guess it's what you want anyway?

LATERAL requires Postgres 9.3+, WITH ORDINALITY requires 9.4+. Details:

PostgreSQL unnest() with element number

ROWS FROM in Postgres 9.4+

To explode both columns in parallel safely:

CREATE OR REPLACE VIEW runinfo.v_mt_count_by_day AS
SELECT m.run_id, m.type, m.brand
     , t.row_date::date, t.row_count
FROM   runinfo.mt_count_by_day m
LEFT   JOIN LATERAL ROWS FROM (
          unnest(m.counts)
        , generate_series(m.start_day, m.end_day - 1, interval '1d')  -- end_day is exclusive, as in the original view
       ) t(row_count, row_date) ON true;

The main benefit: this does not derail into a Cartesian product if the two set-returning functions (SRFs) don't return the same number of rows. Instead, the shorter result is padded with NULL values.
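A small standalone illustration of the padding, using literal values that are not from the question (requires Postgres 9.4+ for ROWS FROM):

```sql
-- The array has 3 elements, the series only 2 values.
SELECT t.row_count, t.n
FROM   ROWS FROM (
          unnest('{10,20,30}'::bigint[])
        , generate_series(1, 2)
       ) AS t(row_count, n);
-- Returns 3 rows: (10, 1), (20, 2), (30, NULL).
-- n is NULL in the last row because the series is exhausted.
```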

Again, I can't say whether this would help the query planner with a faster plan for DISTINCT type without testing.

Source: https://stackoverflow.com/questions/28730338/is-it-possible-to-answer-queries-on-a-view-before-fully-materializing-the-view

Tags

postgresql

view

postgresql-performance

optimization

set-returning-functions