PostgreSQL Joining Between Two Values

两盒软妹~` 提交于 2019-12-18 09:24:35

问题


I have the following tables and am trying to look up county codes for a list of several hundred thousand cities.

create table counties (
  zip_code_from  char(5) not null,
  zip_code_thru  char(5) not null,
  county_code    char(3) not null
);

create table cities (
  city      text    not null,
  zip_code  char(5) not null
);

My first approach was using a "between" in the join:

select
  ci.city, ci.zip_code, co.county_code
from
  cities ci
  join counties co on
    co.zip_code between ci.zip_code_from and ci.zip_code_thru

I know in the Oracle world, this was frowned upon, and indeed the performance appears to be miserable. It takes over 8 minutes to process around 16,000 cities. The zip code table has around 80,000 records. I'm guessing that this syntax is a glorified cross-join?

Both the from and thru codes are indexed, and I have control over the structures, so I can change the table if it helps.

My only other thought is to go ahead and expand the table out to all possible values -- something similar to this:

select
  generate_series (
    cast (zip_code_from as int),
    cast (zip_code_thru as int)
  ) as zip_code,
  *
from counties

This would expand the data to over 200,000 records, which isn't a big deal, but I wasn't sure if this is my only recourse to have queries that aren't horrible.

I'm guessing that even doing that on the fly and not having indexes would be preferable to the between in my join, but I was hoping there is an alternative, either in terms of my SQL and/or something I can do with the structure of the table itself.

I've seen this question posted for other DBMS platforms, but I've been able to pull off mini-miracles with PostgreSQL that weren't possible (or practical) in other databases, so I was hopeful there is something I've missed.


回答1:


Months later, this has cropped its head again, and I decided to test some of my theories.

The original query:

select
  ci.city, ci.zip_code, co.fips_code
from
  cities ci
  join counties co on
    ci.zip_code between co.from_zip_code and co.thru_zip_code

Does in fact implement a cartesian. The query returns 34,000 rows and takes 597 seconds.

If I "pre-explode" the zip code ranges into discrete records:

with exploded_zip as (
  select
    generate_series (
      cast (from_zip_code as int),
      cast (thru_zip_code as int)
    )::text as zip_code,
    *
  from counties
)
select
  ci.city, ci.zip_code, co.fips_code
from
  cities ci
  join exploded_zip co on
    ci.zip_code = co.zip_code

The query returns the exact same rows but finishes in 2.8 seconds.

So it seems the bottom line is that using a between in a join (or any inequality) is a really bad idea.



来源:https://stackoverflow.com/questions/31030589/postgresql-joining-between-two-values

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!