excluding duplicate fields in a join

Submitted by 醉酒当歌 on 2020-01-03 09:57:30

Question


I have a dataset I'm doing analysis on. It turns out it can easily be enriched with demographic and community data which vastly improves the analytical results.

In order to do this I'm joining in demographic and community data before doing analysis. I need to exclude some fields from my core sample set, so my join looks something like this:

select sampledata.c1, 
       sampledata.c2, 
       demographics.*, 
       community.* 
from sampledata
    join demographics using (zip) 
    join community using (fips)

This gets me multiple zip or fips columns in the output which my analysis engine can't deal with. I can't specify each field by hand - the enrichment tables result in hundreds of columns in the end.

I could do select *, but then I'd have all the columns from my sample data which I don't want.

How can I join in my enrichment data without duplicating fields, whilst still selecting the columns I want from my sample table?

One thought I had: if Postgres (my database) could fully qualify each column in the output (like sampledata.c1, demographics.c1, etc.), I would be perfectly happy with that.


Answer 1:


There is no column exclusion syntax in SQL; there is only column inclusion syntax (either * for all columns, or an explicit list of column names).

Generate a list of only the columns you want

However, you can generate the SQL statement with its hundreds of column names, minus the few duplicate columns you do not want, by querying information_schema.columns and using some built-in functions of your database.

SELECT
    'SELECT sampledata.c1, sampledata.c2, ' || ARRAY_TO_STRING(ARRAY(
        -- every demographics column except the join key
        SELECT 'demographics' || '.' || column_name
        FROM information_schema.columns
        WHERE table_name = 'demographics'
        AND column_name NOT IN ('zip')
        UNION ALL
        -- every community column except the join key
        SELECT 'community' || '.' || column_name
        FROM information_schema.columns
        WHERE table_name = 'community'
        AND column_name NOT IN ('fips')
    ), ', ') || ' FROM sampledata JOIN demographics USING (zip) JOIN community USING (fips)'
AS statement;

This only prints the statement; it does not execute it. Copy the result and run it.

If you want to both generate and run the statement in one go, read up on dynamic SQL (the PL/pgSQL EXECUTE command) in the PostgreSQL documentation.
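
As a rough illustration of the generate-and-run approach, the sketch below builds the column list inside an anonymous PL/pgSQL block and stores the result as a view, since a DO block cannot return rows itself. The view name enriched_sample is hypothetical, and the current_schema() filter assumes all three tables live in the current schema; neither comes from the original question.

```sql
-- Sketch only: generate the column list and create a view from it.
-- "enriched_sample" is a hypothetical name; the schema filter is an assumption.
DO $$
DECLARE
    col_list text;
BEGIN
    -- Collect every enrichment column except the join keys,
    -- quoting identifiers safely with format('%I').
    SELECT string_agg(format('%I.%I', table_name, column_name), ', '
                      ORDER BY table_name, ordinal_position)
    INTO col_list
    FROM information_schema.columns
    WHERE table_schema = current_schema()
      AND (   (table_name = 'demographics' AND column_name <> 'zip')
           OR (table_name = 'community'    AND column_name <> 'fips'));

    DROP VIEW IF EXISTS enriched_sample;

    -- Run the generated statement.
    EXECUTE 'CREATE VIEW enriched_sample AS '
         || 'SELECT sampledata.c1, sampledata.c2, ' || col_list
         || ' FROM sampledata JOIN demographics USING (zip)'
         || ' JOIN community USING (fips)';
END
$$;

SELECT * FROM enriched_sample;
```

A view requires unique output column names, so this fails if the enrichment tables share any column name other than the excluded join keys, or if one of them also has a c1 or c2 column. In that case the aliasing approach is the safer starting point.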

Prepend column names with table name

Alternatively, this generates a select list of all the columns, including those with duplicate data, but aliases each one so that the alias includes the table name as well.

SELECT
    'SELECT ' || ARRAY_TO_STRING(ARRAY(
        -- alias each column as <table>_<column> so every output name is unique
        SELECT table_name || '.' || column_name || ' AS ' || table_name || '_' || column_name
        FROM information_schema.columns
        WHERE table_name IN ('sampledata', 'demographics', 'community')
    ), ', ') || ' FROM sampledata JOIN demographics USING (zip) JOIN community USING (fips)'
AS statement;

Again, this only generates the statement. If you want to both generate and run it dynamically, you'll need to brush up on dynamic SQL execution for your database; otherwise, just copy and run the result.

If you really want a dot separator in the column aliases, you'll have to use double-quoted aliases, such as SELECT table_name || '.' || column_name || ' AS "' || table_name || '.' || column_name || '"'. However, double-quoted aliases bring extra complications (case sensitivity, for example), so I used an underscore to separate the table name from the column name in the alias. The aliases can then be treated like regular column names.
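
For completeness, a dot-separated variant might look like the sketch below. It uses PostgreSQL's format() with %I, which double-quotes an identifier only when needed (an alias containing a dot always is), and string_agg with an ORDER BY to keep the column order stable. The current_schema() filter is an assumption about where the tables live.

```sql
-- Sketch only: generate aliases of the form "table.column".
SELECT 'SELECT '
       || string_agg(
              format('%I.%I AS %I',
                     table_name,
                     column_name,
                     table_name::text || '.' || column_name::text),
              ', ' ORDER BY table_name, ordinal_position)
       || ' FROM sampledata JOIN demographics USING (zip) JOIN community USING (fips)'
       AS statement
FROM information_schema.columns
WHERE table_schema = current_schema()  -- assumption: tables are in the current schema
  AND table_name IN ('sampledata', 'demographics', 'community');
```

Any query that later reads these columns has to double-quote them too, for example "demographics.zip", which is the complication mentioned above.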



Source: https://stackoverflow.com/questions/15061733/excluding-duplicate-fields-in-a-join
