Crosstab splitting results due to presence of unrelated field

问题

I'm using postgres 9.1 with tablefunc:crosstab

I have a table with the following structure:

CREATE TABLE marketdata.instrument_data
(
  dt date NOT NULL,
  instrument text NOT NULL,
  field text NOT NULL,
  value numeric,
  CONSTRAINT instrument_data_pk PRIMARY KEY (dt , instrument , field )
)

This is populated by a script that fetches data daily. So it might look like so:

| dt         | instrument        | field     | value |
|------------+-------------------+-----------+-------|
| 2014-05-23 | SGX.MiniJGB.2014U | PX_VOLUME | 1     |
| 2014-05-23 | SGX.MiniJGB.2014U | OPEN_INT  | 2     |

I then use the following crosstab query to pivot the table:

select dt, instrument, vol, oi 
FROM crosstab($$
    select dt, instrument, field, value 
    from marketdata.instrument_data 
    where field = 'PX_VOLUME' or field = 'OPEN_INT'
    $$::text, $$VALUES ('PX_VOLUME'),('OPEN_INT')$$::text
) vol(dt date, instrument text, vol numeric, oi numeric);

Running this I get the result:

| dt         | instrument        | vol | oi |
|------------+-------------------+-----+----|
| 2014-05-23 | SGX.MiniJGB.2014U | 1   | 2  |

The problem: When running this with lot of real data in the table, I noticed that for some fields the function was splitting the result over two rows:

| dt         | instrument        | vol | oi |
|------------+-------------------+-----+----|
| 2014-05-23 | SGX.MiniJGB.2014U | 1   |    |
| 2014-05-23 | SGX.MiniJGB.2014U |     | 2  |

I checked that the dt and instrument fields were identical and produced a work-around by grouping the ouput of the crosstab.

Analysis I've discovered that it's the presence of one other entry in the input table that causes the output to be split over 2 rows. If I have the input as follows:

| dt         | instrument        | field     | value |
|------------+-------------------+-----------+-------|
| 2014-04-23 | EUX.Bund.2014M    | PX_VOLUME | 0     |
| 2014-05-23 | SGX.MiniJGB.2014U | PX_VOLUME | 1     |
| 2014-05-23 | SGX.MiniJGB.2014U | OPEN_INT  | 2     |

I get:

| dt         | instrument        | vol | oi |
|------------+-------------------+-----+----|
| 2014-04-23 | EUX.Bund.2014M    | 0   |    |
| 2014-05-23 | SGX.MiniJGB.2014U | 1   |    |
| 2014-05-23 | SGX.MiniJGB.2014U |     | 2  |

Where it gets really weird...

If I recreate the above input table manually then the output is as we would expect, combined into a single row.

If I run:

update marketdata.instrument_data 
set instrument = instrument 
where instrument = 'EUX.Bund.2014M'

Then again, the output is as we would expect, which is surprising as all I've done is set the instrument field to itself.

So I can only conclude that there is some hidden character/encoding issue in that Bund entry that is breaking crosstab.

Are there any suggestions as to how I can determine what it is about that entry that breaks crosstab?

Edit: I ran the following on the raw table to try and see any hidden characters:

select instrument, encode(instrument::bytea, 'escape')  
from marketdata.bloomberg_future_data_temp 
where instrument = 'EUX.Bund.2014M';

And got:

| instrument     | encode         |
|----------------+----------------|
| EUX.Bund.2014M | EUX.Bund.2014M |

回答1:

Two problems.

1. `ORDER BY` is required.

The manual:

In practice the SQL query should always specify ORDER BY 1,2 to ensure that the input rows are properly ordered, that is, values with the same row_name are brought together and correctly ordered within the row.

With the one-parameter form of crosstab(), ORDER BY 1,2 would be necessary.

2. One column with distinct values per group.

The manual:

crosstab(text source_sql, text category_sql)
source_sql is a SQL statement that produces the source set of data.
...
This statement must return one row_name column, one category column, and one value column. It may also have one or more "extra" columns. The row_name column must be first. The category and value columns must be the last two columns, in that order. Any columns between row_name and category are treated as "extra". The "extra" columns are expected to be the same for all rows with the same row_name value.

Bold emphasis mine. One column. It seems like you want to form groups over two columns, which does not work as you desire.

Solution

I guess you want to group on both leading columns: (dt, instrument). You could play tricks with concatenating or arrays, but that would be slow and / or unreliable. I suggest a cleaner and faster approach with a window function rank() or dense_rank() to produce a single-column unique value per desired group. This is very cheap, because ordering rows is the main cost and the order of the frame is identical to the required order anyway. You can remove the added column in the outer query if desired:

SELECT dt, instrument, vol, oi
FROM   crosstab(
   $$SELECT dense_rank() OVER (ORDER BY dt, instrument) AS rnk
          , dt, instrument, field, value 
     FROM   marketdata.instrument_data 
     WHERE  field IN ('PX_VOLUME', 'OPEN_INT')
     ORDER  BY 1$$
 , $$VALUES ('PX_VOLUME'),('OPEN_INT')$$
   ) vol(rnk int, dt date, instrument text, vol numeric, oi numeric);

More details:

PostgreSQL Crosstab Query

回答2:

You could run a query that replaces irregular characters with an asterisk:

select  regexp_replace(instrument, '[^a-zA-Z0-9]', '*', 'g')
from    marketdata.instrument_data
where   instrument = 'EUX.Bund.2014M'

Perhaps the instrument = instrument assignment discards trailing whitespace. That would also explain why where instrument = 'EUX.Bund.2014M' matches two values that crosstab sees as different.

来源：https://stackoverflow.com/questions/24822498/crosstab-splitting-results-due-to-presence-of-unrelated-field

标签

sql