Improving a function that UPSERTs based on an input array

I am hoping to get some help improving a method for UPSERTing rows passed in as an array. I'm on Postgres 11.4 with deployment on RDS. I'm got a lot of tables to sort out, but am starting with a simple table for experimentation:

BEGIN;
DROP TABLE IF EXISTS "data"."item" CASCADE;

CREATE TABLE IF NOT EXISTS "data"."item" (
    "id" uuid NOT NULL DEFAULT NULL,
    "marked_for_deletion" boolean NOT NULL DEFAULT false,
    "name_" citext NOT NULL DEFAULT NULL,

CONSTRAINT item_id_pkey
    PRIMARY KEY ("id")
);

CREATE INDEX item_marked_for_deletion_ix_bgin ON "data"."item" USING GIN("marked_for_deletion") WHERE marked_for_deletion = true;

ALTER TABLE "data"."item" OWNER TO "user_change_structure";
COMMIT;

The function, so far, looks like this:

DROP FUNCTION IF EXISTS data.item_insert_array (item[]);

CREATE OR REPLACE FUNCTION data.item_insert_array (data_in item[]) 
  RETURNS int
AS $$
INSERT INTO item (
    id, 
    marked_for_deletion, 
    name_)

SELECT
    d.id, 
    d.marked_for_deletion,
    d.name_

FROM unnest(data_in) d

ON CONFLICT(id) DO UPDATE SET 
    marked_for_deletion = EXCLUDED.marked_for_deletion,
    name_ = EXCLUDED.name_;

SELECT cardinality(data_in); -- array_length() doesn't work. ¯\_(ツ)_/¯

$$ LANGUAGE sql;

ALTER FUNCTION data.item_insert_array(item[]) OWNER TO user_bender;

And a call looks like this:

select * from item_insert_array(

    array[
        ('2f888809-2777-524b-abb7-13df413440f5',true,'Salad fork'),
        ('f2924dda-8e63-264b-be55-2f366d9c3caa',false,'Melon baller'),
        ('d9ecd18d-34fd-5548-90ea-0183a72de849',true,'Fondue fork')
        ]::item[]
    );

I'm trying to develop a system for UPSERT that is injection-safe and that performs well. I'll be replacing a more naive multi-value insert where the INSERT is composed completely on the client side. Meaning, I can't be certain that I'm not introducing defects when concatenating the text. (I asked about this here: Postgres bulk insert/update that's injection-safe. Perhaps a function that takes an array?)

I've gotten this far with the help of various excellent answers:

https://dba.stackexchange.com/questions/224785/pass-array-of-mixed-type-into-stored-function

https://dba.stackexchange.com/questions/131505/use-array-of-composite-type-as-function-parameter-and-access-it

https://dba.stackexchange.com/questions/225176/how-to-pass-an-array-to-a-plpgsql-function-with-variadic-parameter/

I'm not trying for the most complex version of all of this, for instance, I am fine with a single function per table, and fine that every array element has exactly the same format. I'll write code generators to build out everything I need, once I've got the basic pattern sorted out. So, I don't think that I need VARIADIC parameter lists, polymorphic elements, or everything-packaged-as-JSON. (Although I will need to insert JSON from time to time, that's just data.)

I can still use some remedial help with some questions:

Is the code above injection-safe, or do I need to rewrite it in PL/pgSQL to use something like FOREACH with an EXECUTE...USING or FORMAT or quote_literal, etc.?
I'm setting the input array to item[]. That's fine as I'm passing in all of the fields for this tiny table, but I won't always want to pass in all columns. I thought that I could use anyarray as the type within the function, but I can't figure out how to pass in an array in that scenario. Is there a generic array-of-stuff type? I can create custom types for each of these functions, but I'd rather not. Mainly, because I would only use the type in that one situation.
It seems like it would make sense to implement this as a procedure rather than a function so that I can handle the transaction within the function. Am I off base on that?
Any stylistic (or otherwise) on what to return? I'm returning a count now, which is at least a little useful.

I'm out over my skis a bit here, so any general comments will be much appreciated. For clarity, what I'm after is a schema for inserting multiple rows safely and with decent performance that, ideally, doesn't involve a custom type per function or COPY.

Thanks!

I think that your code is fine. To answer your questions:

Since you are not using dynamic SQL, you are automatically safe from SQL injection.
Using anyarray is tricky and usually only works well with C functions. You may be able to get it to work, but it would be complicated (for example, involve metadata queries).

Your idea to supply incomplete rows would need to solve some difficulties:
- How would you tell which array element belongs to which column, if some columns are skipped?
If you want to UPDATE only certain columns in the conflict case, you could supply an additional upd_cols varbit argument that says which columns should be updated.

The update for the fourth column could then look like this:
```
SET col = CASE WHEN get_bit(upd_cols, 3) = 1
               THEN EXCLUDED.col
               ELSE item.col
          END
```
I don't see any advantage in using transaction control inside your code. This would decrease the usefulness of the ~~function~~procedure, because the caller might want to call your function and later roll back the action because of some problem.
I would RETURN void (or use a procedure), because the caller already has the information how big the array is. This is what normal DML statements do.

We've got a lot of different servers pushing up to central tables in Postgres, which adds another wrinkle. What if I add a column to my table:

ALTER TABLE item ADD COLUMN category citext;

Now the table has four columns instead of three.

All of my existing pushes immediately break because now there's a column missing from the inputs. There is a 0% chance that we can update all of the server simultaneously, so that's no an option.

One solution is to create a custom type for each version of the table:

CREATE TYPE item_v1 AS (
    id uuid,
    marked_for_deletion boolean,
    name_ citext);

CREATE TYPE item_v2 AS (
    id uuid,
    marked_for_deletion boolean,
    name_ citext,
    category citext);

And then a function for each type:

CREATE OR REPLACE FUNCTION data.item_insert_array (data_in item_v1[]) 
etc.

CREATE OR REPLACE FUNCTION data.item_insert_array (data_in item_v2[]) 
etc.

I guess you could have a single ginormous method that takes anyarray and uses a CASE to sort out what code to run. I wouldn't do that for a few reasons, but I suppose you could. (I've seen that approach turn gangrenous in more than one language in a real hurry.)

All of that seems like a fair bit of work. Is there a simpler technique I'm missing? I'm imagining that you could submit structured text/XML/JSON, unpack it and work from there. But I would not file that under "simpler."

I'm still working through the design here, obviously. I've written up enough code to test out what I've shown, but want to sort out the details before going back and implementing this on dozens of tables.

Thanks for any help.

来源：https://stackoverflow.com/questions/57517980/improving-a-function-that-upserts-based-on-an-input-array

标签

arrays

postgresql

function

sql-injection