Modeling set-based code-listings of SQL data manipulation operations

Technologies like LINQ do a good job of describing relational data queries, with types such as IQueryable, IGrouping, and IOrderedQueryable modeling projections, selections, aggregations, sorting, etc. These concepts from relational algebra allow us to express a fairly arbitrary query in one language on one machine and execute it in a different language (~SQL) on a different machine.

It would be nice to be able to do the same thing for more complicated multi-part queries, and even for data manipulation commands involving INSERTs, UPDATEs, and DELETEs. Such a description would capture the full operation without the overhead of first retrieving/hydrating the data in the application layer, which is typical of Object-Relational Mappers (ORMs).

In the application, we could describe an operation like "Delete all Customers (and their Orders) whose most recent Order is older than 2 years" (and further, assume cascading deletes are not turned on for that relationship). This is certainly doable efficiently over, say, ADO with a T-SQL script, but cannot be done in ORMs without the overhead of selecting, transmitting, hydrating, and tracking the data in the application layer and possibly issuing individual delete commands. (Maybe there are some optimizations available for ORMs that can do this more efficiently in certain cases, but generally AFAIK they cannot.) A problem with issuing the T-SQL script, of course, is that there's no type checking within the statement, nor for any parameters or return data.
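For concreteness, here is a minimal sketch of the "script over a raw connection" approach for that operation. It uses Python's sqlite3 module rather than ADO and T-SQL, and the table and column names (customers, orders, customer_id, placed_on) and the helper names are assumptions made up for illustration. The deletes run entirely inside the database, with no rows hydrated in the application, but the SQL is just strings, so nothing checks the statement, parameter, or result types.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory database with hypothetical customers/orders tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, last_name TEXT);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customers(id),
            placed_on TEXT
        );
        INSERT INTO customers VALUES (1, 'Active'), (2, 'Stale'), (3, 'NoOrders');
        INSERT INTO orders VALUES
            (10, 1, '2010-05-01'), (11, 1, '2024-03-15'), (12, 2, '2011-07-04');
    """)
    return conn


def delete_stale_customers(conn, cutoff):
    """Delete customers (and their orders) whose newest order predates cutoff.

    cutoff is an ISO-8601 date string; such strings compare in date order.
    Returns the number of customers deleted.
    """
    # The connection context manager commits on success and rolls back
    # the open transaction if any statement raises.
    with conn:
        # Intermediate result: the ids of the customers to remove. It has to
        # be captured before the orders that identify them are deleted.
        conn.execute("CREATE TEMP TABLE IF NOT EXISTS stale (customer_id INTEGER)")
        conn.execute("DELETE FROM stale")
        conn.execute(
            "INSERT INTO stale "
            "SELECT customer_id FROM orders "
            "GROUP BY customer_id HAVING MAX(placed_on) < ?",
            (cutoff,),
        )
        # No cascading delete is assumed, so child rows go first.
        conn.execute(
            "DELETE FROM orders WHERE customer_id IN (SELECT customer_id FROM stale)"
        )
        cur = conn.execute(
            "DELETE FROM customers WHERE id IN (SELECT customer_id FROM stale)"
        )
    return cur.rowcount
```

Calling delete_stale_customers(make_demo_db(), "2022-01-01") removes the one customer whose latest order is older than the cutoff. A customer with no orders at all is left alone, which is one reading of the stated rule.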

An earth-shattering advantage to being able to model these arbitrary commands for remote execution, besides reduced runtime processing and network chatter overhead, is that domain-wide invariants could be encoded and registered in the application layer that could then be automatically emitted along with any ad-hoc command.

We might have a silly domain invariant A that says "For all Customers, the sum per Customer of the Orders' prices cannot exceed $10,000,000.00 unless the whoa bit is 1" and another silly domain invariant B that says "For all Customers, lastName can't contain more than three underscores" (even though these could perhaps be enforced through native mechanisms such as check constraints or triggers in the database engine itself). Then when we issue a command to update an existing Order's price, the system can know through static analysis that invariant A might be violated as a result of the command and that invariant B could not, and therefore the system would emit an assertion of A after the original command. The whole emitted script would be wrapped in a transaction (for rollback if the assertion fails), and the invariant could be automatically narrowed to assert the rule only against the specific Customer's set of Orders rather than unnecessarily rechecking all Customers' totals. I believe this kind of optimized, centralized, DRY business rule encoding/enforcement is not possible in today's products.
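A rough sketch of the static-analysis step, assuming each registered invariant declares which (table, column) pairs it reads and each command declares what it writes. The Invariant and Command types, the function names, and the price and customerId columns are hypothetical; only whoa and lastName come from the rules above. It decides which invariants need an assertion, not how to narrow or emit them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Invariant:
    name: str
    reads: frozenset  # (table, column) pairs the rule depends on


@dataclass(frozen=True)
class Command:
    kind: str           # "insert", "update" or "delete"
    table: str
    columns: frozenset  # columns written; only consulted for updates


def may_violate(invariant, command):
    """Conservatively decide whether a command could break an invariant."""
    if command.kind in ("insert", "delete"):
        # Adding or removing whole rows touches every column of the table.
        return any(table == command.table for table, _ in invariant.reads)
    # An update matters only if it writes a column the invariant reads.
    return any((command.table, col) in invariant.reads for col in command.columns)


def invariants_to_assert(registry, command):
    """Names of the invariants that must be re-checked after the command."""
    return [inv.name for inv in registry if may_violate(inv, command)]


# Invariant A: per-Customer sum of Order prices, unless the whoa bit is set.
A = Invariant("A", frozenset({("Orders", "price"),
                              ("Orders", "customerId"),
                              ("Customers", "whoa")}))
# Invariant B: lastName has at most three underscores.
B = Invariant("B", frozenset({("Customers", "lastName")}))
REGISTRY = [A, B]

update_price = Command("update", "Orders", frozenset({"price"}))
rename_customer = Command("update", "Customers", frozenset({"lastName"}))
```

Here invariants_to_assert(REGISTRY, update_price) selects only A and invariants_to_assert(REGISTRY, rename_customer) selects only B. The check is deliberately coarse: it treats a delete from Orders as a possible violation of A even though removing orders can only lower a total.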

In order to realize this potential, I figure we need an algebra (beyond the relational algebra of SELECT) that describes arbitrary data manipulation via INSERT, UPDATE, and DELETE (collectively referred to as DML), and even things like intermediate computed values, such as the temporary tables used in a multi-statement T-SQL listing.

Unfortunately, I have been unable to find research about formalizing DML into an algebra, modeling it, or meta-programming of this kind, although Tutorial D and jOOQ seem to offer something to the discussion; I just don't know how to extract it. Are you able to lend clarity to this?

Some discussion which I think was valuable, but I want to avoid filling up the comments with it:

Are you suggesting that domain models aren't a good fit for protecting invariants and establishing transactional boundaries? The invariants you mentioned aren't hard to protect using a proper domain model. What problem are you trying to avoid exactly?

– plalx

As I understand it, large domains in typical DDD require bounded contexts to avoid having to hydrate large subsets of the data into the application layer for validation. I am trying to avoid that overhead. Also, domain invariants must be non-trivially restated for each bounded context, which is error-prone. By modeling the operations for remote execution, we get smarter/smaller/faster/more correct code.

In some core library, the domain could be modeled and the invariants registered. Then consumers of that library, such as for a web service, could then construct type-checked descriptions of arbitrary operations without explicit consideration for bounded contexts or particular invariants. The domain core offers to its consumers "this is the full range of what you can do over this domain" and (perhaps) the service code offers to its clients "these are the exact features we're offering".

– uosɐſ

I'm not sure if you understood correctly what a Bounded Context is and how they might communicate with each other. "Also, domain invariants must be non-trivially restated/maintained for each bounded context which is error-prone" There's usually just one context that has data ownership, and that context shall be responsible for invariants involving its own data. For instance, imagine a company that sells goods on the Internet. They might have an Inventory context where products get maintained and a Shopping context that listens for newly available products from the Inventory.

– plalx

I'm not really arguing against current DDD techniques, so I'm not choosing excellent examples against them. I'm more interested in this alternative arrangement, which I intuit would be more natural and advanced than current DDD techniques. I've seen data models that are extremely intertwined and don't offer obvious boundaries (perhaps poorly designed, OK). I expect that this way could be boundaryless AND more performant.

– uosɐſ

If there was a rule that a Product name couldn't contain the word "propaganda", it would be enforced only in the Inventory context. If we were to duplicate the invariants of every context in every other context, it would indeed become a maintenance nightmare.

– plalx

But you plausibly might have a bounded context centered on Customers and a second bounded context centered on Orders. And maybe the $10,000,000.00 Limit I mentioned is made to be a column in Customer (and therefore variable), so this business rule can be violated in two ways: either by dropping that Limit on Customer or by increasing totals in Order. So non-trivially reciprocal rules must check for violations in either bounded context depending on the change. Our system could decide to skip the assertion if Prices and Limits aren't changed, which would be pretty slick, no? In traditional DDD, you might also need some optimized variants for bulk manipulations (Add an Order of $1000 to every Customer), which could be automatically derived by our new system.

– uosɐſ

Unlikely as it might seem, the one thing you don't need is "something beyond" relational algebra. It's not a theoretical problem at all, but one of imagination and engineering. The problem you're talking about crosses several domains: programming language, library support, and DBMS. It could be done (and should). But first it needs to be commonly understood as realistic and desirable, and we're not there yet.

As far as the algebra is concerned, all that's missing is assignment. If you've read Date's Third Manifesto, you may recall that insert/update/delete are just variations on assignment:

S += f(R)        -- insert
S += f(R) - g(S) -- update
S -= f(R)        -- delete

(Python does a fair job of demonstrating that with the set class in its standard library, btw, except that you don't get operators for sets-of-tuples out of the box.)
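As a small illustration of that remark, the three assignments can be acted out with Python's built-in set, treating a relation as a set of tuples. The helper names and the sample rows are made up for this sketch, and the update is read as "take away the old versions of the rows, add the new ones", which is one interpretation of the notation above.

```python
def insert(s, new_rows):
    """S += f(R): the relation gains the new rows (set union)."""
    return s | new_rows


def delete(s, doomed_rows):
    """S -= f(R): the relation loses the given rows (set difference)."""
    return s - doomed_rows


def update(s, old_rows, new_rows):
    """Replace the old versions of some rows with their new versions."""
    return (s - old_rows) | new_rows


# A relation of (customer_id, last_name) tuples.
customers = {(1, "Smith"), (2, "Jones"), (3, "Brown")}

# insert
customers = insert(customers, {(4, "Young")})

# update: rename customer 2
old = {row for row in customers if row[0] == 2}
new = {(cid, "Jones-Lee") for (cid, _name) in old}
customers = update(customers, old, new)

# delete: remove customer 3
customers = delete(customers, {row for row in customers if row[0] == 3})
```

Each operation rebinds the variable to a new set value, which is the sense in which insert, update, and delete are "just assignment". What Python lacks out of the box is project, join, and the other operators over sets of tuples.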

So it's not a theoretical problem; the algebra is fine. And you're not asking purely about syntax, either. What you want, it seems to me, is a DBMS that you can manipulate functionally, without SQL -- and SQL generators -- acting as an intermediary. Wouldn't it be nice if the tables in your database appeared as variables in your programming language, and there was a relational-algebra library (for that language) that supported select, project, and join?

For that matter, why not incorporate relational operators into the language proper? Why, 40 years after relational theory was invented, is its use limited to databases? That in fact has been a lament of the database community for decades. Although it's been done -- cf. Datalog, for example -- the surfeit of new languages we've seen in recent years has been notable for continuing the C tradition of no support for set-theoretic operations.

As it happens, though, just having relations and relational operators built into the language wouldn't be enough. Programming languages generally expect to define their variables, and to own them exclusively. That's practically the definition of a programming language: something that defines and manipulates chunks of memory, the lifetime of which is bounded by the execution of the program. And the interesting data usually starts "out there", somewhere, not in program memory.

So, what you really, really want is to manipulate data "in the database" as though those tables were program variables (otherwise known as action at a distance), and then some super-convenient, ideally transparent, way to move the results into program memory. Like, oh, assignment. And to make any headway at all in that direction, you need the cooperation of the DBMS.

To interact with a typical DBMS these days, you formulate your question in its language (usually SQL) and fetch the output row by row into program memory. It's an I/O model: write string, read results. To take that I/O out of the programming model, you need a different API, something more like RPC. If the programming language and the DBMS use the same data model (relations) and functions (relational algebra) and data types, then you have a fighting chance at operating on both remote and local data in the same way.

That's the suite:

language support for relations and relational operations
language recognition of local and out-of-machine variables
DBMS support to programmatically expose table definitions, such that a compiler/interpreter can "link" to them, as library symbols
DBMS support for remote invocation of relational operators, function by function, not statement by statement

You may have noticed that, to a reasonable approximation, no one is trying to do the above. Language designers universally ignore set theory and predicate logic. DBMS vendors -- and popular free projects -- are shackled to SQL, utterly uninterested in fixing SQL's set-theoretic flaws or exposing their systems through a logical-function API. The furthest thing from anyone's mind is developing a congruent set of types and operators.

So what do we have instead? LINQ is a good example of a dancing bear, glopping together SQL from strings and primitive types, squirting it over a pipe, and exposing database tables to row-by-row operations supplied by the host language. It's a pretty good show, given the reality of today's environment. But, as your question suggests, the novelty wears off, and the work doesn't get any easier. You might want to hold onto your ticket, though: judging progress by its current speed and direction, you'll be in the same seat for another 40 years.

Source: https://stackoverflow.com/questions/32894724/modeling-set-based-code-listings-of-sql-data-manipulation-operations

Tags

database

domain-driven-design

metaprogramming

dsl