Performance for RBAR vs. set-based processing with varying transactional sizes


Question


It is conventional wisdom that set based processing of tables should always be preferred over RBAR - especially when the tables grow larger and/or you need to update many rows.

But does that always hold? I have experienced quite a few situations - on different hardware - where set-based processing shows exponential growth in time consumption, while splitting the same workload into smaller chunks gives linear growth.

I think it would be interesting either to be proven totally wrong - if I'm missing something obvious - or, if not, to learn when splitting the workload is worth the effort, and what indicators help decide which approach to use. I personally expect the following factors to be interesting:

  • Size of workload
  • Size and growth of logfile
  • Amount of RAM
  • Speed of disksystem

Any others? The number of CPUs/CPU cores?

Example 1: I have a 12 million row table and I have to update one or two fields in each row with data from another table. If I do this as one simple UPDATE, it takes ~30 minutes on my test box. But it finishes in ~24 minutes if I split it into twelve chunks, i.e.:

WHERE <key> BETWEEN 0 AND 1000000
WHERE <key> BETWEEN 1000000 AND 2000000
...
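
A minimal sketch of that chunked pattern, assuming a hypothetical dbo.BigTable updated from a hypothetical dbo.Source via an integer [key] column (none of these names are from the question). Half-open ranges (>= / <) are used so a boundary row is not updated by two chunks:

DECLARE @lo INT = 0, @chunk INT = 1000000, @max INT;

SELECT @max = MAX([key]) FROM dbo.BigTable;

WHILE @lo <= @max
BEGIN
    UPDATE t
       SET t.Field1 = s.Field1,
           t.Field2 = s.Field2
      FROM dbo.BigTable AS t
      JOIN dbo.Source   AS s ON s.[key] = t.[key]
     WHERE t.[key] >= @lo
       AND t.[key] <  @lo + @chunk;   -- each UPDATE commits on its own in autocommit mode

    SET @lo = @lo + @chunk;
END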

Example 2: A 200+ million row table that also needs several calculations done on practically all rows. If I do the full set all in one go, my box will run for three days and still not be done. If I write a simple C# program to execute the exact same SQL, but with WHERE clauses appended to limit the transaction size to 100k rows at a time, it finishes in ~14 hours.
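
The same batching the C# program performs can also be driven from T-SQL itself; a rough sketch, assuming a hypothetical dbo.HugeTable with a Processed flag column marking rows that have already been handled (schema and column names are illustrative only):

DECLARE @batch INT = 100000, @rows INT = 1;

WHILE @rows > 0
BEGIN
    UPDATE TOP (@batch) dbo.HugeTable
       SET Calc1     = ColA * ColB,   -- stand-in for the real calculations
           Processed = 1              -- mark the rows so the next batch skips them
     WHERE Processed = 0;

    SET @rows = @@ROWCOUNT;           -- 0 once every row has been processed
END

A filtered index on Processed = 0 (or batching on the clustered key instead) keeps each iteration from rescanning the whole table.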

For the record: My results are from the same databases, resting on the same physical hardware, with statistics updated, no changes in indexes, Simple recovery model, etc.

And no, I haven't tried 'true' RBAR, although I probably should - if only to see how long that would really take.


Answer 1:


No, there is no rule that set-based is always faster. We have cursors for a reason (and don't be fooled into believing that a while loop or some other type of looping is really all that different from a cursor). Itzik Ben-Gan has demonstrated a few cases where cursors are much better, particularly for running-totals problems. There are also cases like the ones you describe, where updating 12 million rows is simply too much for SQL Server to handle as a single operation - whether because of memory constraints, log usage or other reasons - without spilling to tempdb, or without settling on a sub-optimal plan because the optimizer stopped searching before it could find a better one.

One of the reasons cursors get a bad rap is that people are lazy and just say:

DECLARE c CURSOR FOR SELECT ...

When they almost always should be saying:

DECLARE c CURSOR 
    LOCAL FORWARD_ONLY STATIC READ_ONLY 
    FOR SELECT ...

This is because those extra keywords make the cursor more efficient for various reasons. Based on the documentation you would expect some of those options to be redundant, but in my testing this is not the case. See this blog post of mine and this blog post from fellow SQL Server MVP Hugo Kornelis for more details.
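
For completeness, a minimal skeleton of how a cursor declared with those options is typically consumed (table, column and variable names here are placeholders, not part of the original answer):

DECLARE @id INT, @value DECIMAL(18, 2);

DECLARE c CURSOR
    LOCAL FORWARD_ONLY STATIC READ_ONLY
    FOR SELECT id, value FROM dbo.SomeTable ORDER BY id;

OPEN c;
FETCH NEXT FROM c INTO @id, @value;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- per-row work goes here
    FETCH NEXT FROM c INTO @id, @value;
END

CLOSE c;
DEALLOCATE c;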

That all said, in most cases your best bet is going to be set-based (or at least chunky set-based as you described above). But for one-off admin tasks (which I hope your 12-million row update is), it is sometimes easier / more efficient to just write a cursor than to spend a lot of effort constructing an optimal query that produces an adequate plan. For queries that will be run a lot as normal operations within the scope of the application, those are worth more effort to try to optimize as set-based (keeping in mind that you may still end up with a cursor).



Source: https://stackoverflow.com/questions/7294338/performance-for-rbar-vs-set-based-processing-with-varying-transactional-sizes
