AI Advice - refactoring - row7

This is the full AI Advice output from sp_BlitzCache's refactoring personality, analyzing usp_ProcessCustomerStaging. The statement analyzed was DECLARE cur CURSOR FOR SELECT StagingID, CustomerID, Name, Email, City FROM dbo.CustomerStaging WHERE Processed = 0.

Summary of the problem

The procedure uses a T-SQL cursor and a row-by-row WHILE loop to process CustomerStaging rows where Processed = 0 (a reconstruction appears after the list below). For each staging row it does:

  1. IF EXISTS (Customer WHERE CustomerID = @CustomerID) THEN UPDATE Customer ELSE INSERT Customer
  2. UPDATE CustomerStaging SET Processed = 1 WHERE StagingID = @StagingID
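
For reference, a minimal reconstruction of the cursor loop being described; the variable names and column types are assumptions, since the original procedure body is not shown in the plan output:

DECLARE @StagingID INT, @CustomerID INT,
        @Name NVARCHAR(100), @Email NVARCHAR(255), @City NVARCHAR(100);

DECLARE cur CURSOR FOR
    SELECT StagingID, CustomerID, Name, Email, City
    FROM dbo.CustomerStaging
    WHERE Processed = 0;

OPEN cur;
FETCH NEXT FROM cur INTO @StagingID, @CustomerID, @Name, @Email, @City;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Per-row upsert against Customer (step 1 above)
    IF EXISTS (SELECT 1 FROM dbo.Customer WHERE CustomerID = @CustomerID)
    BEGIN
        UPDATE dbo.Customer
        SET Name = @Name, Email = @Email, City = @City,
            LastUpdated = GETDATE()
        WHERE CustomerID = @CustomerID;
    END
    ELSE
    BEGIN
        INSERT INTO dbo.Customer (CustomerID, Name, Email, City, LastUpdated)
        VALUES (@CustomerID, @Name, @Email, @City, GETDATE());
    END

    -- Per-row flag update back to staging (step 2 above)
    UPDATE dbo.CustomerStaging
    SET Processed = 1
    WHERE StagingID = @StagingID;

    FETCH NEXT FROM cur INTO @StagingID, @CustomerID, @Name, @Email, @City;
END;

CLOSE cur;
DEALLOCATE cur;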

The execution plan confirms the cursor performs a clustered index scan to build the cursor's result set, then, for every fetched row, does a seek on the Customer PK, a per-row UPDATE or INSERT, and a per-row UPDATE back to the staging table.

Why the current approach is inefficient

Row-by-row (RBAR) processing: The cursor iterates rows one at a time. That results in many repeated single-row operations (existence check, seek, update, insert), each carrying per-statement overhead: plan invocation and context switches between the query processor and the storage engine. These repeated per-row seeks and updates are significantly slower than set-based operations because the optimizer cannot optimize them as a whole and they cannot take advantage of set-based I/O patterns.

Repeated single-row lookups: For each staging row the procedure performs EXISTS/seek against Customer. That causes many repeated index seeks that would be cheaper if done in one set-based join.

Multiple DML statements executed row-by-row: The procedure performs UPDATE or INSERT per row plus an UPDATE to CustomerStaging per row. This can be done in a few statements that operate on whole sets.

Potential correctness/performance issues with duplicates: If CustomerStaging has multiple rows for the same CustomerID, the cursor processes them sequentially and the final Customer row will reflect the last processed staging row. A naive set-based upsert that doesn't consider duplicates could produce different final values.

The plan also shows forced serialization (the CWT_ROWID cursor worktable), row-mode operators, and many tiny operations, all symptomatic of RBAR.

Set-based rewrite

Goal: preserve behavior. Process the set of staging rows where Processed = 0 and produce the same end state: the Customer table updated/inserted as the cursor would have left it, and those staging rows marked Processed = 1.

Important note about duplicates: The cursor processes rows in physical (clustered index) order, which in your plan corresponds to StagingID order. If there are multiple staging rows for the same CustomerID, the cursor applies them sequentially, so the final Customer values come from the last staging row processed (the largest StagingID in the scan). To preserve the same final state, the rewrite picks, for each CustomerID, the staging row with the greatest StagingID and uses that single row for the upsert. This reproduces the iterative "last wins" result the cursor would have produced.

Below is a complete set-based replacement of the stored procedure, using a single transaction and set-based statements. The code:

  • Captures the staging rows to be processed into a temp table, so one snapshot drives all three statements
  • Reduces to one row per CustomerID, choosing the last StagingID via ROW_NUMBER()
  • Performs a single UPDATE for existing customers
  • Performs a single INSERT for missing customers
  • Marks the staging rows as processed

Full CREATE PROCEDURE (set-based)

CREATE PROCEDURE dbo.usp_ProcessCustomerStaging
AS
BEGIN
    SET NOCOUNT ON;

    -- Roll the transaction back automatically if any statement fails,
    -- instead of leaving an open transaction behind on error.
    SET XACT_ABORT ON;

    -- Use a transaction so the set of operations is atomic.
    BEGIN TRAN;

    /*
      1) To preserve the cursor's "last wins" behavior when there
         are multiple staging rows per CustomerID we pick the row
         with the highest StagingID per CustomerID
         (ROW_NUMBER() ORDER BY StagingID DESC).
      2) A CTE is only in scope for the single statement that
         follows it, so the snapshot of rows that had
         Processed = 0 is materialized into a temp table and all
         three statements run against that fixed set.
    */

    SELECT
        StagingID,
        CustomerID,
        Name,
        Email,
        City,
        ROW_NUMBER() OVER (
            PARTITION BY CustomerID
            ORDER BY StagingID DESC) AS rn
    INTO #ToProcess
    FROM dbo.CustomerStaging
    WHERE Processed = 0;

    -- 1) Update existing customers
    UPDATE C
    SET
        C.Name = T.Name,
        C.Email = T.Email,
        C.City = T.City,
        C.LastUpdated = GETDATE()
    FROM dbo.Customer AS C
    INNER JOIN #ToProcess AS T
        ON C.CustomerID = T.CustomerID
    WHERE T.rn = 1;

    -- 2) Insert customers that do not exist yet
    INSERT INTO dbo.Customer (CustomerID, Name, Email, City, LastUpdated)
    SELECT
        T.CustomerID,
        T.Name,
        T.Email,
        T.City,
        GETDATE()
    FROM #ToProcess AS T
    LEFT JOIN dbo.Customer AS C
        ON C.CustomerID = T.CustomerID
    WHERE T.rn = 1
      AND C.CustomerID IS NULL;

    -- 3) Mark every captured staging row as processed
    --    (all rows in the snapshot, not just rn = 1)
    UPDATE S
    SET S.Processed = 1
    FROM dbo.CustomerStaging AS S
    INNER JOIN #ToProcess AS T
        ON S.StagingID = T.StagingID;

    COMMIT TRAN;
END;

Why this is better

  • Fewer round trips and fewer plan executions: four set-based statements (capture, UPDATE, INSERT, UPDATE) instead of thousands of small per-row operations.
  • The optimizer can produce efficient set-based plans (hash joins/merge joins/partitioned work) and schedule IO efficiently.
  • GETDATE() is evaluated once per statement, so every row touched by a statement gets the same LastUpdated value, and a single evaluation is cheaper than one call per row inside a cursor loop.
  • The ROW_NUMBER trick preserves the cursor's "last wins" behavior for duplicate CustomerID rows.
  • The final UPDATE on CustomerStaging sets Processed = 1 for all selected rows in one statement, avoiding per-row updates.

Additional notes and optional improvements

Concurrency: The original cursor had similar concurrency semantics. If multiple concurrent executions must not process the same staging row twice, consider a "claiming" pattern: atomically mark the rows you are about to process (e.g., UPDATE CustomerStaging SET Processed = 2 WHERE Processed = 0), process the rows WHERE Processed = 2, and finally set Processed = 1. An OUTPUT clause lets you claim and capture the rows in a single atomic statement, as in the sketch below.
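
A minimal sketch of the OUTPUT-based claiming variant; the table variable shape, the NVARCHAR sizes, and the Processed = 2 "in flight" value are assumptions, since the staging schema is not shown:

-- Atomically flip Processed to 2 ("claimed") and capture the claimed
-- rows in the same statement, so concurrent runs cannot grab them too.
DECLARE @Claimed TABLE (
    StagingID  INT PRIMARY KEY,
    CustomerID INT,
    Name       NVARCHAR(100),
    Email      NVARCHAR(255),
    City       NVARCHAR(100)
);

UPDATE dbo.CustomerStaging
SET Processed = 2
OUTPUT inserted.StagingID, inserted.CustomerID,
       inserted.Name, inserted.Email, inserted.City
INTO @Claimed (StagingID, CustomerID, Name, Email, City)
WHERE Processed = 0;

-- ...run the set-based upsert against @Claimed instead of #ToProcess,
-- then finish by setting Processed = 1 for the claimed rows:
UPDATE S
SET S.Processed = 1
FROM dbo.CustomerStaging AS S
INNER JOIN @Claimed AS C
    ON S.StagingID = C.StagingID;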

Indexing: If queries for unprocessed rows are common and most rows have Processed = 1, a filtered index helps, e.g. CREATE INDEX IX_CustomerStaging_Processed0 ON dbo.CustomerStaging(Processed) WHERE Processed = 0, or better, a covering filtered index that INCLUDEs the columns the snapshot query reads, as sketched below.
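
For example (a sketch; the index name is arbitrary, and it assumes StagingID is the clustered key, so it travels with the index automatically):

-- Covering filtered index for locating and reading unprocessed rows.
CREATE NONCLUSTERED INDEX IX_CustomerStaging_Processed0
    ON dbo.CustomerStaging (Processed)
    INCLUDE (CustomerID, Name, Email, City)
    WHERE Processed = 0;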

Error handling: The cursor version would stop at the first runtime error. If you rely on specific failure behavior, wrap the set-based version in TRY/CATCH to match it, as in the sketch below.
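
A minimal TRY/CATCH shape for the set-based version (a sketch; the bare THROW re-raises the original error, so adapt it to your logging conventions):

BEGIN TRY
    BEGIN TRAN;
    -- ... the SELECT INTO, UPDATE, INSERT, UPDATE from above ...
    COMMIT TRAN;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0
        ROLLBACK TRAN;
    THROW;   -- re-raise so the caller still sees the failure
END CATCH;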

MERGE: A MERGE can also perform the upsert in one statement, but MERGE has known concurrency and semantics caveats in some SQL Server versions. The pattern above uses separate UPDATE and INSERT statements, which is widely recommended for clarity and predictability; a MERGE equivalent is sketched below for reference.
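
For reference, a sketch of the MERGE equivalent, using the #ToProcess temp table from the rewrite above (the HOLDLOCK hint is the commonly suggested mitigation for MERGE's upsert race window):

MERGE dbo.Customer WITH (HOLDLOCK) AS C
USING (SELECT CustomerID, Name, Email, City
       FROM #ToProcess
       WHERE rn = 1) AS T
    ON C.CustomerID = T.CustomerID
WHEN MATCHED THEN
    UPDATE SET C.Name = T.Name,
               C.Email = T.Email,
               C.City = T.City,
               C.LastUpdated = GETDATE()
WHEN NOT MATCHED THEN
    INSERT (CustomerID, Name, Email, City, LastUpdated)
    VALUES (T.CustomerID, T.Name, T.Email, T.City, GETDATE());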

GETDATE() semantics: If preserving the small per-row differences in LastUpdated matters (the cursor loop called GETDATE() once per row), you can compute a per-row timestamp, e.g. DATEADD(ms, ROW_NUMBER() OVER (ORDER BY StagingID), GETDATE()), and use that in the set operations instead.
