AI Advice - refactoring - row6

This is the full AI Advice output from sp_BlitzCache's refactoring personality, analyzing usp_ProcessCustomerStaging. The statement analyzed was FETCH NEXT FROM cur INTO @StagingID, @CustomerID, @Name, @Email, @City.

Summary

The current procedure usp_ProcessCustomerStaging uses a cursor plus a WHILE loop: for each staging row it runs an IF EXISTS check, then either an UPDATE or an INSERT against Customer, and then updates the staging row. This means repeated single-row lookups, many context switches, and row-by-row DML.

I converted the logic into set-based operations: compute one "effective" row per customer (the latest staging row per CustomerID), run a single UPDATE for existing customers and a single INSERT for new customers, then mark all processed staging rows in one set-based UPDATE. This preserves the end state produced by the original procedure (the Customer table reflects the last staging row per CustomerID, and all staging rows that were Processed = 0 are set to 1).

Findings (from the plan + query text)

  • Cursor + row-by-row loop over CustomerStaging where Processed = 0.
  • For each row:
    • A single-row EXISTS check against Customer (clustered index seek).
    • Either an UPDATE or an INSERT into Customer (clustered index operations).
    • Then UPDATE CustomerStaging SET Processed = 1 for the current StagingID.
  • Plan shows many repeated seeks and single-row operations. The plan also shows a clustered index scan of CustomerStaging to feed the cursor.
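
For context, the row-by-row pattern these findings describe looks roughly like the sketch below. This is a hypothetical reconstruction based on the findings and the analyzed FETCH statement, not the actual body of usp_ProcessCustomerStaging; the column data types are assumptions for illustration only.

-- Hypothetical reconstruction of the cursor pattern described above;
-- NOT the actual source of usp_ProcessCustomerStaging. Data types assumed.
DECLARE @StagingID INT, @CustomerID INT,
        @Name NVARCHAR(100), @Email NVARCHAR(255), @City NVARCHAR(100);

DECLARE cur CURSOR FOR
    SELECT StagingID, CustomerID, Name, Email, City
    FROM dbo.CustomerStaging
    WHERE Processed = 0;

OPEN cur;
FETCH NEXT FROM cur INTO @StagingID, @CustomerID, @Name, @Email, @City;

WHILE @@FETCH_STATUS = 0
BEGIN
    IF EXISTS (SELECT 1 FROM dbo.Customer WHERE CustomerID = @CustomerID)
        UPDATE dbo.Customer
        SET Name = @Name, Email = @Email, City = @City, LastUpdated = GETDATE()
        WHERE CustomerID = @CustomerID;
    ELSE
        INSERT INTO dbo.Customer (CustomerID, Name, Email, City, LastUpdated)
        VALUES (@CustomerID, @Name, @Email, @City, GETDATE());

    -- Per-row bookkeeping: mark the current staging row processed
    UPDATE dbo.CustomerStaging SET Processed = 1 WHERE StagingID = @StagingID;

    FETCH NEXT FROM cur INTO @StagingID, @CustomerID, @Name, @Email, @City;
END

CLOSE cur;
DEALLOCATE cur;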

Why this is inefficient

Row-by-row processing (RBAR): Each staging row incurs its own CPU and context-switch overhead plus its own DML statement. For N staging rows that means on the order of N separate index seeks and single-row modifications inside the engine instead of a handful of set-based operations.

Repeated single-row lookups: The EXISTS check does a clustered index seek per staging row.

Many separate UPDATE/INSERT operations: These increase logging and locking overhead compared with grouping the changes into a few batch operations.

Cursors generally do not scale: the per-row memory and CPU cost grows with the number of rows and with concurrency.

Plan warnings (Cursor; optimistic; dynamic; multiple plans) confirm the cursor usage and plan instability risk.

Behavioral considerations to preserve original behavior

The original procedure processes staging rows in the cursor order (clustered PK order by StagingID). If there are multiple staging rows for the same CustomerID, the last one processed (largest StagingID) determines the final Customer.* values in the original.

The original marks every staging row processed (Processed = 1), not just the last per CustomerID.

GETDATE() in the original is called per row; the set-based version evaluates GETDATE() once for the batch. If you need distinct per-row LastUpdated timestamps reflecting each row's processing time, that requires more complex logic; capturing a single timestamp for the batch is usually acceptable and is the typical choice in refactors like this.

Set-based rewrite (full T-SQL stored procedure)

This version:

  • Captures the latest row per CustomerID among the unprocessed staging rows (Processed = 0) using ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY StagingID DESC).
  • Updates Customer for matching CustomerID rows in a single set-based UPDATE.
  • Inserts new Customer rows in a single set-based INSERT using NOT EXISTS (avoids MERGE duplicate-key issues if staging contains duplicate CustomerIDs).
  • Marks all staging rows that were Processed = 0 as Processed = 1 in a single UPDATE.

No cursor, no loop.

Full procedure code

CREATE PROCEDURE dbo.usp_ProcessCustomerStaging_SetBased
AS
BEGIN
    SET NOCOUNT ON;

    /*
      Approach:
      1) Select only staging rows with Processed = 0.
      2) For each CustomerID, pick the latest staging row 
         (highest StagingID) as the authoritative row.
      3) Do a set-based UPDATE for existing customers, 
         then a set-based INSERT for new customers.
      4) Mark all staging rows that were Processed = 0 
         as Processed = 1.
    */

    -- Single timestamp for all changes in this batch:
    DECLARE @Now DATETIME = GETDATE();

    -- One authoritative row per CustomerID: the latest (highest StagingID)
    -- unprocessed staging row. Materialized into a temp table so both the
    -- UPDATE and the INSERT below can reuse it (a CTE is only visible to
    -- the single statement that immediately follows it).
    SELECT StagingID, CustomerID, Name, Email, City
    INTO #Dedup
    FROM
    (
        SELECT
            cs.StagingID,
            cs.CustomerID,
            cs.Name,
            cs.Email,
            cs.City,
            ROW_NUMBER() OVER (
                PARTITION BY cs.CustomerID
                ORDER BY cs.StagingID DESC) AS rn
        FROM dbo.CustomerStaging cs
        WHERE cs.Processed = 0
    ) AS Stg
    WHERE rn = 1;

    -- 1) Update existing customers in a single statement
    UPDATE C
    SET
        C.Name = D.Name,
        C.Email = D.Email,
        C.City = D.City,
        C.LastUpdated = @Now
    FROM dbo.Customer AS C
    INNER JOIN #Dedup AS D
        ON C.CustomerID = D.CustomerID;

    -- 2) Insert new customers
    INSERT INTO dbo.Customer (CustomerID, Name, Email, City, LastUpdated)
    SELECT
        D.CustomerID,
        D.Name,
        D.Email,
        D.City,
        @Now
    FROM #Dedup AS D
    WHERE NOT EXISTS (
        SELECT 1
        FROM dbo.Customer AS C2
        WHERE C2.CustomerID = D.CustomerID
    );

    -- 3) Mark all processed staging rows as Processed = 1
    UPDATE cs
    SET cs.Processed = 1
    FROM dbo.CustomerStaging cs
    WHERE cs.Processed = 0;

    -- No explicit transaction: each statement autocommits, which is
    -- closest to the per-row autocommit behavior of the original
    -- cursor code.
END
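
A minimal smoke test, assuming dbo.Customer and dbo.CustomerStaging exist with the columns used above:

EXEC dbo.usp_ProcessCustomerStaging_SetBased;

-- Verify: no unprocessed staging rows should remain, and Customer should
-- reflect the latest staging row per CustomerID.
SELECT COUNT(*) AS UnprocessedRows
FROM dbo.CustomerStaging
WHERE Processed = 0;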

Notes, alternatives and caveats

MERGE alternative: MERGE fails if the source contains duplicate join keys that match the same target row, so it is only safe when the source is deduplicated. Because the #Dedup step above guarantees one row per CustomerID, you could MERGE #Dedup into Customer safely (a sketch follows). Many shops still prefer the UPDATE + INSERT pattern for clarity and to avoid MERGE corner cases.
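
A hedged sketch of that MERGE variant, assuming the @Now variable and the #Dedup temp table from the procedure above:

-- Sketch only: relies on #Dedup holding exactly one row per CustomerID.
MERGE dbo.Customer AS T
USING #Dedup AS S
    ON T.CustomerID = S.CustomerID
WHEN MATCHED THEN
    UPDATE SET
        T.Name = S.Name,
        T.Email = S.Email,
        T.City = S.City,
        T.LastUpdated = @Now
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerID, Name, Email, City, LastUpdated)
    VALUES (S.CustomerID, S.Name, S.Email, S.City, @Now);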

GETDATE() semantics: The original called GETDATE() per row at update/insert time. This refactor uses a single timestamp (@Now) for all affected rows, which makes LastUpdated consistent for the batch. If you genuinely need per-row timestamps, you can derive one per deduplicated row from a ROW_NUMBER offset (a sketch follows), but that's rarely necessary and complicates the logic.
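
If per-row timestamps really are required, one option (a sketch only, and rarely worth the complexity) is to offset the batch timestamp by each row's sequence number:

-- Sketch: give each deduplicated row a distinct LastUpdated value by adding
-- its ROW_NUMBER (in StagingID order) as milliseconds to @Now.
-- Note: DATETIME rounds to ~3 ms increments; use DATETIME2 if that matters.
SELECT
    D.CustomerID,
    DATEADD(MILLISECOND,
            ROW_NUMBER() OVER (ORDER BY D.StagingID),
            @Now) AS PerRowLastUpdated
FROM #Dedup AS D;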

Duplicates in staging: This rewrite uses the latest (by StagingID) staging row per CustomerID to determine the final state, which matches the original cursor (processed in ascending StagingID; last one wins).

Marking staging processed: The final UPDATE marks all rows that were Processed = 0 at the moment the statements run — that matches original behavior where each staging row is updated to Processed = 1.

Concurrency: Because the original issued many tiny DML statements, its concurrency characteristics differ from the single-shot set-based DML in this rewrite. Leaving the statements unwrapped (as above) stays closest to the original's autocommit behavior. If you want atomicity between the Customer changes and marking staging rows processed, wrap the statement group in an explicit transaction and use SET XACT_ABORT ON (a sketch follows).
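
A sketch of that wrapper, assuming the @Now variable and #Dedup temp table from the procedure above are already in place:

SET XACT_ABORT ON;   -- any runtime error aborts and rolls back the transaction

BEGIN TRY
    BEGIN TRANSACTION;

    UPDATE C
    SET C.Name = D.Name, C.Email = D.Email, C.City = D.City, C.LastUpdated = @Now
    FROM dbo.Customer AS C
    INNER JOIN #Dedup AS D ON C.CustomerID = D.CustomerID;

    INSERT INTO dbo.Customer (CustomerID, Name, Email, City, LastUpdated)
    SELECT D.CustomerID, D.Name, D.Email, D.City, @Now
    FROM #Dedup AS D
    WHERE NOT EXISTS (SELECT 1 FROM dbo.Customer AS C2
                      WHERE C2.CustomerID = D.CustomerID);

    UPDATE cs
    SET cs.Processed = 1
    FROM dbo.CustomerStaging cs
    WHERE cs.Processed = 0;

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION;
    THROW;
END CATCH;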

Performance: The set-based UPDATE/INSERT dramatically reduces context switches and repeated seeks, and it scales much better for thousands of staging rows. Expect much lower CPU and far fewer logical reads per row; you can verify this with STATISTICS IO and TIME, as sketched below.
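
A simple way to compare the two versions (procedure names assumed; reset the test data between runs, since the first run marks all staging rows processed):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

EXEC dbo.usp_ProcessCustomerStaging;            -- original cursor version
-- (reset Processed flags / reload staging data here)
EXEC dbo.usp_ProcessCustomerStaging_SetBased;   -- set-based rewrite

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;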

Index and statistics suggestions

Ensure Customer has a clustered primary key on CustomerID (the plan already shows one) and keep its statistics up to date. The join/update/insert logic uses equality on CustomerID, so that key/index is essential.
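
A quick way to confirm the clustered key and refresh statistics (table names assumed from the procedure above):

-- Confirm dbo.Customer has a clustered index (expected: the PK on CustomerID).
SELECT i.name, i.type_desc, i.is_primary_key
FROM sys.indexes AS i
WHERE i.object_id = OBJECT_ID(N'dbo.Customer')
  AND i.type_desc = 'CLUSTERED';

-- Refresh statistics on both tables involved in the joins.
UPDATE STATISTICS dbo.Customer;
UPDATE STATISTICS dbo.CustomerStaging;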

CustomerStaging: a narrow Processed column plus a nonclustered index on (Processed, StagingID) helps find unprocessed rows quickly and supports the StagingID ordering used by the ROW_NUMBER evaluation. Example:

CREATE INDEX IX_CustomerStaging_Processed_StagingID
    ON dbo.CustomerStaging (Processed, StagingID)
    INCLUDE (CustomerID, Name, Email, City);
