Wednesday, May 6, 2026

Three T-SQL Tricks You May Not Be Using Yet

Three T-SQL features that have shipped over the last few releases and quietly retired patterns many of us are still using out of habit. Each replaces a stale workaround with one line of code, and in two of three cases it runs much faster, too. Take a look, try them out.

1. APPROX_COUNT_DISTINCT -- when COUNT(DISTINCT) is too expensive

Available since SQL Server 2019. COUNT(DISTINCT col) on a high-cardinality column requires SQL Server to track every unique value it has seen, which means a memory grant proportional to cardinality and a hash aggregate that loves to spill to tempdb on big tables. APPROX_COUNT_DISTINCT uses HyperLogLog instead, which trades a small accuracy hit for dramatically lower memory and faster runs.

Haven't heard of HyperLogLog before? Short story: HyperLogLog (HLL) in SQL Server is used to provide extremely fast, approximate counts of unique values (cardinality) in massive datasets using very little memory. Implemented with the APPROX_COUNT_DISTINCT function, allowing analytics on billions of rows with a roughly 2% error margin, significantly faster than COUNT(DISTINCT). Pretty cool. I use this a lot in trading data.

Build a five-million-row table with a few hundred thousand distinct customers (takes a few seconds on a laptop):

WITH N AS
(
    SELECT TOP (5000000)
        OrderID = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM sys.all_objects a
    CROSS JOIN sys.all_objects b
    CROSS JOIN sys.all_objects c
)
SELECT
    OrderID,
    CustomerID = ABS(CHECKSUM(NEWID())) % 250000,
    OrderDate = DATEADD(SECOND, OrderID, '2020-01-01'),
    Amount = CAST(OrderID % 1000 AS DECIMAL(10,2))
INTO dbo.OrdersDemo
FROM N;

CREATE CLUSTERED INDEX IX_OrdersDemo_OrderID ON dbo.OrdersDemo (OrderID);

The old way:

SET STATISTICS TIME, IO ON;

SELECT COUNT(DISTINCT CustomerID) AS UniqueCustomers
FROM dbo.OrdersDemo;

The new way:

SELECT APPROX_COUNT_DISTINCT(CustomerID) AS UniqueCustomers
FROM dbo.OrdersDemo;

SET STATISTICS TIME, IO OFF;

Run both, compare CPU time and elapsed time, or even pull the actual execution plan to compare the memory grant on the root operator. You can clearly see, APPROX_COUNT_DISTINCT finishes noticeably faster with a smaller grant.

And the memory grant...

Don't use it for anything that has to be exact -- billing, regulatory counts, reconciliation. Do use it for the dozens of dashboards and exploratory queries where 'about a million' is the same answer as 'exactly 1,003,847'.

2. STRING_AGG -- retire the FOR XML PATH('') hack

Available since SQL Server 2017. Many shops still use legacy code building delimited strings with FOR XML PATH('') and STUFF(). It worked, technically. It was also suboptimal performance.

Setup -- a tiny order schema with a few items per order:

CREATE TABLE dbo.Products
(
    ProductID    INT PRIMARY KEY,
    ProductName  NVARCHAR(50)
);

CREATE TABLE dbo.Orders
(
    OrderID      INT PRIMARY KEY,
    CustomerID   INT
);

CREATE TABLE dbo.OrderItems
(
    OrderID    INT,
    ProductID  INT
);

INSERT dbo.Products VALUES
(1, 'Coffee'), (2, 'Tea'), (3, 'Cookies'), (4, 'Cake'), (5, 'Sandwich');

INSERT dbo.Orders VALUES
(1001, 50), (1002, 50), (1003, 51), (1004, 52);

INSERT dbo.OrderItems VALUES
(1001, 1), (1001, 3),
(1002, 2), (1002, 4), (1002, 5),
(1003, 1), (1003, 2),
(1004, 5);

The old way:

SELECT
    o.OrderID,
    o.CustomerID,
    Products = STUFF((
        SELECT ', ' + p.ProductName
        FROM dbo.OrderItems oi JOIN dbo.Products p 
          ON oi.ProductID = p.ProductID
        WHERE  oi.OrderID = o.OrderID
        FOR XML PATH(''), TYPE
    ).value('.', 'NVARCHAR(MAX)'), 1, 2, '')
FROM dbo.Orders o
ORDER BY o.OrderID;

The new way:

SELECT
    o.OrderID,
    o.CustomerID,
    Products = STRING_AGG(p.ProductName, ', ')
                 WITHIN GROUP (ORDER BY p.ProductName)
FROM dbo.Orders o JOIN dbo.OrderItems oi 
  ON o.OrderID = oi.OrderID JOIN dbo.Products p
    ON oi.ProductID = p.ProductID
GROUP BY o.OrderID, o.CustomerID
ORDER BY o.OrderID;

Both produce the same resultset:

But look at that implicit conversion, larger plan cache and higher subtree cost in one vs the other:

3. GENERATE_SERIES -- kill the homemade tally CTE

This one was introduced in SQL Server 2022. Every DBA has copy-pasted some variant of a recursive CTE to generate a sequence of numbers or dates. The classic use case: a date-spine report that fills in zeros for days with no activity. GENERATE_SERIES replaces it with one function call.

Setup -- a sales table with deliberately sparse dates:

CREATE TABLE dbo.DailySales
(
    SaleDate DATE,
    Amount   DECIMAL(10,2)
);

INSERT dbo.DailySales VALUES
('2026-01-02', 1200.00),
('2026-01-03',  450.00),
('2026-01-05', 2300.00),
('2026-01-08',  890.00),
('2026-01-10', 1100.00);

The old way -- recursive CTE for the date spine:

DECLARE @StartDate DATE = '2026-01-01';
;WITH Days AS
(
    SELECT @StartDate AS TheDate, 0 AS n
    UNION ALL
    SELECT DATEADD(DAY, n + 1, @StartDate), n + 1
    FROM Days
    WHERE n < 9
)
SELECT
    d.TheDate,
    Amount = ISNULL(s.Amount, 0)
FROM Days d LEFT JOIN dbo.DailySales s 
  ON s.SaleDate = d.TheDate
ORDER BY d.TheDate
OPTION (MAXRECURSION 0);

The new way:

DECLARE @StartDate DATE = '2026-01-01';
SELECT
    TheDate = DATEADD(DAY, gs.value, @StartDate),
    Amount = ISNULL(s.Amount, 0)
FROM GENERATE_SERIES(0, 9) gs LEFT JOIN dbo.DailySales s
  ON s.SaleDate = DATEADD(DAY, gs.value, @StartDate)
ORDER BY TheDate;

Both produce the same ten-row date spine with zeros filled in for the empty days, but compare the two constructs. With GENERATE_SERIES we've got no CTE, no MAXRECURSION and no helper table. GENERATE_SERIES accepts (start, stop) or (start, stop, step), supports negative steps, and returns a single column called value. The Optimizer treats it as a normal table-valued function and can parallelize work that depends on it -- something the recursive CTE cannot do.

Why we are still writing the old versions

Full disclosure: this post happened because I caught myself reaching for a recursive CTE last night before remembering the GENERATE_SERIES improvement. Habits stick. But they can be broken easily.

The recursive tally CTE was the first 'clever' T-SQL many of us wrote. The FOR XML PATH trick was THE Stack Overflow answer for a decade. COUNT(DISTINCT) still gets typed reflexively when 'roughly how many' would do.

None of these replacements are new. STRING_AGG turned eight this year. APPROX_COUNT_DISTINCT shipped in 2019, and GENERATE_SERIES has been around since 2022. Time to check them out.

More to Read

Microsoft Learn: APPROX_COUNT_DISTINCT
Microsoft Learn: STRING_AGG
Microsoft Learn: GENERATE_SERIES

No comments:

Post a Comment