Thursday, January 29, 2026

Which Indexes Are Missing? — 2025 Edition

Fifteen years ago, I wrote a little post about the missing index DMVs. Six major SQL Server releases later, it's time for an update.

The Old Query: Still Kicking

Here's the gist of what we had back then:

SELECT 
    avg_user_impact average_improvement_percentage, 
    avg_total_user_cost average_cost_of_query_without_missing_index, 
    'CREATE INDEX idx_' + [statement] +  
    ISNULL(equality_columns, '_') +
    ISNULL(inequality_columns, '_') + ' ON ' + [statement] + ' (' + ISNULL(equality_columns, ' ') + 
    ISNULL(inequality_columns, ' ') + ')' + ISNULL(' INCLUDE (' + included_columns + ')', '') create_missing_index_command
FROM sys.dm_db_missing_index_details a INNER JOIN sys.dm_db_missing_index_groups b
  ON a.index_handle = b.index_handle INNER JOIN sys.dm_db_missing_index_group_stats c
    ON b.index_group_handle = c.group_handle
WHERE avg_user_impact >= 40

This still runs fine. If you're on SQL Server 2005-2017, or you just need a quick-and-dirty list, it gets the job done.

But it has a couple blind spots:

  • You see suggestions, but not which queries need them
  • No idea if the suggestion is from current or stale statistics

What's New Since 2010

The Big One: sys.dm_db_missing_index_group_stats_query

Introduced in v2019, this links missing index suggestions to actual queries via query_hash. No more guessing which query is crying for help — you can see it directly.

Freshness Check: last_user_seek and last_user_scan

The freshness columns -- last_user_seek and last_user_scan -- were always there, but I didn't include them in my 2010 post. I should have. They tell you exactly when the optimizer last wanted that index. A suggestion based on current stats is very different from one based on stale statistics.
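If you want that freshness check on its own, here's a minimal sketch using the same DMVs, scoped to the current database:

SELECT 
    d.[statement] table_name,
    s.last_user_seek,
    s.last_user_scan,
    DATEDIFF(DAY, ISNULL(s.last_user_seek, s.last_user_scan), GETDATE()) days_since_last_request
FROM sys.dm_db_missing_index_details d INNER JOIN sys.dm_db_missing_index_groups g
  ON d.index_handle = g.index_handle INNER JOIN sys.dm_db_missing_index_group_stats s
    ON g.index_group_handle = s.group_handle
WHERE d.database_id = DB_ID()
ORDER BY days_since_last_request;

Anything that hasn't been requested in weeks probably isn't worth building today.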

Cleaner T-SQL

We've got CONCAT, CONCAT_WS, and FORMAT now. No more nested ISNULL chains.

The 2025 Query

/*
    Missing Index Analysis
    sqlfingers.com | 2025
    Requires: SQL Server 2019+
*/

SELECT 
    ROW_NUMBER() OVER (
        ORDER BY s.avg_total_user_cost * s.avg_user_impact * (s.user_seeks + s.user_scans) DESC) priority,
    CAST(s.avg_user_impact AS INT) [impact_%],
    FORMAT(s.user_seeks, 'N0') seeks,
    FORMAT(s.user_scans, 'N0') scans,
    FORMAT(s.last_user_seek, 'yyyy-MM-dd HH:mm') last_seek,
    DB_NAME(d.database_id) db,
    OBJECT_SCHEMA_NAME(d.object_id, d.database_id) [schema],
    OBJECT_NAME(d.object_id, d.database_id) [table],
    d.equality_columns,
    d.inequality_columns,
    d.included_columns,
    CONCAT(
        'CREATE NONCLUSTERED INDEX [IX_',
        OBJECT_NAME(d.object_id, d.database_id), '_',
        REPLACE(REPLACE(REPLACE(
            CONCAT_WS('_', d.equality_columns, d.inequality_columns),
            '[', ''), ']', ''), ', ', '_'),
        '] ON [', OBJECT_SCHEMA_NAME(d.object_id, d.database_id), '].[', 
        OBJECT_NAME(d.object_id, d.database_id), '] (',
        CONCAT_WS(', ', d.equality_columns, d.inequality_columns), ')',
        IIF(d.included_columns IS NOT NULL, 
            CONCAT(' INCLUDE (', d.included_columns, ')'), ''),
        ';'
    ) create_statement
FROM sys.dm_db_missing_index_details d INNER JOIN sys.dm_db_missing_index_groups g 
  ON d.index_handle = g.index_handle INNER JOIN sys.dm_db_missing_index_group_stats s 
    ON g.index_group_handle = s.group_handle
WHERE d.database_id = DB_ID()
ORDER BY s.avg_total_user_cost * s.avg_user_impact * (s.user_seeks + s.user_scans) DESC;

Bonus: Find the Query in Need

This one is fun. We can use sys.dm_db_missing_index_group_stats_query to see exactly which query wants the index:

SELECT 
    CAST(sq.avg_user_impact AS INT) [impact_%],
    OBJECT_NAME(d.object_id, d.database_id) [table],
    d.equality_columns,
    d.inequality_columns,
    LEFT(t.query_sql_text, 150) query_snippet
FROM sys.dm_db_missing_index_details d INNER JOIN sys.dm_db_missing_index_groups g 
  ON d.index_handle = g.index_handle INNER JOIN sys.dm_db_missing_index_group_stats_query sq 
    ON g.index_group_handle = sq.group_handle INNER JOIN sys.query_store_query q 
      ON sq.query_hash = q.query_hash INNER JOIN sys.query_store_query_text t 
        ON q.query_text_id = t.query_text_id
WHERE d.database_id = DB_ID()
ORDER BY sq.avg_user_impact DESC;

Now we're not just seeing such-and-such index on such-and-such table. We get the actual SELECT statement that is suffering. Pure gold.

Quick Caveats

  • DMV stats reset on restart — let your server run a full business cycle before trusting these numbers
  • Not every suggestion is gold — the optimizer doesn't consider write overhead or maintenance cost. You must always fully review and test the recommended indexes rather than trusting them blindly.
  • Look for overlaps — if you see 3 suggestions for the same table, you might consolidate into 1

More to Read:

Original 2010 Post: Which indexes are missing?
Microsoft Docs: sys.dm_db_missing_index_group_stats_query

Happy indexing.

SQL Server Cloud Permissions: Management Plane vs Data Plane

My last post ended with a promise to explain the management plane vs. data plane split in practical terms. The short story? There are two separate permission systems — one for managing resources, one for accessing data — and they don't talk to each other. The good news: you already understand this model. You just don't recognize it yet.

The Pattern You Already Know

In SQL Server, we have server-level permissions and database-level permissions. A login can exist at the server level but have zero access to any database. A user can exist in a database but not map to any login. They are separate layers, and you need both configured correctly for anyone to be able to do anything.
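If it helps to see that split in T-SQL, here's a minimal sketch (the login, database, and user names are made up):

-- Server level: the login exists...
CREATE LOGIN AppLogin WITH PASSWORD = 'S0meStr0ngP@ssw0rd!';

-- ...but until a user is created inside a database and mapped to that login,
-- it can't read a single row:
USE SalesDB;
CREATE USER AppLogin FOR LOGIN AppLogin;
ALTER ROLE db_datareader ADD MEMBER AppLogin;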

Cloud works the same way — just with different names and a portal that hides all the pieces.

SQL Server Cloud Equivalent What It Controls
Server-level Management plane Can you see it? Configure it?
Database-level Data plane Can you use it? Read/write data?

When someone says 'I have Contributor access but can't read the blob' — that's like having a SQL Server login with no database user mapped. You can connect to the instance, but you can't touch the data. Different layer, same pattern.

Why the Naming Hurts

The problem isn't the model. The problem is the names.

In SQL Server, db_datareader lets you read data. db_datawriter lets you write data. These names make sense. In Azure, Contributor sounds like it should let you contribute... something. Maybe data? Nope. It lets you contribute configuration changes to the resource. Want to contribute actual data? You need Storage Blob Data Contributor.

Here's the mental model that helped me:

Management plane roles answer: 'What can you do to this resource?'
Create it, delete it, configure it, see it in the portal, change its settings.

Data plane roles answer: 'What can you do with this resource?'
Read files, write data, execute queries, upload documents.

The Practical Combinations

Let's get specific. For Azure Blob Storage — the scenario from my last post — here's what each combination actually gets you:

ARM Role Data Role Result
Reader none Can see storage account in portal, no data access
none Storage Blob Data Contributor Can read/write via CLI or API, invisible in portal
Reader Storage Blob Data Reader Can see in portal, read-only data access
Reader Storage Blob Data Contributor Can see in portal, full data access
Contributor none Can configure storage account, no data access
Contributor Storage Blob Data Contributor Full configuration + full data access

Notice the pattern: Reader + Storage Blob Data Contributor is often the sweet spot for end users. They can find the resource and work with the data, but they can't accidentally reconfigure or delete the storage account itself.

Azure SQL Has the Same Split

This isn't just a storage thing. Azure SQL Database has the same model:

Layer Role Example What It Does
Management SQL DB Contributor Manage servers/databases in portal, firewall rules, scaling
Data Entra ID user + db role Connect and run queries, read/write data

A user with SQL DB Contributor can manage databases in the portal — change performance tiers, adjust backup retention — but if they're not added as a user inside the database, they can't run SELECT * FROM Customers. Same pattern. Management vs Data.
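And the data plane half is still plain T-SQL, run inside the database. A rough sketch (the account name is hypothetical):

-- Inside the Azure SQL database -- this is the data plane piece:
CREATE USER [jane.doe@contoso.com] FROM EXTERNAL PROVIDER;
ALTER ROLE db_datareader ADD MEMBER [jane.doe@contoso.com];

No amount of SQL DB Contributor will substitute for those two statements.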

The Five-Second Diagnostic

When someone says "I have the right permissions but I'm getting Access Denied", ask two questions:

1. Can they see the resource in the portal?
No → Missing management plane role (e.g., Reader)

2. Can they see it but not use it?
Yes → Missing data plane role (e.g., Storage Blob Data Contributor)

That's it. You've just diagnosed 80% of the cloud 'Access Denied' mysteries a SQL Server team will ever hit.

One More Thing: Scope Matters

Roles can be assigned at different scopes: management group, subscription, resource group, or the individual resource. When troubleshooting, always check where the role is assigned, not just what role is assigned. The portal's Access Control (IAM) blade shows inherited permissions, which helps. Data plane roles do inherit down the hierarchy, but container-level assignments and the management/data split can still produce surprises.

The Takeaway

Management plane vs Data plane isn't a new concept. It's server-level vs database-level with cloud branding. Once you see it that way, it really minimizes the confusion.

Next Up: Minimum permission sets for common DBA tasks

More to Read:

Microsoft: Assign Azure roles using the Azure portal
Microsoft: Choose how to authorize access to blob data in the Azure portal
Microsoft: Azure SQL Database security overview

Wednesday, January 28, 2026

Microsoft's January 2026 Patch Tuesday: CVE-2026-20803: SQL Server Privilege Escalation

Microsoft's January 2026 Patch Tuesday included a security fix for SQL Server: CVE-2026-20803, an elevation of privilege vulnerability with CVSS score 7.2 (Important).

The vulnerability is classified as CWE-306: Missing Authentication for Critical Function. An attacker who already has high-level privileges on the SQL Server instance could exploit this flaw to escalate further — gaining debugging privileges, dumping system memory, and potentially extracting sensitive data or credentials.

The Details

Attribute Value
CVE ID CVE-2026-20803
Severity Important (CVSS 7.2)
Attack Vector Network
Attack Complexity Low
Privileges Required High
User Interaction None
Impact High (Confidentiality, Integrity, Availability)
Exploitability Exploitation Less Likely

The CVSS vector is: CVSS:3.1/AV:N/AC:L/PR:H/UI:N/S:U/C:H/I:H/A:H

What Does This Mean in Practice?

This is not a 'random attacker on the internet' vulnerability. The attacker needs high-level privileges to begin with — think a compromised service account, an insider threat, or lateral movement from an already-breached system.

Once exploited, the attacker could gain debugging access and dump memory — which could expose connection strings, credentials, decrypted data in memory, and other sensitive information that shouldn't be accessible even to privileged users.

The 'Exploitation Less Likely' rating means Microsoft hasn't seen active exploitation and doesn't expect it to be trivially weaponized. But 'less likely' is not 'impossible'.

Which Versions Are Affected?

SQL Server 2022 and SQL Server 2025 are affected. Microsoft has released patches through both the GDR (security-only) and CU (cumulative update) tracks.

Version Update Track KB Patched Build
SQL Server 2025 GDR KB5073177 17.0.1050.2
SQL Server 2022 CU22 + GDR KB5072936 16.0.4230.2
SQL Server 2022 GDR KB5073031 16.0.1165.1

GDR vs. CU: Which Patch Do I Need?

If you're not familiar with Microsoft's SQL Server servicing model:

  • GDR (General Distribution Release) — Security fixes only. Minimal changes. Use this if you're on a baseline RTM build and want to stay conservative.
  • CU (Cumulative Update) — Security fixes plus all other bug fixes and improvements. Use this if you're already on the CU train.

Check your current build against the table above and review the KB articles for the correct patch:

SELECT @@VERSION;

How to Patch

Download the appropriate update from the links above or from the Microsoft Update Catalog.

Before patching:

  • Back up your databases
  • Test in a non-production environment first
  • Schedule a maintenance window — the SQL Server service will restart

After patching, verify the new build number:

SELECT 
    SERVERPROPERTY('ProductVersion') AS ProductVersion,
    SERVERPROPERTY('ProductLevel') AS ProductLevel,
    SERVERPROPERTY('Edition') AS Edition;

Should You Panic?

No. The attack requires high privileges to begin with, and there's no evidence of active exploitation. But 'exploitation less likely' is not the same as 'never going to happen'.

Patch during your next maintenance window. Don't wait six months.

If your SQL Server instances are exposed to the internet or if you have concerns about insider threats or compromised service accounts, you may wish to prioritize this patch.

A Note on Timing

This security patch (released January 13) is separate from CU23 for SQL Server 2022 and CU1 for SQL Server 2025 (both released January 15). Those cumulative updates have their own problems — specifically, a Database Mail bug that broke alerting.

The security patches (KB5073177, KB5072936, KB5073031) do not have the Database Mail bug. If you need to patch for CVE-2026-20803 but can't risk breaking Database Mail, use the GDR security update rather than the CU.

More to Read:

Microsoft Security Response Center: CVE-2026-20803
KB5073177: Security Update for SQL Server 2025 GDR
KB5073031: Security Update for SQL Server 2022 GDR
KB5072936: Security Update for SQL Server 2022 CU22+GDR

Database Mail Bug Forces Microsoft to Pull SQL Server 2025 CU1... and 2022 CU23

Microsoft pulled both SQL Server 2025 CU1 and SQL Server 2022 CU23 last week. The reason: Database Mail is completely broken after installing either update.

Could not load file or assembly 'Microsoft.SqlServer.DatabaseMail.XEvents, 
Version=17.0.0.0, Culture=neutral, PublicKeyToken=89845dcd8080cc91' 
or one of its dependencies. The system cannot find the file specified.

If you've already installed one of these updates, your alert emails aren't going out. Your job failure notifications are silent. That monitoring you set up years ago? Dead.

Which Updates Are Affected?

Version Broken Update Status
SQL Server 2025 CU1 (KB5074901) Pulled from download
SQL Server 2022 CU23 (KB5074819) Pulled from download

The RTM builds of both versions are fine. If you're on SQL Server 2025 RTM or SQL Server 2022 CU22 or earlier, you're not affected — as long as you don't update.

How to Check If You're Affected

Run this to check your current build:

SELECT @@VERSION;

Or just run this to test your Database Mail:

EXEC msdb.dbo.sp_send_dbmail
    @profile_name = 'YourMailProfile',
    @recipients = 'youraddress@example.com',
    @subject = 'Database Mail Test',
    @body = 'Testing after CU install';

If it fails with the assembly error above, you've been hit.

The Fix: Uninstall the CU

Microsoft's official guidance is blunt: "If you already installed this update, uninstall it to restore Database Mail functionality."

To uninstall a Cumulative Update:

Windows Server

Go to Control Panel → Programs and Features → View installed updates. Find the SQL Server CU, right-click, and uninstall. You'll need to restart the SQL Server service (and possibly the server).

Microsoft has detailed uninstall instructions here: Uninstall a Cumulative Update from SQL Server

Emergency Workaround (If You Can't Uninstall Yet)

If you're stuck on the broken CU and can't perform an immediate uninstall, there's a community workaround floating around. Some DBAs have reported success copying three Database Mail binaries from a working (pre-CU) instance:

From a working SQL Server 2022 instance (pre-CU23), copy these files from C:\Program Files\Microsoft SQL Server\MSSQL16.MSSQLSERVER\MSSQL\Binn:

  • DatabaseMail.exe
  • DatabaseMailengine.dll
  • DatabaseMailprotocols.dll

Replace the same files on the broken instance. Restart the SQL Server Agent service.

I haven't personally tested this workaround, and it's not officially supported. Use at your own risk. The proper fix is to uninstall the CU.

Check Your Unsent Mail Queue

If Database Mail was broken for any length of time, you may have queued messages that never went out. Check with:

SELECT 
    mailitem_id,
    profile_id,
    recipients,
    subject,
    send_request_date,
    sent_status,
    sent_date
FROM msdb.dbo.sysmail_allitems
WHERE sent_status <> 'sent'
ORDER BY send_request_date DESC;

After you restore Database Mail functionality, these queued messages should attempt to send. If they don't, you may need to manually resend critical notifications.
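If you do end up resending something by hand, a rough sketch looks like this (the mailitem_id and profile name are placeholders; eyeball the original content before firing it off):

DECLARE @recips  VARCHAR(MAX),
        @subj    NVARCHAR(255),
        @msgbody NVARCHAR(MAX);

SELECT @recips = recipients, @subj = subject, @msgbody = body
FROM msdb.dbo.sysmail_allitems
WHERE mailitem_id = 12345;    -- the failed item you want to resend

EXEC msdb.dbo.sp_send_dbmail
    @profile_name = 'YourMailProfile',
    @recipients   = @recips,
    @subject      = @subj,
    @body         = @msgbody;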

The Bigger Picture

This is CU1 for SQL Server 2025. The product went GA in November. Two months later, the first cumulative update breaks a core feature that's been in the product for over 20 years.

As one commenter on Brent Ozar's blog put it: "I started skeptical yet secretly hopeful MS would get its act together for this version, but my days of being unimpressed with Nadella's Microsoft quality standards are coming to a middle."

The lesson here isn't new, but it bears repeating: test your CUs before deploying to production. Even CU1. Especially CU1.

Microsoft will presumably release a fix soon. Until then, hold off on both SQL Server 2025 CU1 and SQL Server 2022 CU23 if you use Database Mail. Which is, I think, most of us.

More to Read:

Microsoft: SQL Server 2025 CU1 (KB5074901) — includes Database Mail warning
Microsoft: SQL Server 2022 CU23 (KB5074819)
Brent Ozar: SQL Server 2025 CU1 is Off to a Rough Start
Microsoft Q&A: DatabaseMail.exe fails after SQL Server 2022 CU22 to CU23
Microsoft: How to Uninstall a Cumulative Update

Tuesday, January 27, 2026

Cloud Security for SQL Server: The Complexity Nobody Warned You About

Cloud-based SQL Server security isn't simpler. It's different — and the learning curve is brutal if you grew up on-prem.

If you've spent years working with Windows authentication, SQL logins, role memberships, and the occasional certificate, you may assume cloud security is just more of that, with a portal on top. It isn't. The SQL Server and cloud permission models are layered differently, the terminology shifts depending on the platform, and the people who 'own' security are spread across teams that don't always speak the same language.

The Problem: Two Permission Systems, One Confusing Experience

The core issue exists across all major cloud platforms: There is a split between management plane permissions (who can see and configure resources in the portal) and data plane permissions (who can actually read/write the data).

Platform Mgmt Plane Data Plane
Azure ARM roles Storage Data roles
AWS IAM policies Resource/bucket policies
GCP Cloud IAM Data access scopes

The result is the same everywhere: a user can have 'Contributor' access to a resource and still get 'Access Denied' when they try to use it. The terminology varies; the confusion is universal.

And it gets worse. Identity sync delays can mean waiting hours for permission changes to take effect. Role inheritance doesn't behave like on-prem Active Directory. Informal titles like 'tenant admin' or 'cloud admin' mean different things to different organizations. And sometimes even different things to different people in the same organization.

Microsoft's own documentation acknowledges that 'Achieving Separation of Duties (SoD) is challenging for security-related or troubleshooting tasks'. That's not a criticism. It's an honest admission that the model is complex, even for the people who built it.

A Real-World Example

A customer reached out with what seemed like a simple ask: "We have a new Blob Storage account and we can't assign permissions so I can add documents to it."

Their security admin had created the storage account and container. He tried to grant the requesting user access, but it didn't work. The user couldn't see it in the portal, let alone add documents. They had a tight timeline — this was for a proof-of-concept analytics portal and they needed it sorted fast.

My first question: "When you say he 'was not able to grant access' does that mean he tried and it failed, or he granted it and you still can't see anything?"

The answer: "It was more the former. He was trying to add the role assignments and it didn't work as we expected."

Then my customer said: "John (name changed) has tenant admin to Azure, so I don't think it will be a limitation on his side."

Here's the problem with that assumption.

What 'Tenant Admin' Actually Means

If by 'tenant admin' they meant Global Administrator in Entra ID (formerly Azure AD), that role gives full control over users in the directory. It does not automatically grant permission to manage Azure resources like a storage account. To do that, the Global Admin would need to perform an 'elevation' step to get User Access Administrator at the root scope, or someone would need to explicitly assign them Owner on the resource.

More likely, though, is that they just hadn't assigned the right combination of roles.

To successfully access a blob container in the Azure portal, a user needs permissions in two distinct systems:

Role Type What It Does
ARM Role (e.g., Reader) Lets you see the storage account in the portal and navigate to its resources
Storage Data Role (e.g., Storage Blob Data Contributor) Lets you read/write/delete data within the blob container

If you only assign Reader, they can see it but can't touch the data. If you only assign Storage Blob Data Contributor, the user can't see the storage account in the portal. You need both.

We got on a call with the customer, their cloud admin, and our cloud expert. Within minutes, our cloud guy found the hole — a missing role assignment — applied it in the portal, and the problem resolved immediately.

Cause? They just didn't know about the dual-permission model. And why would they? It's not very intuitive. The terminology is inconsistent. And 'Contributor' sure sounds like it would let you contribute.

Even the Experts Draw a Line

Here's something worth noting: some of the most respected SQL Server voices in the industry explicitly exclude security and permissions work from their practice contractually. Not because they can't learn it or don't know it -- but because the complexity and risk warrant specialization. When top-tier SQL Server experts are steering clear of security questions entirely, that tells you something about the depth of this space.

The idea that 'the cloud handles security for you' is a widespread, largely misleading myth. Cloud platforms remove some infrastructure tasks — automated backups, patching, failover — but they don't validate that your permission model makes sense. They don't catch the gap between what you think a role grants and what it actually grants. And they certainly don't ensure your DBA and cloud teams are aligned on who can do what.

Where Does the DBA Fit?

Cloud security is most definitely not DBA-only territory, but DBAs should understand it well enough to:

• Recognize when something is a permission issue vs a configuration issue

• Speak the same language as their cloud teams

• Catch issues that slip through the DBA to Cloud Team handoff

The most effective DBAs I've worked with aren't the ones who know everything about cloud IAM. They're the ones who know enough to ask the right questions — and have good communication with their cloud counterparts so the questions get answered.

Coming Next

This post is just a primer. An acknowledgment that the problem exists and that you're not alone if you've been frustrated by it. In my follow-up, I'll provide actionable guidance for DBAs navigating cloud security alongside their cloud teams, including:

• Understanding the management plane vs. data plane split in practical terms

• Minimum permission combinations for common DBA tasks

• Necessary permission combinations for common user tasks at the frontend

• How to diagnose 'Access Denied' when the portal says you have access

• A checklist for DBAs before handing off to cloud teams (and vice versa)

What's Tripped You Up?

I'd love to hear from you. What cloud security issues have caught you off guard? What did you learn in your first SQL Server+Cloud 'Access Denied' surprise? Drop a comment if you can. Your pain points will help shape the follow-up.

More to Read:

Microsoft: Azure SQL Database Security Best Practices
Microsoft: Assign an Azure role for access to blob data
Azure SQL and Entra ID authentication, tips from the field

Thursday, January 22, 2026

SQL Server 2025: Current State

SQL Server 2025 went GA on November 18th. CU1 dropped on January 15th. Two months in production. Time to check the scoreboard.

I dug through Microsoft's known issues, read what the experts are saying, and tested a few things myself. These are my thoughts on the good, the broken, and where I'm on hold.

CU1: Don't Install It (Yet)

CU1 (KB5074901) landed a week ago with 16 fixes. It also breaks Database Mail.

Could not load file or assembly 'Microsoft.SqlServer.DatabaseMail.XEvents, 
Version=17.0.0.0' or one of its dependencies. 
The system cannot find the file specified.

Microsoft's official guidance: "If you use Database Mail and already downloaded this update, don't install it until a fix is available. If you already installed this update, uninstall it."   Nice. Welcome to CU1.


What Brent Ozar Says

Brent published his known issues roundup in December. His take: "Knock on wood, this is shaping up to be one of the better, more reliable releases so far."

High praise from someone who's watched a lot of SQL Server launches. His highlights:

Query Store + PSPO (Parameter Sensitive Plan Optimization) on readable secondaries = access violations. If you enable Query Store on readable secondaries without disabling PSPO, you can crash the instance. Fix: disable PSPO.
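The mitigation is a database scoped configuration. A sketch (SQL Server 2022+ syntax; run it in the affected database and test first):

ALTER DATABASE SCOPED CONFIGURATION SET PARAMETER_SENSITIVE_PLAN_OPTIMIZATION = OFF;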

Auditing won't write to the security log. Workaround: write to a file instead. Or, as Brent puts it, "if you need your auditing to be legally defensible, you need to use a third party appliance that sits in between SQL Server and the rest of the network, capturing all network packets." Agreed.

Full-text won't index documents larger than 25MB. Registry edit required to lift the limit.

Won't install without TLS 1.2. If your sysadmins disabled TLS 1.2 everywhere because 'it's deprecated', you'll need to have a discussion.

Won't install with more than 64 cores per CPU. Virtualization workaround: configure more sockets, fewer cores per socket.

PowerShell breaks with strict encryption. The irony: people proactive enough to use PowerShell are also the ones who enforce strict encryption.

SQL Auth logins are slower. SQL Server 2025 hashes SQL auth passwords with PBKDF2 -- 100,000 iterations of SHA-512. That's a genuine security improvement and it lines up with NIST guidance, but every SQL auth login now carries roughly 150ms of extra overhead. If you're using connection pooling, you may never notice. If you're not pooling and you monitor login latency, it will surface. VladDBA did the math: brute-forcing a hash that took 45 seconds on 2022 now takes an estimated 154,008 seconds. This is not a small point.

SQLNCLI is gone. Microsoft followed through on a deprecation. The replacement, MSOLEDBSQL, enforces strict TLS certificate validation by default. If your linked servers were set up years ago with self-signed certs and 'trust me' settings, they will break. Options: fix your certificates properly, use TrustServerCertificate=Yes (less secure), or use sysadmin logins / Kerberos.
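A quick way to find linked servers still pointing at the removed provider (a minimal sketch):

SELECT name, provider, data_source
FROM sys.servers
WHERE is_linked = 1
  AND provider LIKE 'SQLNCLI%';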


Aaron Bertrand's Upgrade Surprises

Aaron Bertrand shared three gotchas from real-world upgrades. None prevented the upgrade from completing. All could ruin your day.

1. Setup restarts your instance during the health check. Not during the upgrade, but during the pre-check. If you like to prep everything and click 'Upgrade' at an exact moment, you can't do this anymore. The moment you proceed past a certain point, your instance goes down. Aaron's words: "I can no longer advocate doing that as it is more intrusive than it used to be."

2. Full-text indexes must be rebuilt. If you don't, your full-text queries fail. There's a workaround — set FULLTEXT_INDEX_VERSION = 1 and copy legacy binaries from an older instance — but that's a bandaid. Plan rebuild time. One commenter on Brent's post reported that rebuilding crashes the instance with a memory dump. They have an open case with Microsoft.

3. Linked servers fail with TLS errors. Same MSOLEDBSQL issue as above. Same fix options. Plan ahead.


Vector Search: Cool, But Not Production-Ready

This is the headline feature. Microsoft wants SQL Server to be your vector database. Semantic queries without leaving T-SQL -- but it's not ready yet. Here's what the fine print tells us:

Still in preview. Even though 2025 is GA, you must enable PREVIEW_FEATURES to use vector indexes.

Table goes read-only. No inserts, no updates while the index exists. Data changes require dropping the index, modifying data, and rebuilding manually.

Predicates apply AFTER vector search. You can't filter first, then search. It searches everything, then filters. Scale implications are significant.

Other limitations: MAXDOP is ignored. Requires single-column INTEGER primary key. No partitioning. Not replicated to subscribers. DiskANN limited to Standard/Enterprise.

The read-only limitation alone is a showstopper for most production scenarios. Test it. Learn it. Don't go into it hastily.
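If you do want to kick the tires in a sandbox, enabling the feature is a database scoped configuration. A sketch, per the preview documentation:

ALTER DATABASE SCOPED CONFIGURATION SET PREVIEW_FEATURES = ON;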

This is where I'm on hold.


The Bottom Line

Two months in, SQL Server 2025 is okay. Better than some launches. No catastrophic data corruption bugs. No hard stop recalls. The known issues are annoying but manageable — except CU1. Don't install CU1. Yet.

If you're planning an upgrade:

  • Test in lower environments first. Aaron's team found all three issues before hitting production.
  • Check your linked servers. SQLNCLI is gone. TLS validation is enforced.
  • Inventory your full-text catalogs. Plan rebuild time.
  • Start your maintenance window early. Setup is more intrusive now.
  • Give it a couple more CUs. Let someone else find the next Database Mail.

Stay sharp. Verify everything. Trust nothing blindly.


What the LLM Doesn't Know About Your Database

Copilot has read more SQL documentation than you ever will. It has ingested Stack Overflow, Microsoft Learn, thousands of GitHub repos, and probably all those posts you bookmarked in 2019 and never finished reading.

It still doesn't know your database.

And that gap — between what an LLM learned and what it knows about you — is where things get interesting.

What the Training Data Contains

Large Language Models learn by processing massive datasets — the 'training data'. For coding assistants like Copilot, that includes:

  • Public documentation (Books Online, Microsoft Learn, Oracle docs)
  • Open-source repositories (GitHub, GitLab)
  • Q&A sites (Stack Overflow, DBA Stack Exchange)
  • Blog posts, tutorials, articles
  • Sample databases (AdventureWorks, Northwind, WideWorldImporters)

This is genuinely impressive. Ask Copilot to write a query that calculates running totals with a window function, and it'll nail it. Ask it to explain the difference between SNAPSHOT and READ COMMITTED SNAPSHOT isolation levels, and it'll do a decent job.

General knowledge? It's got plenty.

Your knowledge? Zero.

What the Training Data Doesn't Contain

Here's what wasn't in the dataset:

  • Your schema
  • Your table names
  • Your business logic
  • Why CustomerID is an INT in one table and a VARCHAR(20) in another
  • That dbo.Orders_OLD_DONOTUSE is actually still in use
  • The fact that Status = 3 means 'canceled' but Status = 7 also means 'canceled' because someone added it in 2017 during a migration and no one cleaned it up
  • Why you can't drop that index even though it very clearly is redundant
  • The stored procedure that runs fine in dev but brings production to its knees because of parameter sniffing

The LLM has never seen your environment. It has no context window that includes your Monday morning standup, your incident reports, or that Slack message from 2022 where someone explained why the ETL job has a 45-minute delay built in.

It's working from first principles and pattern matching. That's powerful — until it isn't.

The Hallucination Problem

When an LLM doesn't know something, it doesn't say 'I don't know'. It guesses and it assumes. Confidently. This is called a hallucination. Ask Claude or ChatGPT a question about your data that they cannot possibly answer. They will give you fluff, and when you question it, they will admit they made an assumption. They will say they guessed. And they will apologize for it.

In conversation, hallucinations are awkward. In SQL, they're dangerous.

Here's an example. You ask Copilot:

"Write a query to get all active customers from the last 90 days"

It produces:

SELECT CustomerID, CustomerName, Email
FROM dbo.Customers
WHERE IsActive = 1
  AND CreatedDate >= DATEADD(DAY, -90, GETDATE());

Looks clean. Runs fine. Except:

  • Your table is called tblCustomer, not dbo.Customers
  • The column is ActiveFlag, not IsActive
  • 'Active' in your system means Status IN (1, 4, 6) -- it's not a boolean
  • You track activity by LastOrderDate, not CreatedDate

The LLM gave you a syntactically perfect solution for a database that is not yours. It hallucinated your schema, your column names, and your business rules — all with no questions. Pure assumption.
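For contrast, a query that fits the (made-up) schema described above would look more like this:

SELECT CustomerID, CustomerName, Email
FROM dbo.tblCustomer
WHERE Status IN (1, 4, 6)                              -- 'active' in this system, not a boolean flag
  AND LastOrderDate >= DATEADD(DAY, -90, GETDATE());   -- activity, not creation date

Same intent, completely different query. And only someone who knows the schema can bridge that gap.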

This is not a bug. This is how LLMs work. They predict the most likely next token based on patterns in the training data. When the actual answer isn't in the training data, they produce the most plausible-sounding alternative. Again, they assume. They guess. They 'make it up'.

Plausible isn't the same as correct.

Context Windows and Their Limits

Modern LLMs have a context window — the amount of text they can 'see' at once. GPT-4 can handle around 128K tokens. Claude can go higher. Sounds like a lot.

It's not enough.

Your production database probably has hundreds of tables, thousands of columns, stored procedures, views, functions, triggers -- and a decade of accumulated decisions. Even if you could paste your entire schema into the prompt, the model doesn't understand it the way you do. It sees text. It doesn't see relationships, intent or history.

And context windows reset. Every new conversation starts fresh. The LLM doesn't remember what you told it yesterday — unless you tell it again.

RAG: The Workaround (And Its Limits)

Retrieval-Augmented Generation — RAG — is the industry's current answer to this problem. Instead of relying solely on training data, RAG systems retrieve relevant documents at query time and inject them into the prompt.

In theory, you could point a RAG system at your schema, your documentation, your runbooks, and the LLM would then have context about your environment, rather than just generic SQL knowledge.

SQL Server 2025 even has native vector search to support this pattern — storing embeddings alongside your relational data so you can do semantic lookups without bolting on a separate vector database.
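A rough sketch of that pattern (SQL Server 2025 syntax, simplified; the table, columns, and dimension count are all made up):

-- Store embeddings next to the documentation they describe
CREATE TABLE dbo.SchemaNotes (
    NoteID        INT IDENTITY(1,1) PRIMARY KEY,
    NoteText      NVARCHAR(MAX),
    NoteEmbedding VECTOR(1536)    -- produced by whatever embedding model you use
);

-- At query time: pull the notes closest to the embedding of the user's question
DECLARE @question VECTOR(1536);   -- populated by the application
SELECT TOP (5) NoteID, NoteText
FROM dbo.SchemaNotes
ORDER BY VECTOR_DISTANCE('cosine', NoteEmbedding, @question);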

But RAG has limits too:

  • It only knows what you've documented
  • It can't retrieve knowledge that was never written down
  • It still hallucinates when the retrieval misses
  • It doesn't understand why — just what

You can feed it your schema. You can't feed it the hallway conversation from 2019 where someone explained why that column is nullable.

What This Means for DBAs

None of this means AI is useless. It means AI is a tool — and like any tool, it works better when you understand what it can and can't do.

Use it for syntax, not semantics

LLMs are great at 'how do I write a query that does X?' They're bad at 'should I write a query that does X?' That's still your job.

Always validate the output

Treat AI-generated SQL like code from a junior dev who's never seen your system. Review it. Test it. Never trust it blindly.

Provide context aggressively

The more you tell the LLM about your schema, your conventions, and your constraints -- the better the output. Paste in table definitions. Explain the business rules. Don't make it guess.

Document your environment

If you want AI to help future DBAs (or future you), write things down. The model can only retrieve what exists. That tribal knowledge in your head? Useless to a RAG system until you externalize it.

The Bottom Line

LLMs are trained on the world's public knowledge. Your database isn't public. Your business logic isn't in the training data. Your hard-won understanding of why things are the way they are — that's not something a neural network can learn from Stack Overflow.

The AI knows SQL. It doesn't know your SQL.

That's not a flaw in the technology. It's just the gap where your expertise lives. The LLM can help you work faster. It can't replace the context that makes your work valuable.

At least, not until someone figures out how to train a model on ten years of Slack messages, three org chart changes, and those really great whiteboard drawings that got erased in 2021.

I'm not holding my breath.

More to Read

Microsoft Learn: What's New in SQL Server 2025
sqlfingers: Death of the DBA (Again)
sqlfingers: New Year's Resolution for DBAs: Make Peace with Copilot

Who Trains the Senior DBAs of 2035?

Last week I wrote about the Death of the DBA (Again) and how AI, like every 'extinction event' before it, won't actually replace us. Thank you for reading. The responses were great! One anonymous comment really got my attention:

"What happens to the supply of junior DBAs when companies no longer have an incentive to hire them? Senior DBAs retire, and all that's left are the machines."

My response: "Very good question — and I don't have a tidy answer."

I've been thinking about it ever since. Not because I think we're doomed — but because this is the one problem AI can't solve for us. We have to.

The Numbers Are Already Moving

This isn't hypothetical doom-scrolling. It's happening.

According to SignalFire, which tracks job movements across 650 million LinkedIn profiles, new graduates made up just 7% of new hires at big tech companies in 2024. In 2023, that number was 25%.

A Stanford University study published in August 2025 found that the AI revolution is having a 'significant and disproportionate impact on entry-level workers in the U.S. labor market' — particularly 22- to 25-year-old software engineers.

Meanwhile, Anthropic CEO Dario Amodei has publicly warned that AI will eliminate half of junior white-collar jobs within five years.

So the trend line is clear. The question is whether we let it play out to an unknown extent — or we do something about it.

"One of the Dumbest Things I've Ever Heard"

Not everyone is buying what the AI hype machine is selling.

In August 2025, AWS CEO Matt Garman appeared on the Matthew Berman podcast and was asked about companies replacing junior staff with AI. His response was... direct:

"That is one of the dumbest things I've ever heard. They're probably the least expensive employees you have. They're the most leaned into your AI tools. How's that going to work when you go like 10 years in the future and you have no one that has built up or learned anything?"

He doubled down in December 2025 in an interview with WIRED:

"At some point that whole thing explodes on itself. If you have no talent pipeline that you're building and no junior people that you're mentoring and bringing up through the company, we often find that that's where we get some of the best ideas."

Garman runs the largest cloud infrastructure company on earth. He's not saying this because he's sentimental about new grads. He's saying it because he's done the math on what happens when you stop investing in people.

Spoiler: it explodes.

What Junior DBAs Actually Learn

Here's what Copilot can teach a junior DBA:

  • Syntax
  • Query patterns
  • How to Google faster

Congrats. You've trained a very expensive autocomplete.

Here's what Copilot can't teach:

  • Why the production database has that weird naming convention from 2012
  • Which developer to call when the nightly job fails (and which one to avoid)
  • That one table you never touch on Tuesdays because of the downstream dependencies no one will take ownership of
  • The instinct that something is wrong before the alerts fire
  • How to tell a VP 'no' without getting fired
  • What it feels like to bring a system back from the dead at 3 AM — and why you'll do what it takes to never do it again.

That knowledge transfers through proximity, mentorship, and supervised failure. You can't download it. You can't prompt-engineer it. You have to live it.

There's no training data for tribal knowledge. No neural network for judgment. That's not a bug in the model — it's the whole point.

The 2035 Math

Let's sketch this out.

I think the average DBA career spans about 25-30 years. If you entered the field in 2005-2010, you're now mid-career or approaching senior. If companies slow junior hiring now, the pipeline starts thinning immediately. By 2030, you feel it. By 2035, it's acute.

Year What Happens
2025 Junior hiring slows. AI handles 'easy' tasks.
2028 Mid-level shortage begins. Fewer people with 3-5 years experience.
2032 Senior DBAs start retiring. Replacements aren't ready.
2035 Salaries spike. No algorithm for institutional knowledge.

This isn't a prediction of doom. It's a prediction of opportunity — if you're on the right side of it.

What We Do About It

I'm not a policy maker. I'm a DBA. But here's what I know works:

Apprenticeship, not abandonment

Pair junior DBAs with seniors on real systems. Not sandboxes — production. Let them see what happens when a query goes sideways. Let them fix it with supervision. That's how judgment is built.

AI as training wheels, not a replacement

Use Copilot to accelerate learning, not skip it. A junior who uses AI to write a query and then has to explain why it's wrong learns more than one who just runs it and moves on.

Cross-training

Rotate new hires through development, operations, and DBA work. A DBA who has written application code understands why developers do what they do — and knows how to push back without starting a war. Speaking from experience: my time in the development layer was one of the biggest gains of my career. It changed how I see problems, how I communicate with dev teams, and honestly, how much I'm able to get done.

Write it down

That tribal knowledge everyone jokes about? Start documenting it. Not for the AI — for the humans who will need it when you're gone. Future you will thank present you. So will future them.

The Bottom Line

AI is not going to replace senior DBAs. We covered that last week.

But senior DBAs don't appear out of thin air. They come from junior DBAs who were given the chance to learn, fail, and grow. Cut off that pipeline, and in ten years we won't have a robot problem. We'll have a people problem.

The companies that figure this out — that keep hiring juniors, keep mentoring them, keep investing in the long game — will have senior DBAs in 2035.

The companies that don't? They'll be posting 'Senior DBA — URGENT' on LinkedIn and wondering why no one's applying.

Good luck with that.

More to Read

Entrepreneur: AWS CEO on Replacing Workers with AI
Fortune: AWS CEO Matt Garman on AI Displacing Junior Employees
IT Pro: AWS CEO on AI Replacing Software Developers
sqlfingers: Death of the DBA (Again)

Wednesday, January 21, 2026

Msg 8152, String or Binary Data Would Be Truncated

You're loading data. Maybe it's an ETL, maybe it's a migration... and this happens:

Msg 8152, Level 16, State 14, Line 222
String or binary data would be truncated.
The statement has been terminated.

The data doesn't fit. A value exceeds the column's defined length. But which column? Which row? The error doesn't say.

The Fix: Get a Better Error Message

SQL Server can tell you exactly which table, column, and data value caused the problem, like this:

Msg 2628, Level 16, State 1
String or binary data would be truncated in table 'dbo.Customers', column 'LastName'.
Truncated value: 'Wolfeschlegelsteinhausen'.

But to see this message rather than the useless one, you need one of the following:

Your Situation How to Get the Detailed Error
SQL Server 2019+ compat 150+ You already have it (default)
SQL Server 2019+ compat < 150 Change compatibility level or trace flag 460
SQL Server 2016 SP2 CU6+ / 2017 CU12+ Trace flag 460
Older Manual column comparison

Check Your Compatibility Level

If you're on SQL Server 2019 but still seeing the old Msg 8152, your database probably has a lower compatibility level (common after migrations):

SELECT name, compatibility_level FROM sys.databases WHERE name = 'YourDatabase';

If the result is below 150, that's your problem. To upgrade (Test first! This affects query optimizer behavior):

ALTER DATABASE YourDatabase SET COMPATIBILITY_LEVEL = 150;

Enable Trace Flag 460

If you're on SQL Server 2016/2017 and cannot change the compatibility level, you can enable trace flag 460. This flag replaces the generic truncation message with an enhanced one that includes the table name, column, and data value that caused the truncation. A huge improvement over the vague original.

You can enable Trace Flag 460 temporarily, on just the query that's failing, OR set it permanently at the server level -- the latter is my preference. Leaving Trace Flag 460 enabled has negligible overhead and is considered safe for continuous use.

** This flag is not needed on v2019+ with compatibility level 150 or higher. There was a known bug with this trace flag on v2017 CU13, but it was corrected in CU14 (KB4490142).

Option 1: Single Query

INSERT INTO Customers (FirstName, LastName, Email)
SELECT FirstName, LastName, Email FROM StagingTable
OPTION (QUERYTRACEON 460);

Option 2: Enable Permanently at Server Level

  1. Open SQL Server Configuration Manager (run as Administrator)
  2. In the left pane, select SQL Server Services
  3. Right-click your SQL Server instance and select Properties
  4. Go to the Startup Parameters tab
  5. In the "Specify a startup parameter" box, type: -T460
    (uppercase T, no space)
  6. Click Add, then Apply, then OK
  7. Restart the SQL Server service for the change to take effect
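A third option, if you need the detail right now and can't restart yet: turn the flag on globally for the running instance. It won't survive a restart, so keep the startup parameter for permanence.

DBCC TRACEON(460, -1);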

Verify the Flag is Active

After the service restart, verify the flag is enabled with this query in SSMS:

DBCC TRACESTATUS(460, -1);
GO

DEMO

-- Create a table with a narrow column
CREATE TABLE TruncDemo (
    ID INT,
    ProductName VARCHAR(10)
);

-- Try to insert data that's too long
INSERT INTO TruncDemo (ID, ProductName)
VALUES (1, 'Industrial Widget Pro Max');
Msg 8152, Level 16, State 14, Line 8
String or binary data would be truncated.
The statement has been terminated.

With trace flag 460:

-- Enable the flag and try again
INSERT INTO TruncDemo (ID, ProductName)
VALUES (1, 'Industrial Widget Pro Max')
OPTION (QUERYTRACEON 460);
Msg 2628, Level 16, State 1, Line 8
String or binary data would be truncated in table 'DBA.dbo.TruncDemo', column 'ProductName'. Truncated value: 'Industrial'.
The statement has been terminated.

Now you know the table, column and exact data value that didn't fit.

Older Versions

Regrettably, there is no built-in help. You need to compare source and target column lengths manually:

SELECT 
    c.name ColumnName,
    t.name DataType,
    c.max_length
FROM sys.columns c JOIN sys.types t 
  ON c.user_type_id = t.user_type_id
WHERE object_id = OBJECT_ID('YourTargetTable')
ORDER BY c.column_id;

AI Tools

If you paste the error message, failing INSERT and table definitions into Claude or ChatGPT, they will identify the offending column/data value that produced the truncation. No special technique — just show the AI the schema and the error.

More to Read

Microsoft: Trace Flags
Brent Ozar: How to Fix the Error
Microsoft Tech Community: Replacing the Infamous Error 8152

Tuesday, January 20, 2026

I/O Requests Taking Longer Than 15 Seconds -- Now What?

In a previous post, I shared a script to detect the I/O requests taking longer than 15 seconds warning across your SQL Server inventory.  Now let's talk about what to do when you find it.

Here are five of the most common causes with some tips to investigate each:

1. Storage Subsystem Bottlenecks

This is the most common cause. Your SAN, NAS, or local disks are overloaded, hitting IOPS limits, experiencing network/iSCSI latency, or suffering from VM I/O throttling ('noisy neighbors'). Driver/firmware issues and even the Windows power plan can contribute.

Check file-level latency from inside SQL Server:

SELECT
    DB_NAME(fs.database_id) database_name,
    mf.name logical_name,
    mf.physical_name,
    CASE WHEN num_of_reads = 0 THEN 0 
         ELSE (io_stall_read_ms / num_of_reads) END avg_read_latency_ms,
    CASE WHEN num_of_writes = 0 THEN 0 
         ELSE (io_stall_write_ms / num_of_writes) END avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) fs INNER JOIN sys.master_files mf 
  ON fs.database_id = mf.database_id 
  AND fs.file_id = mf.file_id
ORDER BY (io_stall_read_ms + io_stall_write_ms) DESC;

Read latency consistently over 20ms or write latency over 10ms indicates a problem. For transaction logs, you want sub-millisecond writes. Also check Perfmon counters Avg. Disk sec/Read and Avg. Disk sec/Write for the affected volumes. If latencies are high at the OS level, pull in your storage/infrastructure team to review further.

CAUSE: External — not SQL Server.

2. Tempdb Contention

Tempdb gets special treatment because it's so critical.  Causes include too few data files, tempdb on slow storage, heavy spills from sorts/hashes, or version store pressure.

Check for PAGELATCH waits on tempdb allocation pages:

SELECT 
    session_id,
    wait_type,
    wait_time,
    wait_resource
FROM sys.dm_exec_requests
WHERE wait_type LIKE 'PAGELATCH%'
  AND wait_resource LIKE '2:%' --- The '2:' prefix means database_id 2, which is tempdb.
ORDER BY wait_time DESC;

Check tempdb file count vs. CPU cores:

SELECT 
    COUNT(*) tempdb_data_files,
    (SELECT cpu_count FROM sys.dm_os_sys_info) logical_cpus
FROM sys.master_files
WHERE database_id = 2 AND type = 0;

If you have fewer tempdb data files than logical CPUs (up to 8), and you're seeing PAGELATCH_UP or PAGELATCH_EX waits on pages like 2:1:1 (PFS) or 2:1:3 (SGAM), consider adding more equally-sized tempdb data files.
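Adding a file is straightforward. A sketch (the path, size, and growth are placeholders; match them to your existing tempdb files):

ALTER DATABASE tempdb
ADD FILE (
    NAME       = tempdev5,
    FILENAME   = 'T:\TempDB\tempdev5.ndf',
    SIZE       = 2048MB,
    FILEGROWTH = 256MB
);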

CAUSE: SQL Server configuration.

3. File Autogrowth Delays

When data or log files auto-grow — especially if IFI (Instant File Initialization) isn't enabled — Windows has to zero-fill the new space. This can freeze I/O for seconds or longer.

Check if IFI is enabled:

SELECT 
    servicename,
    instant_file_initialization_enabled
FROM sys.dm_server_services
WHERE servicename LIKE 'SQL Server (%';

Check current autogrowth settings:

SELECT 
    DB_NAME(database_id) database_name,
    name logical_name,
    type_desc,
    CASE WHEN is_percent_growth = 1 
         THEN CAST(growth AS VARCHAR) + '%'
         ELSE CAST((growth * 8) / 1024 AS VARCHAR) + ' MB' 
    END autogrowth_setting,
    (size * 8) / 1024 current_size_mb
FROM sys.master_files
--WHERE database_id > 4 -- uncomment this to look at ALL database files
ORDER BY database_id, type;

IFI should be enabled — it's one of my default recommendations on any SQL Server build that does not have TDE (Transparent Data Encryption) enabled. Autogrowth settings should be in MB, not percentages, and sized appropriately.  Small increments cause frequent grows; huge increments cause long freezes.
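Switching a file from percentage growth to a fixed increment is one statement per file. A sketch with placeholder names and sizes:

ALTER DATABASE YourDatabase
MODIFY FILE (NAME = YourDatabase_Data, FILEGROWTH = 256MB);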

CAUSE: SQL Server configuration.

4. Antivirus, Filter Drivers & Maintenance Conflicts

Real-time AV scanning of MDF/NDF/LDF files, VSS snapshots, encryption software, backup agents, or index rebuilds and DBCC checks competing for I/O during peak hours can all trigger the warning.

List filter drivers attached to your SQL Server volumes (run from an elevated command prompt):

fltmc instances

Look for antivirus drivers (altitudes 320000-329998) attached to your data volumes.

Correlate warning times with maintenance jobs:

SELECT 
    j.name job_name,
    h.step_name,
    h.run_date,
    h.run_time,
    h.run_duration
FROM msdb.dbo.sysjobhistory h INNER JOIN msdb.dbo.sysjobs j 
  ON h.job_id = j.job_id
WHERE h.run_status = 1
AND h.run_date >= CONVERT(INT, CONVERT(VARCHAR(8), DATEADD(DAY, -7, GETDATE()), 112))
ORDER BY h.run_date DESC, h.run_time DESC;

If the 15-second warnings correlate with backup jobs, index rebuilds, or DBCC CHECKDB, you should research whether they can be rescheduled. If AV scanning is active on your data volumes, work with your security team to exclude SQL Server files.

CAUSE: External / SQL Server scheduling.

5. Query/Workload Pressure

Missing indexes, large scans, or memory pressure forcing excessive physical reads can overwhelm even healthy storage.

Check wait stats for I/O-related waits:

SELECT TOP 10
    wait_type,
    wait_time_ms / 1000.0 wait_time_sec,
    (wait_time_ms - signal_wait_time_ms) / 1000.0 resource_wait_sec,
    waiting_tasks_count
FROM sys.dm_os_wait_stats
WHERE wait_type IN (
	'PAGEIOLATCH_SH', 'PAGEIOLATCH_EX', 'WRITELOG', 'IO_COMPLETION', 'ASYNC_IO_COMPLETION'
    )
ORDER BY wait_time_ms DESC;

Check for missing indexes:

SELECT TOP 10
    DB_NAME(d.database_id) database_name,
    OBJECT_NAME(d.object_id, d.database_id) table_name,
    d.equality_columns,
    d.inequality_columns,
    d.included_columns,
    s.user_seeks,
    s.avg_user_impact
FROM sys.dm_db_missing_index_details d INNER JOIN sys.dm_db_missing_index_groups g 
  ON d.index_handle = g.index_handle INNER JOIN sys.dm_db_missing_index_group_stats s 
    ON g.index_group_handle = s.group_handle
WHERE d.database_id > 4
ORDER BY (s.user_seeks * s.avg_user_impact) DESC;

High PAGEIOLATCH_SH waits often indicate queries doing more physical reads than necessary.  This might be a memory issue (not enough buffer pool) or an indexing issue (scans instead of seeks).  Missing indexes with high user_seeks and avg_user_impact are good candidates for creation.

CAUSE: Workload / SQL Server configuration.

Bottom Line: Is it SQL Server or Not?

Run the diagnostics above.  If your file latencies (from sys.dm_io_virtual_file_stats) are high AND Perfmon shows healthy disk response times, the problem is very likely inside SQL Server — configuration, queries, or scheduling.

If both SQL Server and the OS agree that disks are slow, approach your infrastructure team with this evidence.  The 15-second warning is SQL Server telling you something is wrong.  Your job is to figure out which side of the fence the problem lives on.


5 SQL Server Misconfigurations Hiding in Plain Sight

A few weeks back, a customer reached out to me:

We recently had a performance issue with the Web application that may have been tied to database tuning. Could you please take a quick look at the sql server and see if there are any "red flags" for performance or tuning, or opportunities to improve performance?

I ran a fast health check that afternoon. In under an hour, I had the answer — five configuration issues, most of them out-of-the-box defaults that had never been touched, all of them silently compounding since the day the server was built.

Here's what I found:

1. Max Server Memory: Never configured.
The setting was still at 2,147,483,647 MB — roughly 2 petabytes. The server had 12GB of physical RAM. Because it wasn't given a cap, SQL Server was continually trying to claim more than it had to give, starving the OS and anything else trying to run on that box. I recommended setting it to 8GB (see the sketch after this list).

2. Power Plan: Balanced mode.
The server had 2.10GHz processors, but Windows was throttling them to save energy -- on a database server. 🤔 I can assure you, SQL Server wants and needs high performance. I recommended switching to High Performance.

3. TempDB: Undersized and misconfigured.
Four data files at 392MB each, a 72GB log file, and all of them set to grow in 64MB increments. This is a bottleneck waiting to happen. I recommended adding 4 more data files and resizing all files to 2GB each with appropriate filegrowth settings.

4. Cost Threshold for Parallelism: Default of 5.
This configuration setting hasn't been appropriate for most workloads in over two decades. I recommended changing it to 60.

5. Transaction log larger than the data.
The user database was in SIMPLE recovery mode with a 3.3GB transaction log and an 813MB data file. That's a sign something went sideways at some point and was never corrected.
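For reference, here's roughly what the instance-level pieces of that look like in T-SQL (the values are from this engagement; adjust for your own hardware and test first):

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

EXEC sp_configure 'max server memory (MB)', 8192;            -- leave headroom for the OS on a 12GB box
EXEC sp_configure 'cost threshold for parallelism', 60;
RECONFIGURE;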

Beyond the configuration issues, I also found evidence of serious plan cache inefficiency: one query alone had generated over 1,000 different execution plans, and there were 14,000+ single-use plans consuming memory that should have been caching data.
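
If you want to look for the same symptoms on your own server, this is roughly how I spot them:

-- queries generating many cached plans for the same query_hash
SELECT TOP 10 query_hash, COUNT(*) cached_plans
FROM sys.dm_exec_query_stats
GROUP BY query_hash
ORDER BY cached_plans DESC;

-- single-use plans sitting in the cache
SELECT COUNT(*) single_use_plans,
       SUM(CAST(size_in_bytes AS BIGINT)) / 1024 / 1024 size_mb
FROM sys.dm_exec_cached_plans
WHERE usecounts = 1 AND objtype IN ('Adhoc', 'Prepared');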

But here's the thing — those kinds of symptoms are almost expected when the underlying server configuration is this far off. You can't get a clear read on plan cache behavior when the server is starved for memory and throttled at the CPU. Fix the foundation first, let it burn in, then revisit the plan cache for inefficiencies.

All of the above fixes took only a few minutes to apply, and none of them required any service interruption.


Two weeks later, I checked back in.

The server looked better across the board. That query with 1,008 cached plans? Down to 857 — and we hadn't touched the code. The plan cache was behaving more predictably now that the server had proper resources to work with.

The remaining issues I flagged — cursor usage, implicit conversions, key lookups, 20,000+ ad hoc plans — were application-level concerns, not configuration. Now we knew exactly where to look next, and we had a stable baseline to measure against.

My final recommendation: enable optimize for ad hoc workloads, clean up the mass of user tables that had accumulated in the master database, and move forward with a cleaner foundation.
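
That first recommendation is another sp_configure setting:

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'optimize for ad hoc workloads', 1;
RECONFIGURE;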


This is what a health check does.

It's not magic. It's experience and knowing where to look. Many SQL Server environments have issues like these — settings that were never tuned, defaults that don't make sense for production workloads, small problems that compound quietly over the years before they surface.

You don't know they're there until someone looks.

If your SQL Server has never had a health check, or if it's been a while - let's talk. I do this remotely, I do it fast, and I will find the things that have been silently costing you since day one.

Thursday, January 15, 2026

SQL Server Error 9002: Your Transaction Log Is Full

It's 2am. Your phone wakes you. You rub your eyes, check your email, and there it is:

Error: 9002, Severity: 17, State: 4
The transaction log for database 'trading' is full due to 'LOG_BACKUP'.

The database is still online. Looks ok. You can read from it. But every INSERT, UPDATE, and DELETE fails. Production night-trading is effectively down.

The good news: it's fixable -- but the fix depends entirely on what's preventing log truncation.

Step 1: Find Out Why

Run this:

SELECT name, log_reuse_wait_desc 
FROM sys.databases 
WHERE name = 'trading';

The log_reuse_wait_desc column tells you exactly what's blocking truncation. Here are the most common values you'll see and how to respond to each:

log_reuse_wait_desc = LOG_BACKUP

You're in FULL or BULK_LOGGED recovery and the log hasn't been backed up recently. Check when the last successful log backup ran:

SELECT database_name, backup_finish_date, backup_size
FROM msdb.dbo.backupset
WHERE type = 'L' 
AND database_name = 'trading'
ORDER BY backup_finish_date DESC;

If backups are running but this still happens, either they're failing (check job history) or the frequency isn't keeping up with your transaction volume. Take a log backup now:

BACKUP LOG trading TO DISK = 'D:\Backups\trading_log.trn';

Then increase backup frequency if needed.

log_reuse_wait_desc = ACTIVE_TRANSACTION

Something's holding a transaction open. Find it:

DBCC OPENTRAN('trading');

You'll get the SPID of the oldest open transaction. Investigate with DBCC INPUTBUFFER(spid) to see what it's running. If it's abandoned, kill it. If it's legitimate, wait, or have a conversation with the application team about transaction scope. The culprit could be anything. Recently we had a CDC job running long, and log_reuse_wait_desc showed CDC instead.
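
If DBCC OPENTRAN isn't enough and you want to see who owns the transaction and how long it's been open, a DMV-based sketch like this also works:

SELECT 
    s.session_id,
    s.login_name,
    s.host_name,
    s.program_name,
    t.transaction_begin_time,
    DATEDIFF(MINUTE, t.transaction_begin_time, GETDATE()) open_minutes
FROM sys.dm_tran_active_transactions t INNER JOIN sys.dm_tran_session_transactions st
  ON t.transaction_id = st.transaction_id INNER JOIN sys.dm_exec_sessions s
    ON st.session_id = s.session_id
ORDER BY t.transaction_begin_time;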

log_reuse_wait_desc = REPLICATION

The Log Reader Agent isn't keeping up, or there's orphaned replication metadata. The fastest path to check status is the Replication Monitor:

• In SSMS, right-click the Replication folder in Object Explorer and select Launch Replication Monitor.

• Expand the Publisher and publication groups to see the status of all agents and subscriptions. The 'All Subscriptions' tab displays details, including 'Undistributed commands' — that's your backlog.
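
If you'd rather stay in a query window, the publisher side exposes a couple of procedures worth knowing -- a sketch, assuming transactional replication:

-- run at the Publisher
EXEC sp_replcounters;   -- throughput and latency per published database

-- run at the Publisher, in the publication database
EXEC sp_repltrans;      -- log records marked for replication but not yet delivered to the distributor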

log_reuse_wait_desc = AVAILABILITY_REPLICA

Your AG secondary isn't keeping up. The primary can't truncate the log until it's been delivered, hardened, and applied on all secondaries. Check the queue on the primary and/or the redo waits on the secondary:

-- run on primary
SELECT 
    db.name AS database_name,
    drs.synchronization_state_desc,
    drs.log_send_queue_size AS log_send_queue_kb,
    drs.redo_queue_size AS redo_queue_kb,
    drs.redo_rate,
    drs.last_sent_time,
    drs.last_redone_time
FROM sys.dm_hadr_database_replica_states drs JOIN sys.databases db 
  ON drs.database_id = db.database_id
WHERE drs.is_local = 0
ORDER BY drs.redo_queue_size DESC;

-- run on secondary to check for redo thread waits
SELECT * FROM sys.dm_exec_requests 
WHERE command LIKE '%REDO%' OR command LIKE '%HADR%'
ORDER BY start_time DESC;

High log_send_queue_size means network latency or throughput issues. High redo_queue_size means the secondary can't apply changes fast enough — check for resource bottlenecks or blocking on that replica. If a secondary is way behind, you have a performance problem on that replica.

NOTHING

If it says NOTHING, the log should truncate on the next checkpoint or log backup. If you're still out of space, you may just need to shrink:

DBCC SHRINKFILE(trading_log, 1024);

Yes, I know shrinking is generally bad. At 2AM with a full log, you may feel differently.

Prevention

If you need point-in-time recovery, you need to be in the FULL recovery model and you need transaction log backups. Period. Every 15-30 minutes works well for many busy databases, but the right frequency depends on your transaction volume and how much data you can afford to lose. If you don't need point-in-time recovery, make things easy: switch to the SIMPLE recovery model. Problem solved.
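
Either way it's only a statement or two -- the schedule itself belongs in an Agent job or your backup tool:

-- staying in FULL: back the log up on a schedule
BACKUP LOG trading TO DISK = 'D:\Backups\trading_log.trn';

-- no point-in-time recovery needed: switch to SIMPLE
ALTER DATABASE trading SET RECOVERY SIMPLE;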

More to Read

SQL Server 2025 Upgrade: Three Errors That Will Ruin Your Day - and How to Fix Them

SQL Server 2025 went GA in November. The upgrades have begun... and the error messages are rolling in.

I've been tracking what's biting people out there, and the same three issues keep coming up — none are showstoppers, but all of them will derail your upgrade if you're not prepared.

Here's what to watch for, and how to fix it.


1. Full-Text Search Breaks Immediately

This one's nasty because it happens silently. Your upgrade completes successfully, everything looks fine — and then your full-text queries start failing:

Msg 30010, Level 16, State 2
An error has occurred during the full-text query. Common causes include: 
word-breaking errors or timeout, FDHOST permissions/ACL issues, service 
account missing privileges, malfunctioning IFilters, communication channel 
issues with FDHost and sqlservr.exe, etc.

huh?

Why this happens: SQL Server 2025 completely rebuilt the full-text search components with a modern toolset. The old word breaker and filter binaries are gone. Your existing indexes are still marked as 'version 1' — but the binaries they need no longer exist on disk.

The fix: Rebuild your full-text catalogs to use the new version 2 components.

-- Check your current index version setting
SELECT [name], [value]
FROM sys.database_scoped_configurations
WHERE [name] = 'FULLTEXT_INDEX_VERSION';

-- Rebuild the catalog (this upgrades indexes to version 2)
ALTER FULLTEXT CATALOG [YourCatalogName] REBUILD;

The catch: Rebuilding large catalogs is time-consuming and will hit your CPU and I/O. Test this in a lower environment first, and don't do it during peak hours.

If you absolutely must stay on version 1 for compatibility reasons, you can copy the legacy binaries from an older instance — but that's a band-aid, not a solution. Details in Microsoft's breaking changes documentation.


2. Linked Servers Fail with TLS Errors (this is a good one)

After upgrading, your linked server queries start throwing errors like confetti:

Msg 7303, Level 16, State 1
Cannot initialize the data source object of OLE DB provider "MSOLEDBSQL" 
for linked server "YourLinkedServer".
TCP Provider: The certificate chain was issued by an authority that is not trusted.

Or this gem:

Msg 17832, Level 20, State 18
The login packet used to open the connection is structurally invalid; 
the connection has been closed.

Why it happens: SQL Server 2025 uses MSOLEDBSQL 19, which enforces strict certificate validation by default. The old 'just trust me' behavior is gone. If your linked servers were set up with self-signed certificates or no certificate validation, they will fail.

The proper fix: Configure your servers with proper certificates — either from a public CA or your internal certificate authority. This is Microsoft's recommended approach and the most secure option.

The quick-and-dirty fix: If you need things working NOW and will address certificates later, you can override the secure default:

-- For existing linked servers, recreate with TrustServerCertificate
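-- if a definition already exists, drop it first (sp_dropserver shown as one option)
EXECUTE sp_dropserver @server = N'YourLinkedServer', @droplogins = 'droplogins';
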
EXECUTE sp_addlinkedserver 
    @server = N'YourLinkedServer', 
    @srvproduct = N'', 
    @provider = N'MSOLEDBSQL', 
    @provstr = N'encrypt=mandatory;trustservercertificate=yes', 
    @datasrc = N'YourLinkedServer';

Yes, this is less secure. Yes, you should fix it properly. But sometimes you need to stop the bleeding before you can address the root cause.

Bonus pain: This also affects replication if you have a remote distributor. Same root cause, same fix options. See Microsoft's documentation for the replication-specific stored procedures.


3. Setup Restarts Your Instance During the Health Check

This one is more of a 'gotcha' than an error — but it can absolutely ruin your day if you're not expecting it.

When upgrading from SQL Server 2022 to 2025, the setup wizard restarts your SQL Server service during the health check phase. Not during the actual upgrade. During the pre-check.

If you're like me, you like to prep everything, answer all the dialogs, and then pause at the final step so you can click 'Upgrade' at exactly the right moment during your maintenance window.

You can't do that anymore. The moment you proceed past a certain point in the wizard, your instance goes down — even if you haven't committed to the upgrade yet.

Why it happens: The health check needs to evaluate certain rules that require stopping and starting the service. Microsoft's logs show it's checking things like Engine_IsLPIMEnabledForX64.

The fix: Plan for it. Your maintenance window starts when you launch the upgrade wizard, not when you click the final button. Communicate this to your stakeholders.

From Aaron Bertrand's recent post on this:

"I can no longer advocate doing that as it is more intrusive than it used to be."


Before You Upgrade

A quick checklist:

Full-Text Search: Inventory your full-text catalogs (a quick inventory query follows this list). Plan rebuild time. Test in dev first.
Linked Servers: Check your certificate situation. Fix it properly or plan for the workaround.
Maintenance Window: Start it earlier than you think. The wizard is more intrusive now.
Read the docs: Microsoft's known issues page and breaking changes documentation are worth your time.
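
For that full-text inventory, something along these lines (run per database) lists each catalog and the tables behind it so you can gauge rebuild time:

SELECT 
    c.name catalog_name,
    OBJECT_SCHEMA_NAME(i.object_id) [schema],
    OBJECT_NAME(i.object_id) [table],
    i.is_enabled,
    i.crawl_end_date last_crawl
FROM sys.fulltext_catalogs c INNER JOIN sys.fulltext_indexes i
  ON c.fulltext_catalog_id = i.fulltext_catalog_id
ORDER BY c.name;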

SQL Server 2025 is solid. But like every major version, the upgrade path has some surprises. Better to know about them now than after you've begun the upgrade.


More to Read

Known Issues So Far in SQL Server 2025 — Brent Ozar
sp_addlinkedserver (Transact-SQL) — Microsoft Docs