Thursday, January 22, 2026

What the LLM Doesn't Know About Your Database

Copilot has read more SQL documentation than you ever will. It has ingested Stack Overflow, Microsoft Learn, thousands of GitHub repos, and probably all those posts you bookmarked in 2019 and never finished reading.

It still doesn't know your database.

And that gap — between what an LLM learned and what it knows about you — is where things get interesting.

What the Training Data Contains

Large Language Models learn by processing massive datasets — the 'training data'. For coding assistants like Copilot, that includes:

  • Public documentation (Books Online, Microsoft Learn, Oracle docs)
  • Open-source repositories (GitHub, GitLab)
  • Q&A sites (Stack Overflow, DBA Stack Exchange)
  • Blog posts, tutorials, articles
  • Sample databases (AdventureWorks, Northwind, WideWorldImporters)

This is genuinely impressive. Ask Copilot to write a query that calculates running totals with a window function, and it'll nail it. Ask it to explain the difference between SNAPSHOT and READ COMMITTED SNAPSHOT isolation levels, and it'll do a decent job.
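
Here's the kind of thing it gets right all day long -- table and column names invented for the example, because any orders table would do:

-- Running total per customer, ordered by date, using a window function.
-- dbo.Orders, CustomerID, OrderDate, and OrderTotal are illustrative names.
SELECT
    CustomerID,
    OrderDate,
    OrderTotal,
    SUM(OrderTotal) OVER (
        PARTITION BY CustomerID
        ORDER BY OrderDate
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS RunningTotal
FROM dbo.Orders;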

General knowledge? It's got plenty.

Your knowledge? Zero.

What the Training Data Doesn't Contain

Here's what wasn't in the dataset:

  • Your schema
  • Your table names
  • Your business logic
  • Why CustomerID is an INT in one table and a VARCHAR(20) in another
  • That dbo.Orders_OLD_DONOTUSE is actually still in use
  • The fact that Status = 3 means 'canceled' but Status = 7 also means 'canceled' because someone added it in 2017 during a migration and no one cleaned it up
  • Why you can't drop that index even though it very clearly is redundant
  • The stored procedure that runs fine in dev but brings production to its knees because of parameter sniffing

The LLM has never seen your environment. It has no context window that includes your Monday morning standup, your incident reports, or that Slack message from 2022 where someone explained why the ETL job has a 45-minute delay built in.

It's working from first principles and pattern matching. That's powerful — until it isn't.

The Hallucination Problem

When an LLM doesn't know something, it doesn't say 'I don't know'. It guesses. It assumes. Confidently. This is called a hallucination. Ask Claude or ChatGPT a question about your data that they cannot possibly answer, and they'll give you fluff. Press them on it and they'll admit they made an assumption. They'll say they guessed. And they'll apologize for it.

In conversation, hallucinations are awkward. In SQL, they're dangerous.

Here's an example. You ask Copilot:

"Write a query to get all active customers from the last 90 days"

It produces:

SELECT CustomerID, CustomerName, Email
FROM dbo.Customers
WHERE IsActive = 1
  AND CreatedDate >= DATEADD(DAY, -90, GETDATE());

Looks clean. Runs fine. Except:

  • Your table is called tblCustomer, not dbo.Customers
  • The column is ActiveFlag, not IsActive
  • 'Active' in your system means Status IN (1, 4, 6) -- it's not a boolean
  • You track activity by LastOrderDate, not CreatedDate

The LLM gave you a syntactically perfect solution for a database that isn't yours. It hallucinated your schema, your column names, and your business rules, and it never asked a single question. Pure assumption.
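
For the sake of the example, here's the same request rewritten against that (entirely hypothetical) schema -- the query you actually needed:

-- Same request, written against the made-up-for-this-post schema.
SELECT CustomerID, CustomerName, Email
FROM dbo.tblCustomer
WHERE Status IN (1, 4, 6)     -- 'active' is a set of status codes, not the ActiveFlag bit
  AND LastOrderDate >= DATEADD(DAY, -90, GETDATE());   -- activity = last order, not creation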

This is not a bug. This is how LLMs work. They predict the most likely next token based on patterns in the training data. When the actual answer isn't in the training data, they produce the most plausible-sounding alternative. Again, they assume. They guess. They 'make it up'.

Plausible isn't the same as correct.

Context Windows and Their Limits

Modern LLMs have a context window — the amount of text they can 'see' at once. GPT-4 can handle around 128K tokens. Claude can go higher. Sounds like a lot.

It's not enough.

Your production database probably has hundreds of tables, thousands of columns, stored procedures, views, functions, triggers -- and a decade of accumulated decisions. Even if you could paste your entire schema into the prompt, the model doesn't understand it the way you do. It sees text. It doesn't see relationships, intent or history.
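
If you want a rough sense of how much metadata that is, the catalog views will tell you. A quick sketch -- nothing here is specific to any one environment:

-- Rough inventory of how much schema an LLM would have to 'hold' for this database.
SELECT
    (SELECT COUNT(*) FROM sys.tables)     AS TableCount,
    (SELECT COUNT(*) FROM sys.columns)    AS ColumnCount,
    (SELECT COUNT(*) FROM sys.procedures) AS ProcedureCount,
    (SELECT COUNT(*) FROM sys.views)      AS ViewCount,
    (SELECT COUNT(*) FROM sys.triggers)   AS TriggerCount;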

And context windows reset. Every new conversation starts fresh. The LLM doesn't remember what you told it yesterday — unless you tell it again.

RAG: The Workaround (And Its Limits)

Retrieval-Augmented Generation — RAG — is the industry's current answer to this problem. Instead of relying solely on training data, RAG systems retrieve relevant documents at query time and inject them into the prompt.

In theory, you could point a RAG system at your schema, your documentation, your runbooks, and the LLM would then have context about your environment, rather than just generic SQL knowledge.

SQL Server 2025 even has native vector search to support this pattern — storing embeddings alongside your relational data so you can do semantic lookups without bolting on a separate vector database.
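
As a sketch of the pattern only -- the VECTOR type and VECTOR_DISTANCE function shown here are the ones documented for SQL Server 2025 / Azure SQL, so check the syntax against your build before leaning on it:

-- Store embeddings next to the text they describe, then pull back the closest
-- matches at query time to inject into the prompt. @QuestionEmbedding is a
-- hypothetical parameter holding the embedded user question.
CREATE TABLE dbo.SchemaDocs
(
    DocID     INT IDENTITY PRIMARY KEY,
    DocText   NVARCHAR(MAX) NOT NULL,   -- runbook entry, table description, etc.
    Embedding VECTOR(1536)  NOT NULL    -- embedding generated outside the database
);

SELECT TOP (5) DocID, DocText
FROM dbo.SchemaDocs
ORDER BY VECTOR_DISTANCE('cosine', Embedding, @QuestionEmbedding);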

But RAG has limits too:

  • It only knows what you've documented
  • It can't retrieve knowledge that was never written down
  • It still hallucinates when the retrieval misses
  • It doesn't understand why — just what

You can feed it your schema. You can't feed it the hallway conversation from 2019 where someone explained why that column is nullable.

What This Means for DBAs

None of this means AI is useless. It means AI is a tool — and like any tool, it works better when you understand what it can and can't do.

Use it for syntax, not semantics

LLMs are great at 'how do I write a query that does X?' They're bad at 'should I write a query that does X?' That's still your job.

Always validate the output

Treat AI-generated SQL like code from a junior dev who's never seen your system. Review it. Test it. Never trust it blindly.
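
One cheap habit for anything that modifies data: dry-run it in dev inside a transaction you roll back, and check the row count before anyone commits anything. The UPDATE below is just a stand-in for whatever the assistant produced:

-- Dry-run an AI-generated UPDATE: see what it touches, then roll back.
BEGIN TRANSACTION;

UPDATE dbo.Orders
SET Status = 3
WHERE OrderDate < '2017-01-01';     -- the generated statement under review

SELECT @@ROWCOUNT AS RowsAffected;  -- does this match what you expected?

ROLLBACK TRANSACTION;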

Provide context aggressively

The more you tell the LLM about your schema, your conventions, and your constraints -- the better the output. Paste in table definitions. Explain the business rules. Don't make it guess.
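
What that can look like in practice -- everything below is hypothetical, but it's the shape of context worth pasting before you ask for a query:

-- Context worth giving the model up front: real DDL plus the business rules
-- that aren't visible in the DDL. (Names and rules here are made up.)
CREATE TABLE dbo.tblCustomer
(
    CustomerID    INT           NOT NULL PRIMARY KEY,
    CustomerName  NVARCHAR(200) NOT NULL,
    Email         NVARCHAR(320) NULL,
    Status        TINYINT       NOT NULL,   -- 1, 4, 6 = active; 3 and 7 = canceled
    LastOrderDate DATETIME2     NULL        -- activity is measured from here, not CreatedDate
);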

Document your environment

If you want AI to help future DBAs (or future you), write things down. The model can only retrieve what exists. That tribal knowledge in your head? Useless to a RAG system until you externalize it.
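
One low-effort place to start is putting that knowledge where tooling (and a retrieval pipeline) can actually find it. Extended properties are one option -- the table and the rule below are the hypothetical ones from earlier in this post:

-- Attach the tribal knowledge to the object itself so it can be retrieved later.
EXEC sys.sp_addextendedproperty
    @name       = N'Description',
    @value      = N'Status 3 and 7 both mean canceled; 7 was added during the 2017 migration.',
    @level0type = N'SCHEMA', @level0name = N'dbo',
    @level1type = N'TABLE',  @level1name = N'Orders',
    @level2type = N'COLUMN', @level2name = N'Status';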

The Bottom Line

LLMs are trained on the world's public knowledge. Your database isn't public. Your business logic isn't in the training data. Your hard-won understanding of why things are the way they are — that's not something a neural network can learn from Stack Overflow.

The AI knows SQL. It doesn't know your SQL.

That's not a flaw in the technology. It's just the gap where your expertise lives. The LLM can help you work faster. It can't replace the context that makes your work valuable.

At least, not until someone figures out how to train a model on ten years of Slack messages, three org chart changes, and those really great whiteboard drawings that got erased in 2021.

I'm not holding my breath.

More to Read

Microsoft Learn: What's New in SQL Server 2025
sqlfingers: Death of the DBA (Again)
sqlfingers: New Year's Resolution for DBAs: Make Peace with Copilot
