I read an interview yesterday on DataStoryteller with Kelly, a data scientist and author of A Friendly Guide to Data Science. They covered a lot of ground — career paths, soft skills, ethics — but one line got my attention instantly:
"Generative AI is powerful but overhyped — AI cannot replace data quality, context, or human judgment."
That's it! That's the whole thing I keep bumping into. I want to try to show you exactly what Kelly means.
Let's Set the Stage
Open SSMS, connect to your working database, and run this to create some sample data:
CREATE TABLE dbo.Orders (
OrderID INT IDENTITY(1,1),
CustomerName VARCHAR(100),
OrderTotal DECIMAL(10,2),
OrderDate DATE
);
INSERT INTO dbo.Orders (CustomerName, OrderTotal, OrderDate) VALUES
('Acme Corp', 15000.00, '2024-02-15'),
('ACME Corporation', 22000.00, '2024-03-10'),
('Acme Corp.', 8500.00, '2024-05-22'),
('ACME CORP', 31000.00, '2024-07-08'),
('Globex Industries', 45000.00, '2024-01-20'),
('Globex Industries', 52000.00, '2024-06-14'),
('Initech', 12000.00, '2024-04-03'),
('DO NOT USE - TEST', 99999.00, '2024-08-01'),
('Test Customer', 50000.00, '2024-09-15'),
(NULL, 18000.00, '2024-02-28'),
(NULL, 23000.00, '2024-05-11'),
(NULL, 7500.00, '2024-10-05');
Now open the GitHub Copilot Chat window (View > GitHub Copilot Chat) and ask it:
"Write me a query for total revenue by customer for 2024 from dbo.Orders"
Here's my Copilot's answer:
Look at that response. Copilot explored the database, confirmed the table exists and validated the syntax. It wrote sophisticated code with ISNULL(NULLIF(LTRIM(RTRIM(...)))) to handle NULLs and empty strings. It added a @TopN variable I didn't ask for, and even threw in an OrderCount for good measure.
But look at the bottom. A polite acknowledgement that name variants like "Acme Corp" and "ACME CORPORATION" might need normalization - but it doesn't actually fix the problem. It just mentions it.
Let's run it and see what we get:
-- Created by GitHub Copilot in SSMS - review carefully before executing
DECLARE @TopN INT = 10;
SELECT TOP (@TopN) WITH TIES
ISNULL(NULLIF(LTRIM(RTRIM(CustomerName)), ''), '(Unknown)') AS CustomerName,
SUM(OrderTotal) AS TotalRevenue,
COUNT(*) AS OrderCount
FROM dbo.Orders
WHERE OrderDate >= '20240101' AND OrderDate < '20250101'
GROUP BY ISNULL(NULLIF(LTRIM(RTRIM(CustomerName)), ''), '(Unknown)')
ORDER BY TotalRevenue DESC;
The Problem is Hiding in Plain Sight
Look at row 1. Your #1 customer by revenue is "DO NOT USE - TEST" at $99,999. And row 3? "Test Customer" at $50,000. Two test accounts in your top 3.
Now look at rows 5, 6, and 8: "ACME CORP", "ACME Corporation", "Acme Corp." — the same customer split across three rows, their combined $76,500 fragmented into pieces... as if they were three different customers.
Let's verify:
-- How many versions of "Acme" do we have?
SELECT DISTINCT CustomerName
FROM dbo.Orders
WHERE CustomerName LIKE '%Acme%';
-- How much revenue is hiding in NULL customers?
SELECT COUNT(*) AS NullCustomerOrders,
SUM(OrderTotal) AS MissingRevenue
FROM dbo.Orders
WHERE CustomerName IS NULL;
-- How much 'fake' revenue from test accounts?
SELECT CustomerName, SUM(OrderTotal) AS TotalRevenue
FROM dbo.Orders
WHERE CustomerName LIKE '%TEST%' OR CustomerName LIKE '%DO NOT%'
GROUP BY CustomerName;
Three versions of Acme. $48,500 in NULL revenue (which Copilot did handle, to be fair), and $149,999 in test account revenue polluting the results.
The Point
The AI wrote syntactically perfect SQL. It even tried to be clever with NULL handling, AND it acknowledged the duplicate name problem in its response. Yet it still produced a report that's quietly, confidently wrong — because it doesn't know your data the way you do. It can't attend your Monday meetings. It doesn't know that Karen in accounting enters everything in ALL CAPS. It is trusting everything blindly.
Kelly nailed this one too: "Garbage in, garbage out is very common — most real-world data is not analysis-ready."
The bottleneck was never writing the query. The bottleneck comes with understanding what the data actually means.
The Bottom Line
Use Copilot. Use ChatGPT. These tools are genuinely useful — they reduce friction, eliminate syntax errors and save a lot of time.
But before you hit send, run a few sanity checks. CHECK it before you email it to your customer. Poke at the data. Look for dupes, NULLs, and test records. Garbage in = garbage out has been true since the dawn of time — and it will ALWAYS be true. If you're not cautious, AI just helps you push that garbage out faster.
AI doesn't know what your data means.
You do.
More to Read: