You have a customer list. Somewhere in it, 'Globex Corporation' got entered four different ways by four different people at four different times. One typed 'Globx Corporation'. One typed 'Globex Corporaton'. One held the shift key down and gave you 'GLOBEX CORPORATION'. They are all the same company, but your reports are counting them as four different entities.
SQL Server never had a good built-in answer for this. SOUNDEX has been in the box forever, but it was built to match names that sound alike, not to measure how close two typed strings actually are -- a different job entirely. So, we either hand-wrote the matching logic ourselves or pushed the whole problem out to SSIS Fuzzy Lookup or CLR. SQL Server 2025 finally ships the math itself. Four functions, no UDF, no external code. This post is the demo: the old pain, the four new functions, a real dedup query, and the one gotcha that may bite you on day one.
The Old Way: Roll Your Own
The closest thing we had built in was SOUNDEX, and it was never up to this job. It is phonetic, it is English-centric, and it crushes every word down to a four-character code, so collisions are constant. It can tell you two words sound alike. It cannot tell you that 'Globex Corporaton' is one keystroke from 'Globex Corporation'.
-- The tool we used to reach for. It answers a different question than the one you have.
SELECT SOUNDEX('Globex'), SOUNDEX('Globx');
So most shops did one of two things. They hand-wrote a function to count the differences between two strings and ran it across a million rows -- the per-row, parallelism-killing scalar UDF curse I wrote about earlier this month. Or they gave up on T-SQL entirely and moved the work to SSIS or a CLR assembly. Both work, but neither is something you want to manage as an 'extra' in your data layer.
Turn It On First
Two things have to be true before the four new functions can be used. The database has to be at the SQL Server 2025 compatibility level (170), and the preview-features database scoped configuration has to be on.
-- Check where you are SELECT name, compatibility_level FROM sys.databases WHERE name = DB_NAME(); -- Set the 2025 compatibility level if you are not there yet ALTER DATABASE [YourDbName] SET COMPATIBILITY_LEVEL = 170; -- Turn on the preview features for this database ALTER DATABASE SCOPED CONFIGURATION SET PREVIEW_FEATURES = ON;
Skip that last step and SQL Server does not tell you what you missed. It just acts like the function does not exist, because as far as it is concerned, it doesn't:
Msg 195, Level 15, State 10, Line 1 'EDIT_DISTANCE' is not a recognized built-in function name.
If you see that on SQL Server 2025, your first check is whether the preview-features setting is ON.
USE DBA; -- Change to the db where you've enabled it.
SELECT
CASE value
WHEN 1 THEN 'TRUE (Enabled)'
ELSE 'FALSE (Disabled)'
END AS Is_Preview_Features_Enabled
FROM sys.database_scoped_configurations
WHERE name = 'PREVIEW_FEATURES';
Use the ALTER statement above to set it up if it's not enabled. Otherwise, keep moving.
The Four Functions
There are four functions, built on two different algorithms that answer the same question -- how far apart are two strings? There's two because they just measure it differently.
The first algo is Levenshtein, which measures what's called the edit distance, which is the number of single-character changes -- add a letter, remove one, swap one -- it takes to turn one string into the other. 'Globx' into 'Globex' is one edit, just add the missing e. 'Cat' to 'Dog' is three. The smaller that number, the closer the two strings are to matching.
The second algo is Jaro-Winkler. It scores similarity on a different formula, and the part that matters is that it gives extra credit when two strings start the same way. We will see exactly where that pays off in a minute.
Each algorithm comes as a pair. One function hands you the raw distance, the other turns it into a 0-to-100 similarity score.
| Function | Returns | Range |
|---|---|---|
| EDIT_DISTANCE | Edits to turn one string into the other (insert, delete, substitute) |
0 = identical, higher = farther apart |
| EDIT_DISTANCE_SIMILARITY | Similarity score from the edit distance | 0 = no match, 100 = full match |
| JARO_WINKLER_DISTANCE | Distance, weighted toward strings that agree on their opening characters |
0 = identical, 1 = no match |
| JARO_WINKLER_SIMILARITY | Similarity, same prefix weighting | 0 = no match, 100 = full match |
One footnote before we move on. Microsoft's docs call this Damerau-Levenshtein and list transposition as a fourth operation -- until they admit a few lines down that transpositions are not supported yet. So today it is plain Levenshtein: three operations, not four. Test a straight swap like 'AB' against 'BA' and you get 2, not the 1 a real Damerau-Levenshtein would give you.
One other thing to keep straight is that distance and similarity run in opposite directions. A distance of 0 is a perfect match and climbs as the strings differ. A similarity of 100 is a perfect match and falls toward 0. So a good match means a low distance but a high similarity. Get that backwards in a WHERE clause and you will filter out exactly the rows you meant to keep.
A Quick Look at All Four
Here are five pairs run through all four functions at once, so you can see how they disagree.
SELECT T.source_string,
T.target_string,
EDIT_DISTANCE(T.source_string, T.target_string) AS ED_Distance,
JARO_WINKLER_DISTANCE(T.source_string, T.target_string) AS JW_Distance,
EDIT_DISTANCE_SIMILARITY(T.source_string, T.target_string) AS ED_Similarity,
JARO_WINKLER_SIMILARITY(T.source_string, T.target_string) AS JW_Similarity
FROM (VALUES ('Black', 'Red'),
('Colour', 'Yellow'),
('Colour', 'Color'),
('Microsoft', 'Msft'),
('Regex', 'Regex')
) AS T(source_string, target_string);
Look at the 'Microsoft' / 'Msft' row. Edit distance sees five changes and scores it a weak 44 because half the letters are gone. Jaro-Winkler is kinder at 51, because the two share an opening 'M' and the same letters in order. That difference is the whole personality split between the two algorithms, and it is why abbreviations are where they part ways.
The Real Job: Catching Duplicate Customers
The warm-up used literals. Here is the actual problem -- a staging table full of names a human typed, and a known-good master value you want to match them against.
CREATE TABLE dbo.CustomerImport
(
RowID INT IDENTITY(1,1) PRIMARY KEY,
EnteredName NVARCHAR(100) NOT NULL
);
INSERT dbo.CustomerImport (EnteredName)
VALUES (N'Globex Corporation'), -- clean
(N'Globexx Corporation'), -- doubled letter
(N'Globex Corporaton'), -- dropped an i
(N'Globx Corporation'), -- dropped an e
(N'Globex Corp'), -- abbreviation
(N'GLOBEX CORPORATION'); -- shift key never came up
Now score every entered name against the master string and attach a verdict. The similarity score does the work; the CASE expression just turns a number into a decision a human can act on.
SELECT EnteredName,
EDIT_DISTANCE(EnteredName, N'Globex Corporation') AS Edits,
EDIT_DISTANCE_SIMILARITY(EnteredName, N'Globex Corporation') AS Similarity,
CASE
WHEN EDIT_DISTANCE_SIMILARITY(EnteredName, N'Globex Corporation') >= 95
THEN 'Almost certainly the same'
WHEN EDIT_DISTANCE_SIMILARITY(EnteredName, N'Globex Corporation') >= 85
THEN 'Likely the same - review'
WHEN EDIT_DISTANCE_SIMILARITY(EnteredName, N'Globex Corporation') >= 70
THEN 'Possible match'
ELSE 'Probably different'
END AS Verdict
FROM dbo.CustomerImport
ORDER BY Similarity DESC;
The three typos land at 94 and 95 and all get flagged for review or better -- that's the point, a single dropped or doubled letter barely moves the score on an eighteen-character string. But the bottom two rows are also the same company, and both fell through to 'Probably different'. Those two failures are what you need to understand before you trust this in production.
'Globex Corp' scored 61 -- abbreviations are hard for edit distance.
'Globex Corp' is just the company name shortened. But turning 'Globex Corp' into 'Globex Corporation' means adding seven letters, so the edit distance is seven and the score drops to 61. Edit distance only counts the changes; it has no idea the seven missing letters spell a harmless 'oration'. Any abbreviation looks far apart for the same reason. Jaro-Winkler does better here, because the two names start identically, and that is exactly what it rewards. So when your data is full of 'Corp', 'Inc', and 'Ltd', try both and compare the scores.
'GLOBEX CORPORATION' scored 17 -- and that is the gotcha that gets everyone.
The Gotcha: These Functions Are Case-Sensitive
This is the one to circle. The same company name, in capitals, scored 17 out of 100. Not because the algorithm is wrong, but because right now these functions ignore your collation entirely. Microsoft says so plainly in the docs: "the functions don't adhere to the comparison semantics defined by collation settings, such as case insensitivity and other collation-specific rules."
So even on a case-insensitive database, which most of them are, an uppercase letter and its lowercase version count as a change. 'GLOBEX CORPORATION' differs from 'Globex Corporation' in fifteen places, so the engine counts fifteen edits and scores it 17.
-- Same company. Case-sensitive comparison thinks they are barely related.
SELECT EDIT_DISTANCE_SIMILARITY(N'Globex Corporation', N'GLOBEX CORPORATION') AS RawScore;
-- Normalize the case on both sides first, and the truth comes back.
SELECT EDIT_DISTANCE_SIMILARITY(
UPPER(N'Globex Corporation'),
UPPER(N'GLOBEX CORPORATION')) AS NormalizedScore;
The fix is one fast function call away. Wrap both sides in UPPER or LOWER. You need to pick one and stay consistent so that case never enters into the comparison. Until Microsoft teaches these functions to honor collation, do this on every fuzzy comparison you write. Every time.
EDIT_DISTANCE or JARO_WINKLER?
You have two algorithms. They are good at different things.
| Reach for | When |
|---|---|
| EDIT_DISTANCE | Typos, transpositions, dropped or doubled characters -- short, roughly equal-length strings |
| JARO_WINKLER | Names and strings that share a strong start but vary in length, including abbreviations |
Edit distance is literal. It simply counts changes. Great for the 'Corporaton' kind of typo, lousy at the 'Corp' kind, because a long missing tail reads as a pile of edits. Jaro-Winkler forgives a different ending when the beginning matches, which is how human-typed names usually drift. When you are unsure, run both and compare the scores, the way my warm-up query above did. Where they disagree is where your judgment goes.
The One Thing to Read First
Here is the honest version. On SQL Server 2025, fuzzy string matching is still in preview. It ships in the box, it works the moment you flip the database scoped configuration, and you can build and test on it today. But preview is not GA. This is the same caveat I raised on the native JSON data type, since preview behavior can change between cumulative updates.
| Where | Fuzzy matching status |
|---|---|
| SQL Server 2025 (box) | In preview -- requires PREVIEW_FEATURES = ON |
| Azure SQL Database | Generally available |
| Azure SQL Managed Instance | Available on the SQL Server 2025 / Always-up-to-date update policy |
| SQL database in Microsoft Fabric | Generally available |
One more thing, beyond the case issue: these functions run per row. Comparing two literals is free. Comparing a column against a master across a few million rows has real compute costs on every row. Narrow the field first with something cheap, like a same-first-letter filter, a length window, or an existing index, and then let the fuzzy function do its work on what's left. Don't point it at the whole table and just hope it works out.
The Bottom Line
The hand-rolled Levenshtein UDF was never wrong. It was just the best we had, and it dragged a per-row scalar function through every query that used it. SQL Server 2025 puts the math in the engine where it belongs, and the whole pattern collapses to a function call.
Edit Distance and Jaro-Winkler are the two newly introduced built-in fuzzy string-matching functions designed to identify typos, duplicate entries, and approximate text matches. This is a nice uptick from the custom T-SQL scripts we had to write and/or the CLR assemblies we used to achieve the same before SQL Server 2025.
Two things to remember. The functions are case-sensitive today regardless of your collation, so fold the case on both sides before you compare. And edit distance and Jaro-Winkler are not interchangeable. One counts keystrokes, the other forgives a different ending. Get those two right and you can finally answer 'is this the same customer?' within T-SQL.
Oh yeah -- also keep an eye on that preview flag until Microsoft moves it to GA in the box.
More to Read
Microsoft Learn: What is fuzzy string matching?
Microsoft Learn: EDIT_DISTANCE (Transact-SQL)
Microsoft Learn: JARO_WINKLER_SIMILARITY (Transact-SQL)
Microsoft Learn: ALTER DATABASE SCOPED CONFIGURATION (preview features)
sqlfingers inc: Cursed SQL -- Six Queries That Run Fine Until They Don't
sqlfingers inc: Your JSON Column Was Never a JSON Column



No comments:
Post a Comment