Original research · 2026-06-02
The Most Confusable English Word Pairs (2026)
Which English word pairs are most easily confused? The ranking below is generated server-side at request time from the live rankings table, ordered by rank. The top pairs are not obscure vocabulary; they are short, high-frequency function words where a single-character edit produces a word with a completely different grammatical role.
What makes a word pair "confusable"?
In the PlainSpell corpus, two words are classified as a confusable pair when they share a small edit distance (typically one or two character operations: substitution, insertion, deletion, or transposition), belong to the same language, and are both real dictionary entries in Wiktionary. The internal confusability score in the rankings table reflects the algorithmic proximity measure used at ETL time. Rank 1 indicates the pair that scores highest on this measure across the English vocabulary. Importantly, the confusability score is an internal corpus metric, not a direct measure of how often human writers confuse the two words in practice; measuring that would require a separate empirical corpus study.
Our English confusables table contains 529,999 pairs, that is, every algorithmically identified confusable pair across the full PlainSpell English vocabulary. The rankings below surface only the 15 highest-ranked pairs. Understanding why these pairs cluster at the top says something about English orthography and, plausibly, about the cognitive load these words place on writers.
Top 15, ranked by confusability (rank 1 = most confusable)
Reference query: SELECT rank, name, slug, value FROM rankings WHERE type='most_confusable' ORDER BY rank ASC LIMIT 15;
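For readers who want to try the reference query locally, here is a minimal sketch using Python's built-in sqlite3 module. The column types, the in-memory database, and the placeholder rows are all assumptions made for illustration; the public schema only tells us the column names used in the query, and the rows below are not real PlainSpell data.

```python
import sqlite3

# The reference query from the article, unchanged apart from whitespace.
REFERENCE_QUERY = """
SELECT rank, name, slug, value
FROM rankings
WHERE type = 'most_confusable'
ORDER BY rank ASC
LIMIT 15;
"""


def build_demo_db():
    """Create an in-memory table with an ASSUMED schema and placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE rankings "
        "(type TEXT, rank INTEGER, name TEXT, slug TEXT, value REAL)"
    )
    # Placeholder rows only: names, slugs, and values are invented.
    rows = [
        ("most_confusable", 2, "placeholder pair B", "placeholder-b", 0.0),
        ("most_confusable", 1, "placeholder pair A", "placeholder-a", 0.0),
        ("other_ranking", 1, "unrelated row", "unrelated", 0.0),
    ]
    conn.executemany("INSERT INTO rankings VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def top_confusable_pairs(conn):
    """Run the reference query and return the rows as a list of tuples."""
    return conn.execute(REFERENCE_QUERY).fetchall()


for rank, name, slug, value in top_confusable_pairs(build_demo_db()):
    print(rank, name, slug, value)
```

The WHERE clause keeps other ranking types out of the result, and ORDER BY rank ASC puts rank 1 first regardless of insertion order.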
Pattern 1: High frequency amplifies small edit distances
The pairs at the top of the confusability ranking share a consistent characteristic: both words in the pair are extremely common in everyday English. Function words like "that", "this", "they", "the", "with", "their", and "which" are among the most frequent words in running text; "the" alone accounts for several percent of all tokens in typical English corpora. The edit distance between many of these pairs is just one or two character operations, yet a single-character edit changes the grammatical function entirely. Swapping "that" for "this" changes a distal demonstrative to a proximal one; dropping the final letter of "they" turns a subject pronoun into an article. The cognitive cost is not in the character count but in the grammatical pivot. Writers who type quickly and proofread lightly rarely catch these substitutions, precisely because both alternatives are valid English words that pass a dictionary-based spell-check.
Pattern 2: Short words have proportionally large error neighborhoods
A 4-letter word has a much denser real-word edit-distance-1 neighborhood than a 16-letter word. For a word like "that" (4 characters), edit-distance-1 substitutions alone generate 4 × 25 = 100 candidate strings, a noticeable share of which are valid English words. For a 16-character word, the same operation generates 16 × 25 = 400 strings, but a far smaller fraction of them coincide with real words, so most misspellings of long words produce obvious non-words that spell-checkers catch immediately. Short, high-frequency function words sit in a dense real-word neighborhood where a substantial share of single-character edits produce another real word. That density is precisely what the confusability ranking measures, and it explains why the top pairs are uniformly short.
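The neighborhood count can be checked directly: each of the four positions in "that" can take any of the 25 other lowercase letters. The sketch below enumerates those substitution neighbors and intersects them with a word list. The tiny TOY_LEXICON is a hypothetical stand-in chosen for illustration, not the PlainSpell vocabulary.

```python
import string


def substitution_neighbors(word, alphabet=string.ascii_lowercase):
    """Return every string that differs from `word` by one substituted letter."""
    neighbors = set()
    for i, original in enumerate(word):
        for letter in alphabet:
            if letter != original:
                neighbors.add(word[:i] + letter + word[i + 1:])
    return neighbors


def real_word_neighbors(word, lexicon):
    """Return the substitution neighbors of `word` that appear in `lexicon`."""
    return sorted(substitution_neighbors(word) & lexicon)


# Hypothetical toy lexicon, for illustration only.
TOY_LEXICON = {"that", "than", "chat", "what", "hat", "thaw", "this", "them", "then"}

print(len(substitution_neighbors("that")))        # 100
print(real_word_neighbors("that", TOY_LEXICON))   # ['chat', 'than', 'thaw', 'what']
```

Words such as "this" and "then" are two substitutions away from "that", and "hat" requires a deletion, so none of them appear in the substitution-only neighborhood.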
Pattern 3: Grammatical interchangeability compounds the confusion
Many top-ranked pairs are not just phonetically or orthographically similar; they can also slip past the reader in certain sentence contexts. "With" and "wish" belong to different grammatical categories (preposition vs. verb), yet in a frame like "I _ you well" a mistyped "with" is still a dictionary word, so only a check that models the surrounding syntax will flag it. "Their" and "there" are homophones and are both typically unstressed in speech, so neither sound nor a dictionary lookup exposes the swap. "Which" and "witch" differ by only two character edits, but only one is a function word. Grammatical interchangeability acts as a multiplier on the risk implied by the basic edit-distance confusability score: a pair that is both orthographically close and grammatically substitutable represents a higher practical writing risk than an equally close pair where one word could never occupy the same syntactic slot as the other. Spell-checkers that perform contextual grammar checks (like Grammarly's context-aware suggestion engine) handle this dimension better than pure edit-distance engines, but even contextual checkers can miss instances in complex compound sentences.
Implications for writers' tools and plain-language editing
The existence of a dense confusable-pair neighborhood around the most common English function words has direct implications for anyone building writer-assistance tools or working in plain-language editing. Any tool that ranks spell-check suggestions purely by edit distance will tend to surface the wrong candidate for exactly these high-frequency pairs, because several common words sit at the same small distance and distance alone cannot choose between them. A better heuristic weights suggestions by the conditional probability of the intended word given the surrounding context, a Bayesian approach that language models implement naturally. Plain-language editors working on legal, medical, or government documents should pay particular attention to this class of near-miss error: the stakes of confusing "with" and "wish", or "their" and "there", in a binding contract or clinical note are disproportionate to the tiny edit distance involved.
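As a rough illustration of context-weighted ranking, the sketch below scores candidates by a smoothed bigram estimate of P(candidate | previous word). It assumes a uniform typing-error model, so the ranking depends only on context. The function name rank_by_context and the counts in TOY_BIGRAM_COUNTS are invented for this example; a real system would estimate them from a large corpus or use a language model.

```python
def rank_by_context(prev_word, candidates, bigram_counts, smoothing=1.0):
    """Rank candidates by add-k smoothed P(candidate | prev_word).

    Probabilities are normalised over the candidate set only, which is
    enough to order equally distant candidates against each other.
    """
    weights = {
        c: bigram_counts.get((prev_word, c), 0) + smoothing for c in candidates
    }
    total = sum(weights.values())
    scored = [(c, w / total) for c, w in weights.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Invented counts, for illustration only.
TOY_BIGRAM_COUNTS = {
    ("i", "wish"): 40,
    ("i", "with"): 1,
    ("agree", "with"): 50,
}

print(rank_by_context("i", ["with", "wish"], TOY_BIGRAM_COUNTS)[0][0])      # wish
print(rank_by_context("agree", ["with", "wish"], TOY_BIGRAM_COUNTS)[0][0])  # with
```

Both candidates are one substitution apart, so edit distance cannot separate them; the preceding word does.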
The full confusables methodology, including how PlainSpell handles multi-word confusables, homophone overlap with confusable pairs, and cross-language confusability, is documented on the PlainSpell methodology page.
Methodology
Confusable pairs are identified at ETL build time by computing the edit distance between every pair of English vocabulary entries in the PlainSpell words table, using a Levenshtein-based approach that accounts for character transpositions (Damerau-Levenshtein distance). Pairs with a distance at or below the confusability threshold for their combined length are stored in the confusables table. The rankings table materialises the type='most_confusable' ranking at build time and orders pairs by their internal confusability score. For the full algorithm specification, see the PlainSpell Methodology page.
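To make the distance measure concrete, here is a minimal sketch of the optimal string alignment variant of Damerau-Levenshtein distance, which counts substitutions, insertions, deletions, and adjacent transpositions. Which variant PlainSpell uses, and what its length-dependent threshold is, are not stated here, so this is an illustration under that assumption rather than the production algorithm.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "form" <-> "from".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("with", "wish"))    # 1
print(osa_distance("the", "they"))     # 1
print(osa_distance("form", "from"))    # 1
print(osa_distance("their", "there"))  # 2
print(osa_distance("which", "witch"))  # 2
```

Running this over every pair of vocabulary entries is quadratic in vocabulary size, which is one reason to do it once at build time rather than per request.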
Limitations: The confusability score is a corpus-internal algorithmic metric, not an empirically validated measure of human error frequency. Pairs that score highly may not be equally confusing to all writers: expert writers may navigate high-ranked pairs effortlessly while struggling with lower-ranked domain-specific pairs. The ranking reflects orthographic proximity weighted by the ETL pipeline's scoring parameters; different weighting schemes would produce a different ordering. The value column in the rankings table represents an internal score whose exact unit is not specified in the public schema and should not be interpreted as a directly comparable quantity across different ranking types.
Sources
Source: Wiktionary (English edition), JSONL dump via wiktextract · 2026 · Open data under CC BY-SA 4.0.
Source: Damerau, F. J., "A Technique for Computer Detection and Correction of Spelling Errors", CACM 7(3) · 1964 · Original formulation of the edit-distance approach underlying confusable-pair detection.
Source: Norvig, Peter, "How to Write a Spelling Corrector" · 2007 · Canonical practical reference for probabilistic edit-distance spell-checking.