Original research · 2026-06-02

English Initial Letter Distribution: Which Letters Generate the Most Words and Confusables (2026)

Not all 26 letters share the vocabulary burden equally. The two distributions below are generated server-side at request time from the live word_letter_counts and confusable_letter_counts tables. They address two separate questions with notably different answers, each driven by distinct structural forces in English morphology.

Two distributions, two questions

This article presents two parallel letter-frequency analyses. The first counts how many English vocabulary entries begin with each letter across the full PlainSpell corpus. The second counts how many confusable word pairs have both members beginning with the same letter. These are distinct linguistic phenomena with different root causes: vocabulary density is driven primarily by productive prefix morphology, while confusable density is driven by the intersection of prefix overlap and short edit distances between real words. A letter that scores high on both charts sits at a genuine high-risk zone for writers, where the vocabulary is large and the words within that vocabulary are difficult to distinguish from each other.

Both queries draw on the PlainSpell English corpus, which covers the full Wiktionary English vocabulary as of the most recent ETL run. Letter counts reflect the first Unicode character of the headword after lowercasing and normalization. All figures are live SSR values at render time.

Chart 1: Top 12 letters by vocabulary entry count

[Chart: English words by initial letter. Top 12 initial letters, ranked by how many English vocabulary entries begin with each. Axis: words.]

Source: PlainSpell · Wiktionary corpus. As of 2026.

Browse the vocabulary behind each letter: s · p · c · m · a · b · t · d · n · h · u · r

Reference query: SELECT letter, cnt FROM word_letter_counts WHERE lang='en' ORDER BY cnt DESC LIMIT 12;

Chart 2: Top 12 letters by confusable pair count

[Chart: Confusable pairs by initial letter. Top 12 initial letters, ranked by how many confusable word pairs begin with each. Axis: confusable pairs.]

Source: PlainSpell · Wiktionary corpus. As of 2026.

Reference query: SELECT letter, cnt FROM confusable_letter_counts WHERE lang='en' ORDER BY cnt DESC LIMIT 12;

Finding 1: Latin and Greek prefixes cluster under 's', 'p', and 'c'

The high entry counts for s, p, and c are not accidental; they reflect the productive prefix morphology that English inherited from Latin and Greek through successive waves of scholarly and ecclesiastical borrowing. Latin sub-, super-, and semi- and Greek syn- and sym- all begin with s; Latin pre-, pro-, per-, and post- and Greek para- all begin with p; and Latin con-, com-, contra-, and circum- and Greek cata- all begin with c. Each of these prefixes attaches to hundreds of stems, generating large families of derived words that share an initial letter. The result is a heavily skewed distribution in which three letters account for a disproportionate share of the total vocabulary. Native Old English roots (the short, high-frequency function words) contribute relatively little to total entry count because they are few in number. The vast Latinate and Greek scientific vocabulary dominates the count, and that vocabulary is prefix-organized under precisely these three letters.

Finding 2: 's', 'c', and 'b' lead confusable pairs for different reasons

The confusable-pair leadership of s, c, and b does not have a single explanation. For s, the cause is similar to its word-count leadership: a large vocabulary under s means more opportunities for edit-distance proximity between distinct entries. Many s-initial words are prefix variants of each other (sub/sup, syn/sym), where the prefix alternation produces a real confusable pair. For c, the con-/com- prefix family creates dense clusters: conform and comfort, contest and context, conscience and conscious are all near-matches generated by morphological productivity under the same prefix. The letter b tells a different story: it owes its rank to a high density of short native English words (be, by, but, bid, bad, bag, ban, bar, bat, bay) that form a tightly interconnected edit-distance network. The short-word effect described in the confusable-pairs research article applies with particular force here: a 3-letter b-word has an unusually large fraction of edit-distance-1 neighbors that are themselves real English words.
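The short-word effect for b can be illustrated with a minimal sketch that counts edit-distance-1 neighbors inside the ten b-words listed above. This is an illustration only, not the PlainSpell pairing algorithm: the function names are hypothetical, and "edit distance 1" is assumed here to mean a single insertion, deletion, or substitution.

```python
from itertools import combinations

def within_one_edit(a: str, b: str) -> bool:
    """True if a and b differ by exactly one insertion, deletion, or substitution."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) word
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1  # skip the shared prefix
    if len(a) == len(b):
        # Substitution: everything after the first mismatch must agree.
        return a[i + 1:] == b[i + 1:]
    # Insertion into a: skipping one character of b must realign the rest.
    return a[i:] == b[i + 1:]

def neighbor_pairs(words):
    """All unordered pairs of distinct words that are one edit apart."""
    return [(x, y) for x, y in combinations(sorted(set(words)), 2)
            if within_one_edit(x, y)]

# The short b-words named in the paragraph above.
B_WORDS = ["be", "by", "but", "bid", "bad", "bag", "ban", "bar", "bat", "bay"]

if __name__ == "__main__":
    pairs = neighbor_pairs(B_WORDS)
    print(len(pairs), "one-edit pairs among", len(B_WORDS), "words")
    for word in B_WORDS:
        degree = sum(word in pair for pair in pairs)
        print(word, degree)
```

Even in this tiny list, the six words of the form ba_ are all mutual neighbors, which is the dense-network effect the paragraph describes.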

Finding 3: The 'un-' and 're-' prefix families boost 'u' and 'r' beyond their baseline

Two letters whose vocabulary counts exceed what letter frequency in ordinary English text would predict are u and r. The productive English prefixes un- (negation of adjectives and verbs: unclear, uneven, unwilling, unreliable) and re- (repetition: rebuild, reconsider, rearrange, redefine) generate large derived-word families under these initial letters. The un- family is particularly large because English applies it to an almost unlimited range of adjectives and past participles, while re- is equally productive with verbs. Neither prefix has an especially close rival: dis- and de- are the nearest competitors for negation and reversal. The u and r counts therefore reflect ongoing morphological productivity within English (un- is native Germanic; re-, though Latin in origin, attaches freely to native verbs) rather than a cluster of words borrowed with their prefixes already attached.
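One rough way to estimate the size of a prefix family is to keep only words whose remainder, after the prefix is stripped, is itself a vocabulary entry. The sketch below assumes that heuristic; it is not drawn from the PlainSpell pipeline. The function name and the toy vocabulary are hypothetical, and the non-derived words (uncle, under, read, red) are included only to show what the check filters out.

```python
def prefix_family(vocabulary: set, prefix: str) -> list:
    """Words that start with `prefix` and whose remaining stem is also a vocabulary entry."""
    return sorted(
        word for word in vocabulary
        if word.startswith(prefix) and word[len(prefix):] in vocabulary
    )

# Toy vocabulary: derived forms from the text, their bases, and a few
# look-alikes that merely happen to begin with "un" or "re".
TOY_VOCABULARY = {
    "clear", "unclear", "even", "uneven",
    "build", "rebuild", "define", "redefine",
    "uncle", "under", "read", "red",
}

if __name__ == "__main__":
    for prefix in ("un", "re"):
        print(prefix + "-", prefix_family(TOY_VOCABULARY, prefix))
```

The stem check is deliberately crude: it misses derived words whose base is absent from the vocabulary and can admit coincidental matches, so it gives an estimate rather than a morphological analysis.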

What this means for spelling-resource coverage and writer guidance

Dictionary editors and spelling-resource designers who aim for comprehensive coverage face an uneven workload: adding entries under s, p, and c requires processing substantially more candidate words than adding entries under x, z, or q. The entry-count distribution shown above is a rough proxy for editorial effort allocation. It also has practical implications for writers. If you are writing in a domain that makes heavy use of Latin-derived technical vocabulary (medicine, law, science, policy), the letters s, p, and c are where your highest-risk misspellings and confusable pairs will cluster. Proofreading strategies that treat all 26 letters as equally likely sites of error are miscalibrated for this reality. A targeted review of s-initial, p-initial, and c-initial technical terms before submission is a higher-ROI activity than a uniform word-by-word check. The full methodology used to compute these distributions, including how the ETL pipeline handles compound words, hyphenated forms, and proper nouns, is documented on the PlainSpell methodology page.

Letter distribution across languages: a brief comparative note

The exact shape of the s/p/c dominance pattern reflects English and its particular Latin-Greek inheritance, and other languages differ from it in degree. French shows a similar but more pronounced s and p bias, reflecting the same prefix families reinforced by Romance inflectional morphology. German's letter distribution is flatter and more spread across the alphabet, because German productive morphology relies heavily on compounding (which distributes first-letter counts more evenly) rather than prefix derivation (which concentrates them). Spanish, like French, shows strong s and p counts. Portuguese is broadly similar to Spanish in distribution shape but with smaller absolute counts due to the corpus coverage gap documented in the Wiktionary coverage analysis. Cross-language letter distributions are available in the per-language browse sections of PlainSpell.

Methodology

The word_letter_counts table is materialized at ETL build time by extracting the first character (lowercased, Unicode-normalized) of every entry in the words table for each language partition and grouping by that character. The confusable_letter_counts table applies the same extraction to the first word in each confusable pair (both members of every pair begin with the same letter by construction in the current pairing algorithm, which filters to within-initial-letter pairs to reduce computational cost). Only ASCII a-z entries are counted; entries beginning with digits, punctuation, or non-ASCII characters are excluded from both tables. The full specification is on the PlainSpell methodology page.
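The extraction and grouping step can be sketched as follows. This is a minimal illustration, not the PlainSpell ETL code: the function names and sample headwords are hypothetical, NFKC is an assumed normalization form (the text says only "Unicode-normalized"), and an in-memory SQLite table stands in for the real word_letter_counts table using the columns that appear in the reference query.

```python
import sqlite3
import unicodedata
from collections import Counter
from typing import Optional

def initial_letter(headword: str) -> Optional[str]:
    """Lowercased, normalized first character, or None if it is not ASCII a-z."""
    normalized = unicodedata.normalize("NFKC", headword).lower()  # NFKC is an assumption
    if not normalized:
        return None
    first = normalized[0]
    return first if "a" <= first <= "z" else None

def build_letter_counts(headwords) -> Counter:
    """Group headwords by initial letter, skipping excluded entries."""
    counts = Counter()
    for headword in headwords:
        letter = initial_letter(headword)
        if letter is not None:
            counts[letter] += 1
    return counts

def top_letters(counts: Counter, lang: str = "en", limit: int = 12):
    """Load counts into a stand-in table and run the article's reference query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE word_letter_counts (lang TEXT, letter TEXT, cnt INTEGER)")
    conn.executemany(
        "INSERT INTO word_letter_counts VALUES (?, ?, ?)",
        [(lang, letter, cnt) for letter, cnt in counts.items()],
    )
    rows = conn.execute(
        "SELECT letter, cnt FROM word_letter_counts WHERE lang=? ORDER BY cnt DESC LIMIT ?",
        (lang, limit),
    ).fetchall()
    conn.close()
    return rows

# Hypothetical sample: "3D" and "éclair" are excluded by the ASCII a-z rule,
# while the capitalized "March" is lowercased into the m bucket.
SAMPLE_HEADWORDS = ["Sub", "super", "semi", "pre", "pro", "contra", "March", "3D", "éclair"]

if __name__ == "__main__":
    print(top_letters(build_letter_counts(SAMPLE_HEADWORDS)))
```

Letters with equal counts come back in no guaranteed order, because the reference query sorts on cnt alone.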

Limitations: Counting only within-initial-letter confusable pairs means the confusable_letter_counts table omits cross-letter confusable pairs entirely (e.g. "cue" vs "queue", which sound alike but begin with different letters), so it undercounts the true number of confusable pairs. The within-letter restriction is a computational simplification. Words beginning with capital letters (proper nouns) are lowercased and included in the same letter bucket, which slightly inflates counts for letters with common proper-noun initials (M for Mary/Mark/March, J for John/June/July).
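The effect of the within-letter restriction can be shown with a small sketch that applies the rule to a list of pairs that have already been identified. It is an illustration only: the function name is hypothetical, the pairs are the examples quoted in this article, and the real pairing algorithm applies the filter before comparison rather than afterwards.

```python
from collections import Counter

def count_within_letter_pairs(pairs):
    """Count pairs whose members share an initial letter; collect the rest separately."""
    counts = Counter()
    cross_letter = []
    for first, second in pairs:
        a, b = first[0].lower(), second[0].lower()
        if a == b:
            counts[a] += 1  # credited to the shared initial letter
        else:
            cross_letter.append((first, second))  # invisible to the per-letter table
    return counts, cross_letter

# Example pairs taken from the article text.
EXAMPLE_PAIRS = [
    ("conform", "comfort"),
    ("contest", "context"),
    ("conscience", "conscious"),
    ("cue", "queue"),
]

if __name__ == "__main__":
    counts, cross_letter = count_within_letter_pairs(EXAMPLE_PAIRS)
    print(dict(counts))
    print(cross_letter)
```

Any pair that lands in the second list is credited to neither letter, which is why the per-letter totals are a lower bound.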

Sources

Source: Wiktionary (English edition) JSONL dump via wiktextract · 2026. Open data under CC BY-SA 4.0.

Source: Bauer, Laurie. English Word-Formation. Cambridge University Press · 1983. Foundational reference on English prefix and suffix morphology.

Source: Crystal, David. The Cambridge Encyclopedia of the English Language, 2nd Edition · 2003. Reference work on English vocabulary history and morphological structure.