Original research · 2026-05-14

Wiktionary Coverage by Language: How 5 Living Languages Compare in 2026

French Wiktionary contains 4,485,239 entries, more than 113 times as many as Portuguese (39,583). What this skew reveals about open-data lexicons.

Why this comparison matters

Wiktionary is the largest open-data multilingual dictionary on the public web. Its coverage shapes every downstream resource: language-learning apps, spell-checkers, NLP corpora, and the PlainSpell knowledge graph itself. But entry counts vary wildly across language editions, and that skew is rarely surfaced.

The numbers below were queried from the PlainSpell lang_stats snapshot on 2026-05-14. Every figure is reproducible from the underlying Wiktionary JSONL dump used by our ETL pipeline. We sampled five widely spoken European languages (English, Spanish, French, German, and Portuguese), which together have over 1.6 billion native speakers per Ethnologue 2024.

Entry counts at a glance

| Language | Words | Confusables | Homophones | Misspellings | % of total words |
|---|---:|---:|---:|---:|---:|
| French (fr) | 4,485,239 | 440,172 | 21,890 | 467,627 | 64.8% |
| German (de) | 1,077,739 | 2,006,359 | 2,859 | 547,546 | 15.6% |
| Spanish (es) | 770,428 | 323,831 | 812 | 449,955 | 11.1% |
| English (en) | 545,755 | 529,999 | 2,182 | 374,588 | 7.9% |
| Portuguese (pt) | 39,583 | 75,631 | 78 | 134,916 | 0.6% |
| Total (5 languages) | 6,918,744 | 3,375,992 | 27,821 | 1,974,632 | 100% |

Reference query: SELECT lang, word_count, confusable_count, homophone_count, misspelling_count FROM lang_stats ORDER BY word_count DESC;
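
The percentage column can be re-derived from the output of the reference query. The sketch below is an illustration only, not the PlainSpell pipeline: it loads the five rows from the table into an in-memory SQLite database, runs the reference query unchanged, and computes each language's share of the word total. The table schema and column types are assumptions inferred from the column names in the query.

```python
import sqlite3

# Figures copied from the table above. The schema is an assumption
# inferred from the column names used in the reference query.
ROWS = [
    ("fr", 4485239, 440172, 21890, 467627),
    ("de", 1077739, 2006359, 2859, 547546),
    ("es", 770428, 323831, 812, 449955),
    ("en", 545755, 529999, 2182, 374588),
    ("pt", 39583, 75631, 78, 134916),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lang_stats ("
    "lang TEXT PRIMARY KEY, word_count INTEGER, confusable_count INTEGER, "
    "homophone_count INTEGER, misspelling_count INTEGER)"
)
conn.executemany("INSERT INTO lang_stats VALUES (?, ?, ?, ?, ?)", ROWS)

# The reference query from the article, unchanged.
stats = conn.execute(
    "SELECT lang, word_count, confusable_count, homophone_count, "
    "misspelling_count FROM lang_stats ORDER BY word_count DESC"
).fetchall()

# Share of the five-language word total, rounded to one decimal place.
total_words = sum(row[1] for row in stats)
shares = {row[0]: round(100 * row[1] / total_words, 1) for row in stats}

for lang, pct in shares.items():
    print(f"{lang}: {pct}% of {total_words:,} entries")
```

Run against the real snapshot, the same two steps (query, then divide by the sum) should reproduce the shares shown in the table.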

Finding 1: French dominates entry count

French Wiktionary holds 4,485,239 entries, roughly 65% of all entries across the five languages we track. This single dataset is larger than the next four combined. Three structural reasons explain the gap:

  • Active French Wiktionary editor community. The French Wiktionary statistics page reports the project crossed 5 million entries in 2024, earlier than the English edition.
  • Verb-conjugation entries. French verbs are highly inflected. The fr.wiktionary project policy is to create a separate page for each conjugated form (e.g. "parle", "parles", "parlons"), which inflates entry counts dramatically compared to dictionaries that consolidate forms under a single lemma.
  • Inclusion of obscure/archaic vocabulary. The French project has historically been more permissive about including rare and historical words, expanding total volume.

Finding 2: Portuguese is dramatically under-represented

Portuguese has only 39,583 entries, just 0.6% of the five-language total. Yet Portuguese is the 5th-most-spoken language in the world by native speakers (~232 million per Ethnologue 2024). The disconnect between speaker base and digital lexicon coverage is striking.

This is consistent with broader research on digital language inequality. The Wiktionary cross-edition statistics show Portuguese ranks 12th by entry count despite ranking 5th by speaker population, a structural under-resourcing pattern documented in multiple studies of open-language data (e.g. Joshi et al., "The State and Fate of Linguistic Diversity and Inclusion in the NLP World", ACL 2020).

Finding 3: Confusables vs. words ratio reveals editor focus

The number of confusable pairs per word entry tells a different story than raw entry count. Portuguese has the highest confusable density at 1.91 pairs per word entry, though on by far the smallest base (39,583 entries). German follows at 1.86, far above French (0.10). This reflects German's compounding-heavy morphology: long compound words spawn many near-duplicates differing by single letters. English comes third at 0.97 pairs per word, driven by its large stock of phonetically similar Latinate vs. Germanic doublets, and Spanish sits at 0.42.
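
The density figures can be recomputed from the Words and Confusables columns in "Entry counts at a glance". The following is a small sketch under that assumption; the function name confusable_density is illustrative and not part of any PlainSpell API.

```python
# (word entries, confusable pairs) per language, copied from the table
# in "Entry counts at a glance".
COUNTS = {
    "fr": (4_485_239, 440_172),
    "de": (1_077_739, 2_006_359),
    "es": (770_428, 323_831),
    "en": (545_755, 529_999),
    "pt": (39_583, 75_631),
}


def confusable_density(words: int, pairs: int) -> float:
    """Return confusable pairs per word entry (illustrative helper)."""
    return pairs / words


density = {
    lang: round(confusable_density(words, pairs), 2)
    for lang, (words, pairs) in COUNTS.items()
}

# Print languages from highest to lowest density.
for lang, value in sorted(density.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang}: {value:.2f} pairs per word entry")
```

Because the denominator is the entry count, a small edition can show a high density on very few entries, so the ratio is best read alongside the raw counts.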

What this means for word-data research

Researchers building cross-language lexical resources should weight findings carefully. A "most-misspelled word" ranking computed on French-Wiktionary data covers a far larger entry universe than one computed on Portuguese, making naive cross-language comparisons misleading. PlainSpell rankings always disclose the underlying entry count to make this skew visible.

The misspelling-to-word ratio as a coverage quality signal

Raw entry counts tell only part of the story. The misspelling count per word entry is a more nuanced signal of how deeply each language's Wiktionary edition has been processed. In the PlainSpell corpus, Portuguese generates 3.41 misspelling variants per word entry, while French generates 0.10 per entry and German 0.51 per entry. A higher ratio does not necessarily mean more error-prone vocabulary; it often means more of the vocabulary has been through the full algorithmic misspelling pipeline, which requires a clean and complete word entry to begin with. Languages with sparse or inconsistently formatted entries generate fewer misspellings not because their words are simpler, but because the generator cannot produce variants for entries that lack a clean headword string. This makes the misspelling rate a secondary diagnostic for Wiktionary data quality, on top of its primary function as a spell-checker signal. The full pipeline methodology used to produce these counts is described on the PlainSpell methodology page, including the cleaning steps applied to raw Wiktionary headwords before variant generation begins.
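
To see how the per-entry ratio reorders the languages relative to raw misspelling counts, the two rankings can be compared side by side. This is a minimal sketch using only the Words and Misspellings columns from the table; the variable names are illustrative.

```python
# (word entries, misspelling variants) per language, copied from the table
# in "Entry counts at a glance".
COUNTS = {
    "fr": (4_485_239, 467_627),
    "de": (1_077_739, 547_546),
    "es": (770_428, 449_955),
    "en": (545_755, 374_588),
    "pt": (39_583, 134_916),
}

# Misspelling variants per word entry, rounded to two decimals.
ratio = {lang: round(miss / words, 2) for lang, (words, miss) in COUNTS.items()}

# Ranking by raw misspelling count vs. ranking by misspellings per entry.
by_raw = sorted(COUNTS, key=lambda lang: COUNTS[lang][1], reverse=True)
by_ratio = sorted(COUNTS, key=lambda lang: ratio[lang], reverse=True)

print("by raw count:", by_raw)
print("by ratio:    ", by_ratio)
print("ratios:      ", ratio)
```

The two orderings differ substantially, which is why the ratio is treated here as a separate diagnostic rather than a restatement of the raw counts.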

Why coverage gaps matter beyond language technology

Wiktionary's coverage skew has consequences that extend well beyond academic NLP research. Consumer spell-checkers, educational apps, and translation tools all draw on open lexical datasets, many of which use Wiktionary as either a primary source or a validation layer. When Portuguese is under-represented by a factor of 113 relative to French, every downstream tool inherits that gap. A student writing in Portuguese using a Wiktionary-backed spell-checker receives fewer candidate suggestions per misspelled word, not because Portuguese has fewer correct spellings, but because the open-data infrastructure behind it is less complete. The same dynamic plays out for Spanish, which has a far larger global speaker base than French yet holds fewer Wiktionary entries in our five-language snapshot. Closing the coverage gap is therefore a form of digital infrastructure work with real educational equity implications, and community-driven Wiktionary contribution campaigns in under-represented languages have measurable downstream effects on tool quality. The entry counts shown in this article will be refreshed with each annual PlainSpell corpus update to track progress over time.

Methodology

Word counts are pulled from the PlainSpell lang_stats table on 2026-05-14, which is materialized at ETL build time by counting rows in the words table grouped by language code. The underlying source is the public Wiktionary JSONL dump processed by wiktextract (Tatu Ylönen, 2024). Confusable, homophone, and misspelling counts are derived by algorithmic pairing within each language partition. No cross-language pairs are included. All counts exclude redirects and stub entries.
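
The materialization step described above can be sketched as follows. This is a toy illustration under stated assumptions, not the actual ETL code: the two-column layout of the words table is hypothetical, the sample rows are invented for the example, and the redirect and stub filtering mentioned above is not modeled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical minimal `words` table: one row per entry, tagged with a
# language code. The real table's columns are not documented here.
conn.execute("CREATE TABLE words (lang TEXT NOT NULL, word TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO words VALUES (?, ?)",
    [
        ("fr", "parle"),
        ("fr", "parles"),
        ("fr", "parlons"),
        ("pt", "falar"),
        ("de", "sprechen"),
    ],
)

# Materialize per-language counts once, as a build step would,
# by counting rows in `words` grouped by language code.
conn.execute(
    "CREATE TABLE lang_stats AS "
    "SELECT lang, COUNT(*) AS word_count FROM words GROUP BY lang"
)

lang_stats = dict(conn.execute("SELECT lang, word_count FROM lang_stats"))
print(lang_stats)
```

Note that the three French rows are inflected forms of one verb, each counted as its own entry, which is the same effect the limitations below describe.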

Limitations: Wiktionary entry counts conflate inflected forms (verbs, plurals) with lemmas in different ways across language editions, which inflates French and German totals relative to dictionaries that consolidate under lemmas. Speaker-population figures for the "speakers vs. coverage" framing are from Ethnologue 2024 (200 most-spoken languages).

Sources

Source: Wiktionary (multilingual) JSONL dump via wiktextract · 2026 · Open data under CC BY-SA 4.0.

Source: Ethnologue 2024, Languages of the World · Top 200 languages by speakers · 2024

Source: Joshi et al., "The State and Fate of Linguistic Diversity and Inclusion in the NLP World" · ACL 2020 · Foundational study on digital language inequality.