Data Sources & Licenses

I built Diachronica on top of several open linguistic datasets. Every analysis you run surfaces data from one or more of them. Their authors did the work; this page names them and their licenses so attribution travels with the data.

Data sources used by Diachronica's etymology analyzer
Source | License | What I use it for
Wiktionary | CC BY-SA 4.0 | Etymology text, definitions, pronunciations, related words
Glottolog | CC BY 4.0 | Language families, tree structure, coordinates (~8,500 languoids)
Lexibank | CC BY 4.0 (per dataset) | Expert-annotated cognate sets (4,981 sets, 25,741 members). Cards on the etymology page mark Lexibank-sourced entries with a ★ badge.
IE-CoR | CC BY 4.0 | Indo-European Cognate Relationships: 25,731 lexemes across 160 languages with LIV²/NIL references. Distributed via Lexibank.
WOLD | CC BY 4.0 | World Loanword Database: documented borrowings with source language and confidence. Powers the loanword badges (➜ glyph) and the "borrowed from X" pills on headwords and cognates. Haspelmath, Martin & Tadmor, Uri (eds.) 2009. WOLD. Leipzig: Max Planck Institute for Evolutionary Anthropology.
ASJP | CC BY 4.0 | Automated Similarity Judgment Program: 40 Swadesh-list concepts surveyed across 11,540 languages. Powers the "attested in N languages worldwide" coverage stat. Wichmann, Søren, Eric W. Holman & Cecil H. Brown (eds.) 2022. The ASJP Database (version 20).
CLICS³ | CC BY 4.0 | Database of Cross-Linguistic Colexifications: which concepts share a single word across language families. Powers the "in other languages this word also means..." section. Rzymski, Tresoldi et al. 2020. The Database of Cross-Linguistic Colexifications, reproducible analysis of cross-linguistic polysemies.
PHOIBLE | CC BY-SA 3.0 | Phoneme inventories (via the sister /api/linguistic/ endpoints)
WALS | CC BY 4.0 | Typological features (via /api/linguistic/)
ISO 639-3 | Open (SIL attribution) | Canonical three-letter language codes
COCA | Licensed (paid) | Modern American English word frequency, 1990–2019. Licensed from Mark Davies / english-corpora.org; redistribution is restricted, and summary statistics are surfaced here.
Wikimedia Commons | Varies (mostly CC BY-SA) | Illustrative images; each hero caption links back to the file page with its specific license

A note on share-alike

Wiktionary and PHOIBLE are released under share-alike licenses: anything I ship that meaningfully incorporates them inherits the same license terms. That means the etymology graphs and data tables you see here are redistributable under CC BY-SA: take them, build on them, credit the source, and share what you build under the same terms.
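
As a rough illustration of the share-alike rule, here is a minimal Python sketch that checks whether an output built from a given set of sources picks up a share-alike obligation. The dictionary and function names are hypothetical and are not Diachronica's actual code; the license strings are copied from the table on this page, and the check is a simplification, not legal advice.

```python
# Hypothetical lookup table; license strings copied from the table on this page.
SOURCE_LICENSES = {
    "Wiktionary": "CC BY-SA 4.0",
    "Glottolog": "CC BY 4.0",
    "Lexibank": "CC BY 4.0",
    "WOLD": "CC BY 4.0",
    "PHOIBLE": "CC BY-SA 3.0",
    "WALS": "CC BY 4.0",
}


def share_alike_sources(sources):
    """Return, sorted by name, the sources whose license has a share-alike (SA) clause."""
    return sorted(s for s in sources if "-SA" in SOURCE_LICENSES[s])


def inherits_share_alike(sources):
    """True if at least one contributing source is under a share-alike license."""
    return bool(share_alike_sources(sources))


# An etymology graph drawing on Wiktionary and Glottolog inherits share-alike terms.
print(share_alike_sources(["Glottolog", "Wiktionary"]))
print(inherits_share_alike(["Glottolog", "WALS"]))
```

The sketch only flags whether a share-alike source is involved; whether a given output "meaningfully incorporates" that source is a judgment the code does not attempt to make.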

Something missing?

If I surfaced data from a source not named here, or if an attribution needs fixing, let me know. Reach out to luke@lukesteuber.com or open an issue on the code at github.com/lukeslp/diachronica.