close
Skip to content

Pipelines

This page reflects the CLTK's pre-defined language Pipelines.

Stanza

Stanza models (for moprhology and syntax labeling) are available for the following languages.

Current Stanza pipeline map (generated from `src/cltk/languages/pipelines.py`)
MAP_LANGUAGE_CODE_TO_STANZA_PIPELINE: dict[str, type[Pipeline]] = {
    # Seed a few languages where Stanza has robust models
    "lati1261": LatinStanzaPipeline,
    "anci1242": AncientGreekStanzaPipeline,
    "chur1257": ChurchSlavonicStanzaPipeline,
    "oldf1239": OldFrenchStanzaPipeline,
    "goth1244": GothicStanzaPipeline,
    "lite1248": LiteraryChineseStanzaPipeline,
    "olde1238": OldEnglishStanzaPipeline,
    "otto1234": OttomanTurkishStanzaPipeline,
    "clas1256": ClassicalArmenianStanzaPipeline,
    "copt1239": CopticStanzaPipeline,
    "oldr1238": OldRussianStanzaPipeline,  # Old East Slavic
}

Generative AI

The CLTK has defined Pipeline for the following languages. These may be invoked by any generative LLM backend (i.e., "openai", "mistral", "ollama").

Current generative pipeline map (generated from `src/cltk/languages/pipelines.py`)
MAP_LANGUAGE_CODE_TO_GENERATIVE_PIPELINE: dict[str, type[Pipeline]] = {
    # Indo-European family
    ## Italic
    "lati1261": LatinGenAIPipeline,
    "oldf1239": OldFrenchGenAIPipeline,
    "midd1316": MiddleFrenchGenAIPipeline,
    # Other Romance languages
    ## Hellenic
    "anci1242": AncientGreekGenAIPipeline,
    # Mycenaean Greek (Linear B tablets, ca. 1400–1200 BCE).
    # Medieval/Byzantine Greek
    "oldi1245": EarlyIrishGenAIPipeline,
    "oldw1239": OldMiddleWelshGenAIPipeline,
    "bret1244": MiddleBretonGenAIPipeline,
    "corn1251": MiddleCornishGenAIPipeline,
    ## Germanic
    # Proto-Norse
    "goth1244": GothicGenAIPipeline,
    "oldh1241": OldHighGermanGenAIPipeline,
    "midd1343": MiddleHighGermanGenAIPipeline,
    "oldn1244": OldNorseGenAIPipeline,
    "olde1238": OldEnglishGenAIPipeline,
    "midd1317": MiddleEnglishGenAIPipeline,
    ## Balto-Slavic
    "chur1257": ChurchSlavonicGenAIPipeline,
    "prus1238": OldPrussianGenAIPipeline,
    "lith1251": LithuanianGenAIPipeline,
    "latv1249": LatvianGenAIPipeline,
    "gheg1238": AlbanianGenAIPipeline,
    ## Armenian, Earliest texts: 5th c. CE (Bible translation by Mesrop Mashtots, who created the script)
    "clas1256": ClassicalArmenianGenAIPipeline,
    "midd1364": MiddleArmenianGenAIPipeline,
    # Note this is only a parent, not true languoid
    ## Anatolian
    "hitt1242": HittiteGenAIPipeline,
    "cune1239": CuneiformLuwianGenAIPipeline,
    "hier1240": HieroglyphicLuwianGenAIPipeline,
    "lyci1241": LycianAGenAIPipeline,
    "lydi1241": LydianGenAIPipeline,
    "pala1331": PalaicGenAIPipeline,
    "cari1274": CarianGenAIPipeline,
    ## Tocharian
    "tokh1242": TocharianAGenAIPipeline,
    "tokh1243": TocharianBGenAIPipeline,
    ## Indo-Iranian
    ## Iranian languages
    ### SW Iranian
    "oldp1254": OldPersianGenAIPipeline,
    "pahl1241": MiddlePersianGenAIPipeline,
    ### NW Iranian
    "part1239": ParthianGenAIPipeline,
    ### E Iranian
    "aves1237": AvestanGenAIPipeline,
    "bact1239": BactrianGenAIPipeline,
    "sogd1245": SogdianGenAIPipeline,
    "khot1251": KhotaneseGenAIPipeline,
    "tums1237": TumshuqeseGenAIPipeline,
    # Indo-Aryan (Indic): Sanskrit (Vedic & Classical), Prakrits, Pali, later medieval languages (Hindi, Bengali, etc.)
    ## Old Indo-Aryan
    "vedi1234": VedicSanskritGenAIPipeline,
    "clas1258": ClassicalSanskritGenAIPipeline,
    # Prakrits (Middle Indo-Aryan, ca. 500 BCE–500 CE)
    "pali1273": PaliGenAIPipeline,
    # Ardhamāgadhī, Śaurasenī, Mahārāṣṭrī, etc. — languages of Jain/Buddhist texts and early drama.
    # ? Glotto says alt_name for Pali; Ardhamāgadhī, literary language associated with Magadha (eastern India); Jain canonical texts (the Āgamas) are written primarily in Ardhamāgadhī
    "saur1252": SauraseniPrakritGenAIPipeline,
    "maha1305": MaharastriPrakritGenAIPipeline,
    "maga1260": MagadhiPrakritGenAIPipeline,
    "gand1259": GandhariGenAIPipeline,  ## Middle Indo-Aryan
    # "Maithili": "mait1250"; Apabhraṃśa; "Apabhramsa" is alt_name; (500–1200 CE); Bridges Prakrits → New Indo-Aryan
    ## New Indo-Aryan
    ## Medieval languages (~1200 CE onward):
    # Early forms of Hindi, Bengali, Gujarati, Marathi, Punjabi, Oriya, Sinhala, etc
    # North-Western / Hindi Belt
    "hind1269": HindiGenAIPipeline,
    "khad1239": KhariBoliGenAIPipeline,
    "braj1242": BrajGenAIPipeline,
    "awad1243": AwadhiGenAIPipeline,
    "urdu1245": UrduGenAIPipeline,
    # Eastern Indo-Aryan
    "beng1280": BengaliGenAIPipeline,
    "oriy1255": OdiaGenAIPipeline,
    "assa1263": AssameseGenAIPipeline,
    # Western Indo-Aryan
    "guja1252": GujaratiGenAIPipeline,
    "mara1378": MarathiGenAIPipeline,
    # Southern Indo-Aryan / adjacency
    "sinh1246": SinhalaGenAIPipeline,
    # Northwestern frontier
    "panj1256": EasternPanjabiGenAIPipeline,
    "sind1272": SindhiGenAIPipeline,
    "kash1277": KashmiriGenAIPipeline,
    "bagr1243": BagriGenAIPipeline,
    # Afroasiatic family
    ## Semitic languages
    ### East Semitic
    "akka1240": AkkadianGenAIPipeline,
    # Eblaite
    ### West Semitic
    "ugar1238": UgariticGenAIPipeline,
    "phoe1239": PhoenicianGenAIPipeline,
    "moab1234": MoabiteGenAIPipeline,
    "ammo1234": AmmoniteGenAIPipeline,
    "edom1234": EdomiteGenAIPipeline,
    "anci1244": BiblicalHebrewGenAIPipeline,
    # Medieval Hebrew: No Glottolog
    # "moab1234": Moabite
    # "ammo1234": Ammonite
    # "edom1234": Edomite
    # Old Aramaic (ca. 1000–700 BCE, inscriptions).
    # "olda1246": "Old Aramaic (up to 700 BCE)",
    # "Old Aramaic-Sam'alian": "olda1245"
    "impe1235": ImperialAramaicGenAIPipeline,
    "olda1246": OldAramaicGenAIPipeline,
    "olda1245": OldAramaicSamalianGenAIPipeline,
    "midd1366": MiddleAramaicGenAIPipeline,
    "clas1253": ClassicalMandaicGenAIPipeline,
    "hatr1234": HatranGenAIPipeline,
    "jewi1240": JewishBabylonianAramaicGenAIPipeline,
    "sama1234": SamalianGenAIPipeline,
    # "midd1366": Middle Aramaic (200 BCE – 700 CE), includes Biblical Aramaic, Palmyrene, Nabataean, Targumic Aramaic.
    # Eastern Middle Aramaic
    ##  Classical Mandaic, Hatran, Jewish Babylonian Aramaic dialects, and Classical Syriac
    "clas1252": ClassicalSyriacGenAIPipeline,
    ### NW Semitic
    ## South Semitic
    # Old South Arabian (OSA)
    "geez1241": GeezGenAIPipeline,
    ### Central Semitic (bridge between NW and South)
    # Pre-Islamic Arabic
    "clas1259": ClassicalArabicGenAIPipeline,  # Dialect
    # Glotto doesn't have medieval arabic; Medieval Arabic: scientific, philosophical, historical works dominate much of the Islamic Golden Age corpus.
    ## Egyptian languages
    "olde1242": OldEgyptianGenAIPipeline,
    "midd1369": MiddleEgyptianGenAIPipeline,
    "late1256": LateEgyptianGenAIPipeline,
    "demo1234": DemoticGenAIPipeline,
    "copt1239": CopticGenAIPipeline,
    ## Berber
    "numi1241": NumidianGenAIPipeline,
    "tait1247": TaitaGenAIPipeline,
    ## Chadic
    # ; "haus1257": "Hausa"; Hausa; Essentially oral until medieval period, when Hausa is written in Ajami (Arabic script).
    "haus1257": HausaGenAIPipeline,
    "lite1248": LiteraryChineseGenAIPipeline,
    "clas1254": ClassicalTibetanPipeline,
    # Sino-Tibetan family
    # | **Early Vernacular Chinese (Baihua)**   | ca. 10th – 18th c. CE | *(under `clas1255`)* |
    # | **Old Tibetan**                         | 7th – 10th c. CE     | *(not separately coded)* |
    "oldc1244": OldChineseGenAIPipeline,
    "midd1344": MiddleChineseGenAIPipeline,
    "clas1255": BaihuaChineseGenAIPipeline,
    "oldb1235": OldBurmeseGenAIPipeline,
    "nucl1310": ClassicalBurmeseGenAIPipeline,
    "tang1334": TangutGenAIPipeline,
    "newa1246": NewarGenAIPipeline,
    "mani1292": MeiteiGenAIPipeline,
    "sgaw1245": SgawKarenGenAIPipeline,
    # Mongolic family
    "mong1329": MiddleMongolGenAIPipeline,
    "mong1331": ClassicalMongolianGenAIPipeline,  #  TODO: No glottolog broken
    "mogh1245": MogholiGenAIPipeline,
    # Altaic-Adj.
    "jurc1239": OldJurchenGenAIPipeline,
    # Japonic
    "japo1237": OldJapaneseGenAIPipeline,
    # Uralic
    "oldh1242": OldHungarianGenAIPipeline,
    # Turkic
    "chag1247": ChagataiGenAIPipeline,
    "oldu1238": OldTurkicGenAIPipeline,
    # TODO: Make pipeline for Ottoman Turkish
    # "otto1234": OttomanTurkishGenAIPipeline,
    # Dravidian
    "oldt1248": OldTamilGenAIPipeline,
    # Pre-Modern Literate Language Families (Non-Euro/Afroasiatic/Sino-Tibetan/Mongolic)
    # | Family         | Language / Stage           | Approx. Period      | Glottocode     |
    # |----------------|-----------------------------|----------------------|----------------|
    # | Dravidian      | Old Tamil                   | ca. 300 BCE–300 CE   | `oldt1248`     |
    # |                | Middle Tamil                | medieval             | *(not coded)*  |
    # |                | Old Kannada                 | from 5th c. CE       | *(not coded)*  |
    # |                | Old Telugu                  | from 6th c. CE       | *(not coded)*  |
    # |                | Old Malayalam               | from 13th c. CE      | *(not coded)*  |
    # | Turkic         | Old Turkic                  | 8th–10th c. CE       | `oldu1238`     |
    # |                | Chagatai                    | 15th–18th c. CE      | `chag1247`     |
    # | Uralic         | Old Hungarian               | 12th–13th c. CE      | `oldh1242`     |
    # | Koreanic       | Old Korean                  | 7th–10th c. CE       | *(not coded)*  |
    # |                | Middle Korean               | 15th c. onward       | *(not coded)*  |
    # | Japonic        | Old Japanese                | 8th c. CE            | `japo1237`     |
    # | Altaic-Adj.    | Old Jurchen                 | 12th–13th c. CE      | *`jurc1239`*  |
    # |                | Manchu                      | 17th–18th c. CE      | *(not coded)*  |
    # | Austroasiatic  | Old Mon                     | from 6th c. CE       | *(not coded)*  |
    # |                | Old Khmer                   | from 7th c. CE       | *(not coded)*  |
    # | Austronesian   | Old Javanese (Kawi)         | from 8th c. CE       | *(not coded)*  |
    # |                | Classical Malay             | from 7th c. CE onward| *(not coded)*  |
    # | Tai–Kadai      | Old Thai                    | from 13th c. CE      | *(not coded)*  |
}

User-Defined Pipelines

See User-Defined Pipelines for documentation on how to create a custom Pipeline.