Linguistics

Open-source projects categorized as Linguistics

Top 23 Linguistic Open-Source Projects

  • tatoeba2

    Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.

  • Project mention: The AI Revolution Is Crushing Thousands of Languages | news.ycombinator.com | 2024-04-25

    Alternate take, it can also help people learn niche languages if native speakers contribute to data sets. For example, I've been using Clozemaster for the past few months as a way to work on vocabulary on some languages, and they pull their dataset from Tatoeba [1]. I was very surprised to see that my father's native language, Kabylie, which is admittedly a somewhat niche language, is one of the top languages by sentence contribution in the dataset (over 700k entries, more than French or Spanish or German). I showed him the sentences once and he confirmed that yes, they all seem like what a native speaker would say. Not all of them have translations into other languages of course, and a lot of them are slight variations on each other, but some native speakers are there contributing. It's not currently an option to use in Clozemaster -- I'm guessing the TTS isn't really there -- but I totally could see these as gaps that are easily filled.

    Same with my wife's native language (Bengali). There are surprisingly few language learning resources for Bangla, even though it's the 7th most spoken language in the world. But there it is in the data set with TTS and the ability for Clozemaster to have ChatGPT "explain" what's going on in the sentence (a very useful feature for new speakers).

    Anyway, I don't view AI as good or bad, just another tool that we should be intentional about when we cultivate the data sets underlying the tool.

    [1] https://tatoeba.org

  • ipa-dict

    Monolingual wordlists with pronunciation information in IPA

  • rime-cantonese

    Rime Cantonese input schema | 粵語拼音輸入方案

  • Project mention: How to type Jyutcitzi? 【RIME keyboard installation manual】? | /r/CantoneseScriptReform | 2023-12-07

    Please follow instructions at https://github.com/rime/rime-cantonese/wiki and https://github.com/rime/rime-cantonese/wiki/新手安裝教程 In a nutshell, download and install using the following files: Mac: mac-2021.05.16-installer.pkg Windows: windows-sfx-2021.05.16-installer.exe Linux: Download and run ibus-install.sh Please check to ensure that RIME Cantonese is properly installed before proceeding to Step 3.

  • awesome-linguistics

    A curated list of anything remotely related to linguistics

  • Project mention: Untranslatable | news.ycombinator.com | 2024-01-26

    Very interesting they were funded from Kickstarter! 292 backers at 10k€. I assumed you needed quite the following for Kickstarter to work...

    And it looks like they do. 49k followers on Facebook and 16k on Instagram. Not sure how far back these go, but looks like very "shareable" content, where they would take untranslatable words and make little funny pictures or memes or other intriguing things and post them. Lots of interaction, comments/reaction-wise.

    Timeline-wise this was backed on Kickstarter in 2020. Site launched in summer 2020. The creator was very active on Kickstarter working on communicating and updating the community with what was going on (until the end there).

    Also seems to have a Patreon, and worked itself into other places like https://github.com/theimpossibleastronaut/awesome-linguistic...

  • wikipron

    Massively multilingual pronunciation mining

  • ichiran

    Linguistic tools for texts in Japanese language

  • prosodic

    Prosodic: a metrical-phonological parser, written in Python. For English and Finnish, with flexible language support.

  • dev

    PHOIBLE data and development. (by phoible)

  • Project mention: Does someone have a phonemic inventory of all the romance languages, a list of all the phonemes in all the romance languages ? | /r/linguistics | 2023-05-11

    Does the language you’re thinking of have an inventory on https://phoible.org/?

  • TextAnnotationGraphs

    A modular annotation system that supports complex, interactive annotation graphs embedded on top of sequences of text.

  • odict

    A blazingly-fast, offline-first format and toolchain for lexical data 📖

  • OpenGNT

    Open Greek New Testament Project; NA28 / NA27 Equivalent Text & Resources

  • ambuda

    Main application code for Ambuda, a breakthrough Sanskrit library (ambuda.org)

  • Project mention: The Theorist Who Sees Math in Art, Music and Writing | news.ycombinator.com | 2024-03-04

    >"Thousands of years ago in India, poets were trying to think about the possible meters. In Sanskrit poetry, you have long and short syllables. Long is twice as long as short. If you want to work out how many there are that take a length of time of three, you can have short, short, short, or long, short, or short, long. There are three ways to make three. There are five ways to make a length-four phrase. And there are eight ways to make a length-five phrase. This sequence you’re getting is one where every term is the sum of the previous two. You exactly reproduce what we nowadays call the Fibonacci sequence. But this was centuries before Fibonacci."

    Related:

    Ambuda: "Building the world's largest Sanskrit library":

    https://ambuda.org/
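
    The counting argument in the quoted comment above can be checked with a short sketch. It assumes a short syllable lasts 1 unit and a long syllable lasts 2, as the quote states; the function names `count_meters` and `list_meters` are hypothetical and chosen only for illustration.

```python
def count_meters(total: int) -> int:
    """Count sequences of short (1 unit) and long (2 units) syllables
    whose durations add up to `total`."""
    if total < 0:
        raise ValueError("total duration must be non-negative")
    # prev and curr hold the counts for durations 0 and 1: one way each.
    prev, curr = 1, 1
    for _ in range(total - 1):
        # A pattern ends in either a short syllable (leaving total - 1)
        # or a long syllable (leaving total - 2), so the counts add up.
        prev, curr = curr, prev + curr
    return curr


def list_meters(total: int) -> list[str]:
    """Enumerate the patterns themselves, with S = short and L = long."""
    if total < 0:
        return []
    if total == 0:
        return [""]
    starts_short = ["S" + rest for rest in list_meters(total - 1)]
    starts_long = ["L" + rest for rest in list_meters(total - 2)]
    return starts_short + starts_long


if __name__ == "__main__":
    for duration in (3, 4, 5):
        print(duration, count_meters(duration), list_meters(duration))
```

    For durations 3, 4 and 5 this gives 3, 5 and 8 patterns, matching the figures in the quote; the three patterns of duration 3 are short-short-short, short-long and long-short.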

  • langstats

    A visual color bar of the programming languages in your directory, with percentages and labels

  • zeroshot_topics

    Topic Inference with Zeroshot models

  • tone

    A Cross-Cultural Writing System (by termsurf)

  • treebender

    A HDPSG-inspired symbolic natural language parser written in Rust

  • langua

    A suite of language tools

  • WonderfulPolishLanguage

    A repository listing resources for learning and exploring the wonderful Polish language.

  • proiel-treebank

    Official releases of the PROIEL treebank of ancient Indo-European languages

  • iso639

    ISO 639 language codes (by jacksonllee)

  • syn

    🌾 Get synonyms and antonyms of words from Thesaurus.com and other sources in your terminal, with rich output. (by agmmnn)

  • google-books-ngram-frequency

    Word/n-gram frequency lists for the Google Books Ngram Corpus (v3, all languages) with Python code

  • top-open-subtitles-sentences

    Most common sentences and words for all languages in the OpenSubtitles2018 corpus with Python code

  • Project mention: A colloquial (عامیانه) frequency list! Our prayers have been answered. | /r/farsi | 2023-08-03
NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Linguistics related posts

  • The AI Revolution Is Crushing Thousands of Languages

    2 projects | news.ycombinator.com | 25 Apr 2024
  • Untranslatable

    2 projects | news.ycombinator.com | 26 Jan 2024
  • A colloquial (عامیانه) frequency list! Our prayers have been answered.

    1 project | /r/farsi | 3 Aug 2023
  • Draw Syntactic Trees!

    1 project | /r/LinguisticsAtUofT | 5 Jul 2023
  • Seeking your insights on "Loquax": A tool for phonological analysis

    3 projects | /r/latin | 30 May 2023
  • Does someone have a phonemic inventory of all the romance languages, a list of all the phonemes in all the romance languages ?

    1 project | /r/linguistics | 11 May 2023
  • Are there any websites gathering graded readers in different languages; either making it themselves or simply sharing sources. I’m specifically looking for fiction books in Polish A1/A2.

    1 project | /r/languagelearning | 26 Apr 2023

Index

What are some of the best open-source Linguistic projects? This list will help you:

#   Project                         Stars
1   tatoeba2                        668
2   ipa-dict                        498
3   rime-cantonese                  494
4   awesome-linguistics             352
5   wikipron                        289
6   ichiran                         278
7   prosodic                        268
8   dev                             108
9   TextAnnotationGraphs            89
10  odict                           80
11  OpenGNT                         79
12  ambuda                          79
13  langstats                       61
14  zeroshot_topics                 60
15  tone                            52
16  treebender                      39
17  langua                          35
18  WonderfulPolishLanguage         34
19  proiel-treebank                 33
20  iso639                          27
21  syn                             26
22  google-books-ngram-frequency    28
23  top-open-subtitles-sentences    17
