Show HN: Cedille, the largest French language model, released in open source

InfluxDB - Power Real-Time Data Analytics at Scale

Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.

www.influxdata.com

featured

SaaSHub - Software Alternatives and Reviews

SaaSHub helps you find the best software and product alternatives

www.saashub.com

featured

cedille-ai

9 201 0.0

✒️ Cedille is a large French language model (6B), released under an open-source license

We are excited to announce Cedille, the largest language model for French (6b parameters).
Demo: https://cedille.ai
Language models are general purpose AI systems that are able to solve a range of tasks by simply being prompted for it. It can be used for example to summarize text, do translations, or for idea generation & overcoming writer's block.
You may know GPT-3, the humongous model from OpenAI. Cedille is a similar model targeting the French demographic - but smaller, as we don’t yet have $1b in the bank like they do. Although GPT-3 supports multiple languages including French, our model is competitive with GPT-3 on a range of French tasks! Plus, of course we’re open source while they keep their model closed and heavily restrict access to it.
You can try it out right away from our playground: https://app.cedille.ai
We are proponents of “open AI” and as such have released a checkpoint for the world to use (MIT license): https://github.com/coteries/cedille-ai
One of the problems with large language models is the potentially toxic, sexist or in other ways unpleasant output. We tried our best to avoid this issue by doing extensive dataset filtering. As a result, our benchmark indicates that Cedille is indeed less toxic than GPT-3.

detoxify

4 839 6.2 Python

Trained models & code to predict toxic comments on all 3 Jigsaw Toxic Comment Challenges. Built using ⚡ Pytorch Lightning and 🤗 Transformers. For access to our API, please email us at [email protected].

Yeah, this kind of toxic output sadly still can happen :-/
We have fully analyzed the training dataset (1128 GB) using Detoxify (https://github.com/unitaryai/detoxify) to filter out problematic content. But of course detecting toxicity is a tough challenge in itself, so this process is imperfect at best.
We are using the RealToxicityPrompt framework (https://realtoxicityprompts.apps.allenai.org/) to analyse how toxic our models are and to steer our efforts in this direction. This means we are generating thousands of completions and analysing them to see how "nasty" the model is. We plan to write more on this topic soon.
But yeah, this is definitely far from being a solved problem, and our model (as well as all large language models) should be handled with care.

InfluxDB

www.influxdata.com featured

Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Cedille, the largest French language model , released in open source

4 projects | /r/france | 10 Nov 2021
ML Discord Moderation Bot

1 project | /r/DiscordModeration | 2 Aug 2022
[D] Are there attempts at a large German-language LM?

1 project | /r/MachineLearning | 5 Apr 2021
Haystack DB – 10x faster than FAISS with binary embeddings by default

3 projects | news.ycombinator.com | 28 Apr 2024
Pen.el – Emacs-based operating system designed with holiness in mind

1 project | news.ycombinator.com | 25 Apr 2024

Show HN: Cedille, the largest French language model, released in open source

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
NLP Bert bert-model Nlg huggingface-transformers
Post date: 10 Nov 2021

cedille-ai

detoxify

InfluxDB

Related posts

Cedille, the largest French language model , released in open source

ML Discord Moderation Bot

[D] Are there attempts at a large German-language LM?

Haystack DB – 10x faster than FAISS with binary embeddings by default

Pen.el – Emacs-based operating system designed with holiness in mind

Show HN: Cedille, the largest French language model, released in open source

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com NLP Bert bert-model Nlg huggingface-transformers Post date: 10 Nov 2021

cedille-ai

detoxify

InfluxDB

Related posts

Cedille, the largest French language model , released in open source

ML Discord Moderation Bot

[D] Are there attempts at a large German-language LM?

Haystack DB – 10x faster than FAISS with binary embeddings by default

Pen.el – Emacs-based operating system designed with holiness in mind

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
NLP Bert bert-model Nlg huggingface-transformers
Post date: 10 Nov 2021