gpt-3
cdx-index-client
| | gpt-3 | cdx-index-client |
|---|---|---|
| Mentions | 39 | 1 |
| Stars | 9,406 | 171 |
| Growth | - | - |
| Activity | 3.5 | 10.0 |
| Latest commit | over 3 years ago | over 5 years ago |
| Language | Python | - |
| License | - | MIT License |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
gpt-3
-
Can ChatGPT improve my L2 grammar?
Are generative AI models useful for learning a language, and if so, which languages? Over 90% of ChatGPT's training data was in English. The remaining 10% was split unevenly between 100+ languages. This suggests that the quality of the outputs will vary from language to language.
-
GPT4 Can’t Ace MIT
I doubt it was extensively trained on German data. Who knows about GPT-4, but GPT-3's training data is ~92% English and ~1.5% German, which means it saw "die, motherfucker, die" more often than "die Mutter".
(https://github.com/openai/gpt-3/blob/master/dataset_statisti...)
- I need help.
-
[R] PaLM 2 Technical Report
Catalan was 0.018 % of GPT-3's training corpus. https://github.com/openai/gpt-3/blob/master/dataset_statistics/languages_by_word_count.csv.
- I'm seriously concerned that if I lost ChatGPT-4 I would be handicapped
- The responses I got from bard after asking why 100 times… he was pissed 😂
-
BharatGPT: India's Own ChatGPT
>Certainly it is pleasing that they are not just doing Hindi, but some of these languages must be represented online by a very small corpus of text indeed. I wonder how effectively an LLM can be trained on such a small training set for any given language?
As long as it's not the main language, it doesn't really matter. Besides English (92.6%), the language with the biggest representation (by word count) is French, at 1.8%. Most of the languages GPT-3 knows sit at <0.2% representation.
https://github.com/openai/gpt-3/blob/master/dataset_statisti...
Competence in the main language will bleed into the rest.
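The percentages quoted above come from OpenAI's `languages_by_word_count.csv`. As a rough illustration only (not OpenAI's tooling), a short Python sketch can parse a CSV of that shape and report each language's share; the column names and the sample numbers below are assumptions, not the real file's contents.

```python
import csv
import io

def language_shares(csv_text, top_n=3):
    """Parse a languages-by-word-count CSV and return the top languages
    as (name, percent-of-total-words) pairs. Column names are assumed."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    total = sum(int(r["word_count"]) for r in rows)
    shares = [(r["language"], 100 * int(r["word_count"]) / total) for r in rows]
    shares.sort(key=lambda pair: pair[1], reverse=True)
    return shares[:top_n]

# Illustrative numbers only, shaped like the percentages quoted above.
sample = """language,word_count
English,92600
French,1800
German,1500
Catalan,18
"""
for lang, pct in language_shares(sample):
    print(f"{lang}: {pct:.1f}%")
```

With a heavily skewed distribution like this, even the second-place language is a rounding error next to English, which is the commenter's point.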
- GPT-4 gets a B on Scott Aaronson's quantum computing final exam
-
[D] Dumb question: is GPT3 model open-sourced?
And from skimming their GH page, it seems it'd be costly to host as well
- ChatGPT and the Daily Question Thread, re-evaluated with GPT-4.
cdx-index-client
-
DeepMind's New Language Model, Chinchilla (70B Parameters), Which Outperforms GPT-3
Common Crawl actually does not contain Twitter; you can go check the indexes with https://github.com/ikreymer/cdx-index-client . Twitter is extremely aggressive about blocking scraping/caching, and I guess that keeps CC out. Models like GPT-3 still know a decent amount of Twitter material, and I figure that this is due to tweets being excerpted or mirrored manually on non-Twitter.com URLs (e.g. all the Twitter-mirroring bots on Reddit).
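The "go check the indexes" claim can be spot-checked without the client itself, since cdx-index-client wraps Common Crawl's public CDX index API. A minimal sketch, assuming the standard CDX query parameters; the collection name below is an example (real names follow the CC-MAIN-YYYY-WW pattern) and the actual lookup needs network access:

```python
from urllib.parse import urlencode

CDX_API = "https://index.commoncrawl.org/{collection}-index"

def cdx_query_url(collection, url_pattern, limit=5):
    """Build a Common Crawl CDX index query URL.

    The index server speaks the CDX API: pass a URL pattern and ask
    for JSON lines back. An empty response means the pattern (e.g.
    twitter.com/*) has no captures in that crawl.
    """
    params = urlencode({"url": url_pattern, "output": "json", "limit": limit})
    return CDX_API.format(collection=collection) + "?" + params

# Example collection name; substitute a current CC-MAIN crawl.
query = cdx_query_url("CC-MAIN-2022-05", "twitter.com/*")
print(query)

# To actually run the check (requires network):
# import urllib.request
# with urllib.request.urlopen(query) as resp:
#     body = resp.read().decode()
# print("captures found" if body.strip() else "no captures for twitter.com")
```

cdx-index-client does essentially this across every index shard in parallel, which is why it is the convenient way to run the check at scale.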
What are some alternatives?
dalle-mini - DALL·E Mini - Generate images from a text prompt
mup - maximal update parametrization (µP)
DALL-E - PyTorch package for the discrete VAE used for DALL·E.
DALLE-mtf - Open-AI's DALL-E for large scale training in mesh-tensorflow.
stylegan2-pytorch - Simplest working implementation of Stylegan2, state of the art generative adversarial network, in Pytorch. Enabling everyone to experience disentanglement
v-diffusion-pytorch - v objective diffusion inference code for PyTorch.
dalle-2-preview
tensorrtx - Implementation of popular deep learning networks with TensorRT network definition API
gpt-2 - Code for the paper "Language Models are Unsupervised Multitask Learners"
jukebox - Code for the paper "Jukebox: A Generative Model for Music"
automl - Google Brain AutoML
bevy_retro - Plugin pack for making 2D games with Bevy