bert vs compare-go-json
|  | bert | compare-go-json |
| --- | --- | --- |
| Mentions | 49 | 5 |
| Stars | 36,766 | 18 |
| Growth | 1.2% | - |
| Activity | 0.0 | 0.0 |
| Last commit | 5 months ago | almost 2 years ago |
| Language | Python | Go |
| License | Apache License 2.0 | - |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
bert
-
Integrate LLM Frameworks
The release of BERT in 2018 kicked off the language model revolution. The Transformer architecture succeeded RNNs and LSTMs to become the architecture of choice. Unbelievable progress was made in a number of areas: summarization, translation, text classification, entity classification, and more. 2023 took things to another level with the rise of large language models (LLMs). Models with billions of parameters showed an amazing ability to generate coherent dialogue.
-
Embeddings: What they are and why they matter
The general idea is that you have a particular task and dataset, and you optimize these vectors to maximize performance on that task. So the properties of these vectors - what information is retained and what is left out during the 'compression' - are effectively determined by that task.
In general, the core task for the various "LLM tools" involves predicting a hidden word, trained on very large quantities of real text - thus also mirroring whatever structure (linguistic, syntactic, semantic, factual, social bias, etc.) exists there.
If you want to see how the sausage is made and look at the actual algorithms, the two key approaches to read up on are probably Mikolov's word2vec (https://arxiv.org/abs/1301.3781), with its Continuous Bag of Words (CBOW) and Continuous Skip-Gram models built on relatively simple mathematical optimization, and then BERT (https://arxiv.org/abs/1810.04805), which does a conceptually similar thing but with a large neural network that can learn more from the same data. For both, you can either read the original papers or look up blog posts and videos that explain them; different people have different preferences on how readable academic papers are.
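If it helps to see that objective in code, here is a tiny, self-contained skip-gram sketch with negative sampling. It is only an illustration of the idea, not any project's actual implementation: the toy corpus, vector size, and hyperparameters are all invented, and real word2vec implementations add subsampling, frequency-weighted negative sampling, and much more.
```go
// Toy skip-gram with negative sampling: each word gets a small vector,
// and vectors are nudged so words seen in the same context score high
// together while random word pairs score low.
package main

import (
	"fmt"
	"math"
	"math/rand"
	"strings"
)

const (
	dim    = 8    // embedding size (real models use hundreds)
	lr     = 0.05 // learning rate
	win    = 2    // context window radius
	epochs = 200  // passes over the toy corpus
	negs   = 3    // negative samples per observed pair
)

func sigmoid(x float64) float64 { return 1 / (1 + math.Exp(-x)) }

func dot(a, b []float64) float64 {
	s := 0.0
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

func main() {
	corpus := strings.Fields("the cat sat on the mat the dog sat on the rug")
	vocab := map[string]int{}
	var words []string
	for _, w := range corpus {
		if _, ok := vocab[w]; !ok {
			vocab[w] = len(words)
			words = append(words, w)
		}
	}
	// Two vector tables, as in word2vec: input (center) and output (context).
	in := make([][]float64, len(words))
	out := make([][]float64, len(words))
	for i := range in {
		in[i] = make([]float64, dim)
		out[i] = make([]float64, dim)
		for d := 0; d < dim; d++ {
			in[i][d] = (rand.Float64() - 0.5) / dim
		}
	}
	// One SGD step on a (center, context) pair; label 1 = observed, 0 = noise.
	train := func(center, context int, label float64) {
		g := lr * (label - sigmoid(dot(in[center], out[context])))
		for d := 0; d < dim; d++ {
			in[center][d], out[context][d] =
				in[center][d]+g*out[context][d], out[context][d]+g*in[center][d]
		}
	}
	for e := 0; e < epochs; e++ {
		for i, w := range corpus {
			for j := i - win; j <= i+win; j++ {
				if j < 0 || j >= len(corpus) || j == i {
					continue
				}
				train(vocab[w], vocab[corpus[j]], 1) // pull observed pair together
				for n := 0; n < negs; n++ {
					// Toy simplification: a negative sample may occasionally
					// hit a true context word; real code filters or reweights.
					train(vocab[w], rand.Intn(len(words)), 0)
				}
			}
		}
	}
	// Words used in similar contexts ("cat"/"dog") tend to end up close.
	cos := func(a, b string) float64 {
		va, vb := in[vocab[a]], in[vocab[b]]
		return dot(va, vb) / math.Sqrt(dot(va, va)*dot(vb, vb))
	}
	fmt.Printf("cos(cat,dog)=%.2f  cos(cat,on)=%.2f\n", cos("cat", "dog"), cos("cat", "on"))
}
```
BERT's masked-word pretraining pursues the same "predict the hidden word" objective, just with a deep Transformer in place of the two shallow vector tables.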
-
Ask HN: How to Break into AI Engineering
Could you post a link to "the BERT paper"? I've read some, but would be interested in reading anything that anyone considered definitive :) Is it this one? "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding": https://arxiv.org/abs/1810.04805
-
How to leverage the state-of-the-art NLP models in Rust
The Rust crate rust_bert implements the BERT language model (Devlin, Chang, Lee, and Toutanova, 2018; https://arxiv.org/abs/1810.04805). The base model is implemented in the bert_model::BertModel struct, and several language model heads have also been implemented.
-
List of AI-Models
-
What were the 40 research papers on the list Ilya Sutskever gave John Carmack?
6. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (2018) - https://arxiv.org/abs/1810.04805 (Google)
-
Train a language model from scratch
The BERT paper has all the information regarding training parameters and datasets used. Hugging Face Datasets hosts the bookcorpus and wikipedia datasets.
-
I'm noticing a huge uprising of hostility against AI generated art lately. But where's the threat?
-
AlphaCode by DeepMind
-
[R] LiBai: a large-scale open-source model training toolbox
Found relevant code at https://github.com/google-research/bert
compare-go-json
-
The fastest tool for querying large JSON files is written in Python (benchmark)
For me OjG (https://github.com/ohler55/ojg) has been great. I regularly use it on files that cannot be loaded into memory. The best JSON file format for multiple records is one JSON document per record, all in the same file; OjG doesn't care whether they are on separate lines. It is fast (https://github.com/ohler55/compare-go-json) and has a fairly complete JSONPath implementation for searches. Similar to jq, but using JSONPath instead of a proprietary query language.
I am biased though as I wrote OjG to handle what other tools were not able to do.
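A minimal sketch of that record-per-document workflow, using OjG's oj parser and jp JSONPath packages as shown in the project's documentation. The records and the query here are invented, and in real use the reader would wrap a file too large to load whole.
```go
// Stream newline-delimited JSON records and extract a field from each
// with a JSONPath expression, one record in memory at a time.
package main

import (
	"bufio"
	"fmt"
	"strings"

	"github.com/ohler55/ojg/jp"
	"github.com/ohler55/ojg/oj"
)

func main() {
	// Hypothetical data in the "one JSON document per record" layout.
	src := `{"user":{"name":"alice","age":31}}
{"user":{"name":"bob","age":25}}`

	x, err := jp.ParseString("$.user.name") // a JSONPath query, jq-style
	if err != nil {
		panic(err)
	}
	sc := bufio.NewScanner(strings.NewReader(src))
	for sc.Scan() {
		doc, err := oj.ParseString(sc.Text()) // parse one record at a time
		if err != nil {
			panic(err)
		}
		for _, v := range x.Get(doc) { // collect every match in this record
			fmt.Println(v) // alice, then bob
		}
	}
}
```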
-
OjG now has a tokenizer that is almost 10 times faster than json.Decode
jsoniter is json-iterator/go. It is the 3rd column at https://github.com/ohler55/compare-go-json
The title says it all. The new tokenizer is here: https://github.com/ohler55/ojg, and the benchmarks and comparisons to other JSON packages are here: https://github.com/ohler55/compare-go-json.
You'll find some examples in the second link, which contains the benchmark code: https://github.com/ohler55/compare-go-json.
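For context on the standard-library side of that comparison, the closest encoding/json analogue to a tokenizer is the json.Decoder Token API, sketched below with an invented input document; the actual benchmark harness and the measured numbers live in the compare-go-json repo.
```go
// Stream JSON tokens with the standard library's json.Decoder.
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

func main() {
	dec := json.NewDecoder(strings.NewReader(`{"a":[1,true,"x"]}`))
	for {
		tok, err := dec.Token()
		if err != nil { // io.EOF when the input is exhausted
			break
		}
		fmt.Printf("%T %v\n", tok, tok) // json.Delim {, string a, ...
	}
}
```
Much of the gap in tokenizer benchmarks typically comes from per-token allocation and interface boxing on this path; refer to the repo above for the measured differences.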
What are some alternatives?
NLTK - NLTK Source
jsoniter - A high-performance 100% compatible drop-in replacement of "encoding/json"
bert-sklearn - a sklearn wrapper for Google's BERT model
orjson - Fast, correct Python JSON library supporting dataclasses, datetimes, and numpy
transformers - 🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
pysimilar - A python library for computing the similarity between two strings (text) based on cosine similarity
PURE - [NAACL 2021] A Frustratingly Easy Approach for Entity and Relation Extraction https://arxiv.org/abs/2010.12812
NL_Parser_using_Spacy - NLP parser using NER and TDD
cakechat - CakeChat: Emotional Generative Dialog System
word2vec-slim - word2vec Google News model slimmed down to 300k English words
msgspec - A fast serialization and validation library, with builtin support for JSON, MessagePack, YAML, and TOML
comparePlus - Compare plugin for Notepad++