pybind11
tokenizers
Metric | pybind11 | tokenizers
---|---|---
Mentions | 42 | 8
Stars | 14,800 | 8,424
Growth | 2.1% | 3.1%
Activity | 8.6 | 8.5
Last commit | 7 days ago | 3 days ago
Language | C++ | Rust
License | BSD 3-Clause License | Apache License 2.0
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
pybind11
- Experience using crow as web server
I'm investigating using C++ to build a REST server, and would love to hear about people's experiences with Crow, or whether they would recommend something else as a "medium-level" abstraction C++ web server. As background, I started off experimenting with Python/FastAPI, which is great, but there is too much friction in translating pybind11-exported C++ objects to the format that FastAPI expects, and, of course, there are inherent performance limitations in using Python, which could hinder scaling up if the project were to be successful.
- Swig – Connect C/C++ programs with high-level programming languages
- returning numpy arrays via pybind11
I have a C++ function computing a large tensor which I would like to return to Python as a NumPy array via pybind11.
- I created smooth_lines python module, great for drawing software
This is based on the Google Ink Stroke Modeler C++ library and uses pybind11 to make it available in Python.
- Facial Landmark Detection with C++
pybind11 makes it easy to call C++ from Python if you want to mix.
- Python’s Multiprocessing Performance Problem
If you've never used Pybind before, these pybind tests[1] and this repo[2] have good examples you can crib to get started (in addition to the docs). Once you have handled passing, returning, and creating the main data types (list, tuple, dict, set, numpy array) the first time, it's mostly smooth sailing.
Pybind offers a lot of functionality, but the core "good parts" I've found useful are (a) use a numpy array in Python and pass it to a C++ method to work on, (b) pass your Python data structure to pybind and then do work on it in C++ (some copy overhead), and (c) make a class/struct in C++ and expose it to Python (so no copying overhead, and you can create nice cache-aware structs, etc.).
[1] https://github.com/pybind/pybind11/blob/master/tests/test_py...
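The difference between points (a) and (b) above comes down to whether the C++ side works on the caller's memory or on a converted copy. NumPy arrays expose their memory through Python's buffer protocol, which is what makes copy-free access possible. The following is a rough, stdlib-only sketch of that distinction; it uses neither pybind11 nor NumPy, and `scale_in_place` and `scale_copy` are hypothetical names standing in for bound C++ functions.

```python
from array import array


def scale_in_place(buf, factor):
    """Multiply every element of a writable buffer by `factor`, without copying.

    Hypothetical stand-in for a bound C++ function that receives a view of
    the caller's memory (case (a) in the text).
    """
    view = memoryview(buf)  # shares memory with `buf`; no data is copied
    for i in range(len(view)):
        view[i] = view[i] * factor


def scale_copy(values, factor):
    """Return a new list and leave the caller's data untouched.

    Hypothetical stand-in for case (b), where the data is converted
    (copied) on the way in.
    """
    return [v * factor for v in values]


data = array("d", [1.0, 2.0, 3.0])
scale_in_place(data, 2.0)        # `data` itself is modified
copied = scale_copy(data, 0.5)   # `data` is not changed by this call
```

With large arrays the copy in the second function is the "copy overhead" the comment mentions, while the first function only pays for the arithmetic.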
- Making Python Web Application with C++ Backend
- Using pybind11 with MinGW to cross compile Python module for Windows
I have a Python module whose logic is written in C++, and I use pybind11 to expose the objects and functions to Python.
- IPC communication between Rust, C++, and Python
Reading from Python requires a wrapper; with pybind11 this is fairly easy to do.
- [ADVICE] Python to C++
I can also highly recommend starting to use C++ to augment your Python code, i.e. find the parts that are slow or infeasible in Python, write those in C++, and then expose them as Python functions. You can use https://github.com/pybind/pybind11 to call C++ code from Python.
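One way to find the slow parts mentioned above is the standard-library profiler. This is a minimal sketch, assuming a hypothetical `slow_sum_of_squares` function as the hot spot; in a real program, whichever function dominates the report would be the first candidate for a C++ rewrite.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Hypothetical hot spot: a pure-Python loop that C++ could replace."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def main():
    return slow_sum_of_squares(200_000)


# Profile only the call we care about.
profiler = cProfile.Profile()
profiler.enable()
result = main()
profiler.disable()

# Collect the report into a string, sorted by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```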
tokenizers
- HF Transfer: Speed up file transfers
Hugging Face seems to like Rust. They also wrote Tokenizers in Rust.
- LLM custom dictionary
Your intuition is right. There are two ways (in increasing order of resulting performance): 1. You can simply extend the vocab file of the tokenizer and test the predictions. 2. You can extend the vocab file and re-train your model on custom data that contains these new tokens. See the following issue on GitHub: https://github.com/huggingface/tokenizers/issues/247
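To see why option 1 changes what the model receives, here is a toy greedy longest-match tokenizer in plain Python. It is only a sketch of the idea, not the Hugging Face `tokenizers` API; the vocabulary, the `tokenize` helper, and the `[UNK]` marker are assumptions made for the example.

```python
def tokenize(text, vocab, unk="[UNK]"):
    """Greedy longest-match tokenization of `text` against `vocab`."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, then shrink.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # No vocabulary entry starts here: emit an unknown marker.
            tokens.append(unk)
            i += 1
    return tokens


# Hypothetical starting vocabulary of sub-word pieces.
vocab = {"py", "bind", "to", "ken", "izer", "s"}
before = tokenize("pybind", vocab)

# "Extending the vocab file": add the domain-specific term as one token.
vocab.add("pybind")
after = tokenize("pybind", vocab)
```

Before the extension the term is split into two pieces; afterwards it is a single token. A model that has not been re-trained typically has no learned representation for that new token, which is consistent with option 2 being listed as the better-performing one.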
- [D] SentencePiece, WordPiece, BPE... Which tokenizer is the best one?
SentencePiece -> implementation of some algorithms (there are several others, https://github.com/microsoft/BlingFire https://github.com/glample/fastBPE https://github.com/huggingface/tokenizers )
- Portability of Rust in 2021
In sum, I like the idea of going with Rust, as I more or less have to rewrite the whole thing anyway, but I am a bit skeptical about whether I will be able to interface with everything that might come up at some point, or whether I will end up in wrapper hell if I have to use more C++ libraries. On the other hand, there are definitely a few Rust projects out there that might come in handy (for example https://github.com/huggingface/tokenizers). And the build process is pretty awful right now (it is CMake, but with lots of hacks).
- [D] What's going to be the dominant language for machine learning in 5 years?
A full machine learning pipeline usually comprises far more than just the model, and this is the area where Rust may shine (the recent work by HuggingFace and their https://github.com/huggingface/tokenizers library is a good example)
- substitute for tokenizer in torchtext
As for other tokenizers, you can take a look at: the Huggingface tokenizers library (https://github.com/huggingface/tokenizers); NLTK tokenize (https://www.nltk.org/api/nltk.tokenize.html); and Polyglot (https://pypi.org/project/polyglot/).
- PyO3: Rust Bindings for the Python Interpreter
Huggingface Tokenizers (https://github.com/huggingface/tokenizers), which are now used by default in their Transformers Python library, use PyO3 and became popular thanks to the pitch that they encode text an order of magnitude faster with zero config changes.
It lives up to that claim. (I had issues with return object typing when going between Python and Rust at first, but those are more consistent now.)
- Rusticles #19 - Wed Nov 11 2020
huggingface/tokenizers (Rust): 💥Fast State-of-the-Art Tokenizers optimized for Research and Production
What are some alternatives?
PyO3 - Rust bindings for the Python interpreter
onnx-tensorflow - Tensorflow Backend for ONNX
nanobind - nanobind: tiny and efficient C++/Python bindings
onnxruntime - ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
Optional Argument in C++ - Named Optional Arguments in C++17
setuptools-rust - Setuptools plugin for Rust support
BlingFire - A lightning fast Finite State machine and REgular expression manipulation library.
sol2 - Sol3 (sol2 v3.0) - a C++ <-> Lua API wrapper with advanced features and top notch performance - is here, and it's great!
rayon - Rayon: A data parallelism library for Rust
PEGTL - Parsing Expression Grammar Template Library
tch-rs - Rust bindings for the C++ api of PyTorch.