snorkel
nodevectors
Metric | snorkel | nodevectors
---|---|---
Mentions | 5 | 8
Stars | 5,707 | 487
Growth | 0.8% | -
Activity | 5.5 | 0.0
Last commit | about 2 months ago | over 1 year ago
Language | Python | Python
License | Apache License 2.0 | MIT License
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
snorkel
- [P] We are building a curated list of open source tooling for data-centric AI workflows, looking for contributions.
The paid product came out of an open source tool: https://github.com/snorkel-team/snorkel
- [Discussion] - "data sourcing will be more important than model building in the era of foundational model fine-tuning"
- Can't use load_data from utils
Actually, I referenced it in my issue as well. There seem to be different utils.py files in different folders under the snorkel-tutorials repo, but the utils module we get after importing snorkel is a different [file](https://github.com/snorkel-team/snorkel/blob/master/snorkel/utils/core.py), i.e. the utils file in the main snorkel repo is not the same one.
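The confusion above comes down to which `utils` module Python resolves. A minimal sketch, assuming the tutorial folder ships its own `utils.py` that defines `load_data` (the file contents and the `load_local_utils` helper below are hypothetical stand-ins, not code from either repo), loads that file explicitly by path so it cannot be mistaken for the `snorkel.utils` package:

```python
import importlib.util
import tempfile
from pathlib import Path


def load_local_utils(folder):
    """Load <folder>/utils.py by explicit path (hypothetical helper)."""
    path = Path(folder) / "utils.py"
    spec = importlib.util.spec_from_file_location("tutorial_utils", str(path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Stand-in for a tutorial folder. The real utils.py lives in the
# snorkel-tutorials repo, not in the installed snorkel package.
tutorial_dir = tempfile.mkdtemp()
Path(tutorial_dir, "utils.py").write_text(
    "def load_data():\n    return ['placeholder row']\n"
)

tutorial_utils = load_local_utils(tutorial_dir)
rows = tutorial_utils.load_data()
```

Loading by path under a distinct module name (`tutorial_utils` here) avoids depending on the current working directory or on `sys.path` ordering.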
- [D] A hand-picked selection of the best Python ML Libraries of 2021
- [Discussion] Methods for enhancing high-quality dataset A with low-quality dataset
Snorkel (https://github.com/snorkel-team/snorkel) might provide you exactly what you are looking for. From the docs:
nodevectors
- Vectorizing Graph Neural Networks
Yes, people working on graph-based ML quickly realize that the underlying data structures used by most libraries with academic origins (NetworkX, PyG, etc.) are bad.
I wrote about this before [1] and based a node embedding library around the concept [2].
The NetworkX style graphs are laid out as a bunch of items in a heap with pointers to each other. That works at extreme scales, because everything is on a cluster's RAM and you don't mind paying the latency costs of fetch operations. But it makes little sense for graphs with < 5B nodes to be honest.
Laying out the graph as a CSR sparse matrix makes way more sense because of data locality. At larger scales, you could just leave the CSR array data on NVMe drives, and you'd still operate at 500 MB/s random query throughput with hand-coded access, ~150 MB/s with mmap. That remains to be implemented by someone.
[1] https://www.singlelunch.com/2019/08/01/700x-faster-node2vec-...
[2] https://github.com/VHRanger/nodevectors
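The CSR layout described in this post can be sketched with two flat arrays. This is a minimal illustration using NumPy only, not code from nodevectors; the function names `edges_to_csr` and `neighbors` are made up for the example:

```python
import numpy as np


def edges_to_csr(n_nodes, src, dst):
    """Build CSR arrays (indptr, indices) from a directed edge list."""
    src = np.asarray(src, dtype=np.int64)
    dst = np.asarray(dst, dtype=np.int64)
    order = np.argsort(src, kind="stable")   # group edges by source node
    indices = dst[order]                     # neighbor ids, contiguous per node
    counts = np.bincount(src, minlength=n_nodes)
    indptr = np.zeros(n_nodes + 1, dtype=np.int64)
    np.cumsum(counts, out=indptr[1:])        # indptr[i]:indptr[i+1] bounds node i
    return indptr, indices


def neighbors(indptr, indices, node):
    """Neighbors of `node` as one contiguous slice (no pointer chasing)."""
    return indices[indptr[node]:indptr[node + 1]]


# Toy directed graph: 0->1, 0->2, 1->2, 3->0
indptr, indices = edges_to_csr(4, [0, 0, 1, 3], [1, 2, 2, 0])
print(neighbors(indptr, indices, 0))
```

Because each node's neighbors sit next to each other in `indices`, a lookup is a single slice. The same two arrays could in principle be written to disk and opened with `np.memmap` for the mmap-style access the post mentions; that part is not shown here.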
- Zoomable, animated scatterplots in the browser that scales over a billion points
Ideally, you'd embed the graph into 2 or 3d first, then visualize it as a scatterplot.
Visualizing the edges at scale doesn't yield nice results in general.
The way to do it is to reduce the graph to some 300d or 500d embeddings, then use TSNE/UMAP/PACMAP to reduce that to 3d. Then visualize.
My preferred way is to use some first-order embedding method like GGVec in this library [1] (disclaimer: I wrote it). Node2Vec and ProNE don't yield great embeddings for visualization (the first is too filamented, the second too close to the unit ball).
Another great library to do this work is GRAPE [2]. Try first-order embedding methods, or short walks on second order methods to avoid the embeddings being too filamented by long random walk sampling.
[1] https://github.com/VHRanger/nodevectors
[2] https://github.com/AnacletoLAB/grape/
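The second stage of the pipeline in this post (high-dimensional node embeddings reduced to 3D before plotting) can be sketched as follows. To stay dependency-free, this sketch swaps in plain PCA via SVD where the post recommends TSNE/UMAP/PACMAP, and it uses random vectors as a placeholder for real embeddings; `reduce_to_3d` is a hypothetical helper name:

```python
import numpy as np


def reduce_to_3d(embeddings):
    """Project (n_nodes, d) embeddings to 3 dimensions with PCA via SVD.

    PCA is a stand-in here; the post suggests TSNE/UMAP/PACMAP instead.
    """
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:3].T


# Placeholder for real node embeddings (e.g. 300-d vectors from an embedding library).
rng = np.random.default_rng(0)
node_embeddings = rng.normal(size=(1000, 300))
coords_3d = reduce_to_3d(node_embeddings)
```

Each row of `coords_3d` is one node's position, ready to be handed to a scatterplot tool. A nonlinear reducer would normally preserve cluster structure better than PCA for this purpose.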
- [P] We are building a curated list of open source tooling for data-centric AI workflows, looking for contributions.
For graph embeddings, there's quite a few. I'd recommend this one, but there's also this one (disclaimer: I'm the author) or this one, more of a DGL library.
- clustering on sparse data (that's also wide)
You could also use some node embedding library to embed the sparse matrix into a denser one and then cluster that.
- Faster Python calculations with Numba: 2 lines of code, 13× speed-up
Numba fits very few use cases, but where it does fit, it's awesome.
I've been using it in a Python graph library to write graph traversal routines and it's done me very well: https://github.com/VHRanger/nodevectors
The best part is the native OpenMP support on for loops, IMO. It makes parallelism in data work very efficient compared to Python alternatives that use processes (instead of threads).
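A minimal sketch of the kind of parallel loop described here, assuming the graph is stored as CSR arrays `indptr` and `indices`. It is an illustration, not code from nodevectors; the function `mean_neighbor_degree` is made up for the example, and the `try`/`except` fallback only exists so the snippet still runs (serially) where Numba is not installed:

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:  # fallback so the sketch still runs without Numba
    prange = range

    def njit(*args, **kwargs):
        return lambda func: func


@njit(parallel=True)
def mean_neighbor_degree(indptr, indices):
    """For each node, average the out-degree of its neighbors."""
    n_nodes = indptr.shape[0] - 1
    out = np.zeros(n_nodes, dtype=np.float64)
    for node in prange(n_nodes):  # with Numba, iterations are split across threads
        start, end = indptr[node], indptr[node + 1]
        total = 0.0
        for k in range(start, end):
            nbr = indices[k]
            total += indptr[nbr + 1] - indptr[nbr]
        if end > start:
            out[node] = total / (end - start)
    return out


# CSR arrays for the toy graph 0->1, 0->2, 1->2, 3->0
indptr = np.array([0, 2, 3, 3, 4], dtype=np.int64)
indices = np.array([1, 2, 2, 0], dtype=np.int64)
result = mean_neighbor_degree(indptr, indices)
```

Each iteration writes only to its own slot in `out`, so the outer loop is safe to run in threads without locks, which is what makes `prange` a good fit for per-node graph routines.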
- UMAP works by representing high-dimensional data as a weighted graph and projecting that graph in lower dimensions. Could you use it directly to visualize a graph?
I was playing around with graph embeddings (https://github.com/VHRanger/nodevectors/) and wanted to visualize them, which led me to look into UMAP.
- [D] Best methods for imbalanced multi-class classification with high dimensional, sparse predictors
The best candidates for it would be UMAP or graph embedding methods
- Why I'm Lukewarm on Graph Neural Networks
As expected, networkx couldn't handle more than a million nodes so I had to search for python libs which might handle that much data.
This is why I've been using your lib (https://github.com/VHRanger/nodevectors) for at least 2 weeks now as well as these 2 other libs: https://github.com/louisabraham/fastnode2vec and https://github.com/sknetwork-team/scikit-network. What do they have in common? They handle sparse graphs (using CSR representations).
Having a graph with several million nodes isn't just some edge case; social graphs, for instance, grow way faster than anyone could expect.
What are some alternatives?
skweak - skweak: A software toolkit for weak supervision applied to NLP tasks
ndarray_comparison - Benchmark of toy calculation on an n-dimensional array using python, numba, cython, pythran and rust
argilla - Argilla is a collaboration platform for AI engineers and domain experts that require high-quality outputs, full data ownership, and overall efficiency.
deepscatter - Zoomable, animated scatterplots in the browser that scales over a billion points
spaCy - 💫 Industrial-strength Natural Language Processing (NLP) in Python
GCGT - Source code for the paper: GPU-based Compressed Graph Traversal
weasel - Weakly Supervised End-to-End Learning (NeurIPS 2021)
refinery - The data scientist's open-source choice to scale, assess and maintain natural language data. Treat training data like a software artifact.
caer - High-performance Vision library in Python. Scale your research, not boilerplate.
CloudForest - Ensembles of decision trees in go/golang.
pytorch-lightning - Build high-performance AI models with PyTorch Lightning (organized PyTorch). Deploy models with Lightning Apps (organized Python to build end-to-end ML systems). [Moved to: https://github.com/Lightning-AI/lightning]
nanocube