snorkel
spaCy
| | snorkel | spaCy |
|---|---|---|
| Mentions | 5 | 91 |
| Stars | 5,500 | 26,215 |
| Growth | 0.5% | 0.9% |
| Activity | 5.5 | 9.7 |
| Latest Commit | about 1 month ago | 8 days ago |
| Language | Python | Python |
| License | Apache License 2.0 | MIT License |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
snorkel
-
[P] We are building a curated list of open source tooling for data-centric AI workflows, looking for contributions.
The paid product came out of an open source tool: https://github.com/snorkel-team/snorkel
- [Discussion] - "data sourcing will be more important than model building in the era of foundational model fine-tuning"
-
Can't use load_data from utils
Actually, I referenced it in my issue as well. There seem to be different utils.py files in different folders under the snorkel-tutorials repo, but the utils module you get after importing snorkel is a different [file](https://github.com/snorkel-team/snorkel/blob/master/snorkel/utils/core.py), i.e. the utils file in the main snorkel repo is different.
- [D] A hand-picked selection of the best Python ML Libraries of 2021
spaCy
-
Looking for open source projects in Machine Learning and Data Science
You could try spaCy. This is the brains of the operation - an open-source library for advanced NLP in Python. Another is DocArray - it's built on top of NumPy and Dask, and good for preprocessing, modeling, and analysis of text data.
-
One does not simply "create a visualization" from unstructured data!
In the example given in the article, I can't just use SQL functions to extract the age and phone number. I guess the phone number could be regexed, but ideally I should use something like spaCy and also record some kind of confidence score. This is where Spark/Dask/etc. really shine. Does Airbyte support user-defined functions in a language like Python?
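The comment above sketches a two-tier approach: regex for well-structured fields like phone numbers, something smarter (e.g. spaCy) for fuzzier ones, with a confidence score attached either way. Below is a minimal stdlib-only sketch of that idea; the patterns, field names, and score values are illustrative assumptions, not any library's API.

```python
import re

# Regexes for two fields mentioned in the comment. The phone pattern is
# deliberately simple (US-style numbers); the age pattern requires an
# explicit "age" cue so a bare number isn't misread.
PHONE_RE = re.compile(r"\(?\b\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b")
AGE_RE = re.compile(r"\b(?:aged?|age:?)\s*(\d{1,3})\b", re.IGNORECASE)

def extract_fields(text):
    """Return (field, value, confidence) tuples found in text."""
    results = []
    for m in PHONE_RE.finditer(text):
        # A fully formed phone pattern is a strong signal: score it high.
        results.append(("phone", m.group(), 0.9))
    for m in AGE_RE.finditer(text):
        # An explicit "age 34" cue is decent evidence, but weaker than a
        # structured pattern, so it gets a lower score.
        results.append(("age", m.group(1), 0.7))
    return results

fields = extract_fields("Patient, age 34, reachable at 555-867-5309.")
```

In practice you would swap the age regex for a spaCy `Matcher` or NER pass and derive the confidence from the model, but the shape of the output (value plus score) stays the same.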
-
Training on BERT without any 'context' just questions/answer tuples?
(1) For large scale processing/tokenizing your data I would consider using something like NLTK or Spacy. That's if your books are already in text form. If they are scans, you'll need to use some OCR software first.
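Tokenization is the step being recommended above. NLTK or spaCy is the robust choice at scale; this stdlib regex version is only a minimal sketch of what that step produces, so you can see the output shape before installing anything.

```python
import re

def simple_tokenize(text):
    """Split text into word and punctuation tokens.

    \w+ grabs runs of word characters; [^\w\s] grabs single
    punctuation marks, mimicking how word tokenizers separate
    trailing punctuation from words.
    """
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Who wrote it? Tolstoy, in 1869.")
```

A real tokenizer also handles contractions, abbreviations, and unicode edge cases, which is exactly why the comment points you at NLTK or spaCy for large-scale work.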
-
Has anyone here ever used the seaNMF model for short text topic modeling, and be willing to help me get started with it?
Tokenize with NLTK, SpaCy or CoreNLP
-
Transforming free-form geospatial directions into addresses - SOTA?
If you've got a specific area you're looking at, and already have street data, you could: 1. Follow the ArcGis blog's directions, creating intersection features. 2. Train a classifier (or a specific NER entity type; SpaCy would be a good package for that) on the types of cross-street references you're finding in your text. You can see some of the relevant tokens in the examples you provided - "Corner of", "along", and I'd imagine "intersection of" etc. Even simple string lookups could help you bootstrap the training data. 3. Use some sort of embedding similarity to compare the hit terms to potential cross-streets.
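Step 2's bootstrap suggestion ("even simple string lookups could help") can be sketched with nothing but the stdlib: seed a list of cue phrases and pull the text window after each hit as a candidate span for labeling. The cue list and window size here are assumptions for illustration; a trained spaCy NER entity type would replace this once you have labeled data.

```python
import re

# Cue phrases taken from the examples in the comment above.
CUES = ["corner of", "intersection of", "along"]

def find_cross_street_candidates(text):
    """Return (cue, following_text) pairs to seed NER training data."""
    candidates = []
    lowered = text.lower()
    for cue in CUES:
        for m in re.finditer(re.escape(cue), lowered):
            # Grab a short window after the cue as the candidate span;
            # a human annotator would trim it to the actual streets.
            window = text[m.end():m.end() + 40].strip()
            candidates.append((cue, window))
    return candidates

hits = find_cross_street_candidates("Meet at the corner of 5th Ave and Main St.")
```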
-
Tell HN: Selling My SaaS
Great question! Short answer: it doesn't.
While I did start with a vision of presbot as a self-learning chatbot built to act as an interactive agent representing its owner (primarily b2c) in all sorts of situations, the feedback made me realize that until that interaction is smooth, believable, and closer to an actual dynamic conversation, it provides much less value. I was using a combination of [NLTK](https://www.nltk.org/), [spaCy](https://spacy.io/) and [textblob](https://textblob.readthedocs.io/en/dev/) for NLP then.
I pivoted to a rule-based bot focusing on lead capture via a linear conversation driven by user-specified questions, more like an interactive version of a static form, with prescribed Q&A (FAQs on the platform).
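The pivot described above (a rule-based, linear conversation that works like an interactive form) is simple enough to sketch in a few lines. The field names and questions below are made up for illustration; the actual presbot flow is not public.

```python
class LinearLeadBot:
    """Asks owner-specified questions in a fixed order -- no NLP needed."""

    def __init__(self, questions):
        self.questions = questions  # list of (field_name, question_text)
        self.answers = {}
        self.index = 0

    def next_question(self):
        """Return the next question text, or None when the form is done."""
        if self.index < len(self.questions):
            return self.questions[self.index][1]
        return None

    def record_answer(self, text):
        """Store the user's reply under the current field and advance."""
        field, _ = self.questions[self.index]
        self.answers[field] = text
        self.index += 1

bot = LinearLeadBot([
    ("name", "What's your name?"),
    ("email", "Where can we reach you?"),
])
bot.record_answer("Ada")
bot.record_answer("ada@example.com")
```

Because the flow is linear and prescribed, every captured lead arrives as a complete, structured record, which is the trade the comment describes: less conversational magic, more reliable lead capture.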
-
Which not so well known Python packages do you like to use on a regular basis and why?
i work mostly in the NLP space, so other libraries i like are spaCy, nltk, and pynlp lib
-
Is it home bias or is data wrangling for machine learning in python much less intuitive and much more burdensome than in R?
Standout Python NLP libraries include spaCy and Gensim, along with the pre-trained models available on Hugging Face. These libraries have widespread use in, and support from, industry, and it shows. spaCy has best-in-class methods for pre-processing text for further applications. Gensim helps you manage your corpus of documents and contains many different tools for solving a common industry task, topic modeling.
-
How to get started with machine learning.
Given your need, I think you'll be better off with libraries like Spacy, which does NLP (rather than just DNN inference). You'll get your app much faster this way.
- There is framework for everything.
What are some alternatives?
TextBlob - Simple, Pythonic, text processing--Sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more.
Stanza - Official Stanford NLP Python Library for Many Human Languages
NLTK - NLTK Source
BERT-NER - Pytorch-Named-Entity-Recognition-with-BERT
polyglot - Multilingual text (NLP) processing toolkit
textacy - NLP, before and after spaCy
Jieba - Chinese text segmentation (结巴中文分词)
CoreNLP - Stanford CoreNLP: A Java suite of core NLP tools.
PyTorch-NLP - Basic Utilities for PyTorch Natural Language Processing (NLP)
Pattern - Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.
huggingface_hub - All the open source things related to the Hugging Face Hub.
duckling - Language, engine, and tooling for expressing, testing, and evaluating composable language rules on input strings.