Top 23 text-mining Open-Source Projects
-
trafilatura
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
-
open-semantic-search
Open Source research tool to search, browse, analyze and explore large document collections by Semantic Search Engine and Open Source Text Mining & Text Analytics platform (Integrates ETL for document processing, OCR for images & PDF, named entity recognition for persons, organizations & locations, metadata management by thesaurus & ontologies, search user interface & search apps for fulltext search, faceted search & knowledge graph)
-
awesome-sentiment-analysis
Repository with everything necessary for sentiment analysis and related areas
-
awesome-bioie
🧫 A curated list of resources relevant to doing Biomedical Information Extraction (including BioNLP)
-
extractnet
A fork of Dragnet that also extracts author, headline, date, and keywords from context, with built-in metadata extraction, all in one package
-
odinson
Odinson is a powerful and highly optimized open-source framework for rule-based information extraction. Odinson couples a simple, yet powerful pattern language that can operate over multiple representations of text, with a runtime system that operates in near real time.
Project mention: Trafilatura: Python tool to gather text on the Web | news.ycombinator.com | 2023-08-14
The feature list answers that question pretty well: https://github.com/adbar/trafilatura#features
Basically: you could implement all of this on top of BeautifulSoup - polite crawling policies, sitemap and feed parsing, URL de-duplication, parallel processing, download queues, heuristics for extracting just the main article content, metadata extraction, language detection... but it would require writing an enormous amount of extra code.
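To give a sense of the "extra code" the comment refers to, here is a minimal sketch of just one item on that list, URL de-duplication, using only the Python standard library. It is not how trafilatura implements it; the function names `normalize_url` and `dedupe_urls` and the normalization rules (lowercase host, drop fragment, ignore trailing slash) are assumptions chosen for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Return a canonical form of a URL for comparison (illustrative rules only)."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    # Treat "/a" and "/a/" as the same page; an empty path becomes "/".
    path = parts.path.rstrip("/") or "/"
    # Drop the fragment: it never reaches the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def dedupe_urls(urls):
    """Keep the first occurrence of each URL, comparing normalized forms."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

Every other item on the list (polite crawling, feed parsing, download queues, and so on) would need comparable code of its own, which is the commenter's point.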
Project mention: Erik Prince, Founder of Blackwater, Faces Indictment in Austria for Trafficking Arms to Libya in Violation of UN Arms Embargo | /r/law | 2023-06-17
Others have explored Trump’s timeline and noticed this tends to hold up, and Trump himself does indeed tweet from a Samsung Galaxy. But how could we examine it quantitatively? I’ve been writing about text mining and sentiment analysis recently, particularly during my development of the tidytext R package with Julia Silge, and this is a great opportunity to apply it again.
Project mention: Need a better way to organize documents in massive nas | /r/DataHoarder | 2023-07-06
documents, pdf's - https://opensemanticsearch.org
> The objective of this competition is to link spans of text in clinical notes with specific topics in the SNOMED CT clinical terminology. Participants will train models based on real-world doctor's notes which have been de-identified and annotated with SNOMED CT concepts by medically trained professionals. This is the largest publicly available dataset of labelled clinical notes, and you can be one of the first to use it!
NER: Named Entity Recognition: https://en.wikipedia.org/wiki/Named-entity_recognition
awesome-medical-coding-nlp: https://github.com/acadTags/Awesome-medical-coding-NLP
awesome-ehr-deep-learning: https://github.com/hurcy/awesome-ehr-deeplearning
awesome-ner: https://github.com/smiyawaki0820/awesome-ner
awesome-bioie > Research groups: https://github.com/caufieldjh/awesome-bioie#groups-active-in...
SNOMED-CT as RDF: https://sphn-semantic-framework.readthedocs.io/en/latest/ext...
Project mention: Ask HN: What Underrated Open Source Project Deserves More Recognition? | news.ycombinator.com | 2024-03-07
Project mention: Ask HN: Tool to find text reuse, similar paragraphs, fuzzy/near dupes in folder? | news.ycombinator.com | 2023-06-25
Do you know of any tool that I can use to compare my own notes and documents vault in search of copied paragraphs or nearly identical phrases? Normal diffing/hashing wouldn't work, as we're talking about the contents of slightly modified documents and the comparison of each file against all others.
I found the following tools that seem related yet not quite there, maybe I'm missing a particular term of art?
https://github.com/YaleDHLab/intertext
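For context on why exact hashing fails here, a common approach to near-duplicate text is to compare overlapping word n-grams ("shingles") with Jaccard similarity. The following is a minimal standard-library sketch of that idea, not the method used by intertext; the function names, the shingle size `k=3`, and the threshold are all assumptions for illustration.

```python
import re
from itertools import combinations


def shingles(text, k=3):
    """Return the set of overlapping k-word tuples in the text (case-insensitive)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_duplicates(paragraphs, threshold=0.5, k=3):
    """Compare every paragraph against every other; return (i, j, score) above threshold."""
    sets = [shingles(p, k) for p in paragraphs]
    pairs = []
    for i, j in combinations(range(len(paragraphs)), 2):
        score = jaccard(sets[i], sets[j])
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs
```

The all-pairs loop is quadratic in the number of paragraphs, so it only suits small collections; larger vaults are usually handled with MinHash and locality-sensitive hashing, which approximate the same similarity measure.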
Project mention: [D] Finetuning for text extraction (e.g. scientific sources) | /r/MachineLearning | 2023-06-11
Odinson
text-mining related posts
- Erik Prince, Founder of Blackwater, Faces Indictment in Austria for Trafficking Arms to Libya in Violation of UN Arms Embargo
- [D] Finetuning for text extraction (e.g. scientific sources)
- Szoláris Magyar: Over the past four months I have been working on a morphology-based alternative writing system tailored to the Hungarian language (more info in the comments)
- [Q] Does anyone use R to code qualitative data?
- Stream processing. The logic is encoded in the input pattern.
- How to give a file path to a file parser when you only have an HTTPRequest?
Index
What are some of the best open-source text-mining projects? This list will help you:
Rank | Project | Stars |
---|---|---|
1 | awesome-nlp | 15,977 |
2 | textract | 3,778 |
3 | texthero | 2,864 |
4 | trafilatura | 2,778 |
5 | scattertext | 2,197 |
6 | DataScienceR | 1,959 |
7 | tidytext | 1,158 |
8 | rake-nltk | 1,034 |
9 | open-semantic-search | 905 |
10 | LDAvis | 551 |
11 | awesome-sentiment-analysis | 526 |
12 | awesome-computational-social-science | 458 |
13 | awesome-bioie | 300 |
14 | awesome-hungarian-nlp | 205 |
15 | converse | 176 |
16 | textfeatures | 167 |
17 | extractnet | 162 |
18 | huspacy | 147 |
19 | R-text-data | 140 |
20 | trrex | 134 |
21 | orange3-text | 124 |
22 | intertext | 110 |
23 | odinson | 66 |