text-mining

Top 23 text-mining Open-Source Projects

  • awesome-nlp

    :book: A curated list of resources dedicated to Natural Language Processing (NLP)

  • textract

    extract text from any document. no muss. no fuss.

  • texthero

    Text preprocessing, representation and visualization from zero to hero.

  • trafilatura

    Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments

  • Project mention: Trafilatura: Python tool to gather text on the Web | news.ycombinator.com | 2023-08-14

    The feature list answers that question pretty well: https://github.com/adbar/trafilatura#features

    Basically: you could implement all of this on top of BeautifulSoup - polite crawling policies, sitemap and feed parsing, URL de-duplication, parallel processing, download queues, heuristics for extracting just the main article content, metadata extraction, language detection... but it would require writing an enormous amount of extra code.
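    To give a sense of what even one item on that list involves, here is a minimal sketch of URL de-duplication using only the Python standard library. It is not trafilatura's implementation; the function names `normalize_url` and `deduplicate` and the normalization rules (lowercasing scheme and host, trimming a trailing slash, dropping the fragment) are assumptions chosen for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Reduce a URL to a canonical form so trivially different spellings compare equal.

    Hypothetical helper: these rules are illustrative, not taken from any library.
    """
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    # Treat "/a/" and "/a" as the same page; keep "/" for an empty path.
    path = parts.path.rstrip("/") or "/"
    # Drop the fragment: it does not change which document is fetched.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def deduplicate(urls):
    """Return URLs in first-seen order, skipping any whose normalized form repeats."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

    A real crawler would likely also need to handle tracking parameters, redirects, and default ports, which is part of why the extra code adds up.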

  • scattertext

    Beautiful visualizations of how language differs among document types.

  • DataScienceR

    a curated list of R tutorials for Data Science, NLP and Machine Learning

  • tidytext

    Text mining using tidy tools :sparkles::page_facing_up::sparkles:

  • Project mention: Erik Prince, Founder of Blackwater, Faces Indictment in Austria for Trafficking Arms to Libya in Violation of UN Arms Embargo | /r/law | 2023-06-17

    Others have explored Trump’s timeline and noticed this tends to hold up- and Trump himself does indeed tweet from a Samsung Galaxy. But how could we examine it quantitatively? I’ve been writing about text mining and sentiment analysis recently, particularly during my development of the tidytext R package with Julia Silge, and this is a great opportunity to apply it again.

  • rake-nltk

    Python implementation of the Rapid Automatic Keyword Extraction algorithm using NLTK.

  • open-semantic-search

  • Project mention: Need a better way to organize documents in massive nas | /r/DataHoarder | 2023-07-06

    documents, pdf's - https://opensemanticsearch.org

  • LDAvis

    R package for web-based interactive topic model visualization.

  • awesome-sentiment-analysis

    Repository with everything necessary for sentiment analysis and related areas

  • awesome-computational-social-science

    A list of awesome resources for Computational Social Science

  • awesome-bioie

    🧫 A curated list of resources relevant to doing Biomedical Information Extraction (including BioNLP)

  • Project mention: Snomed CT Entity Linking Challenge | news.ycombinator.com | 2023-12-22

    > The objective of this competition is to link spans of text in clinical notes with specific topics in the SNOMED CT clinical terminology. Participants will train models based on real-world doctor's notes which have been de-identified and annotated with SNOMED CT concepts by medically trained professionals. This is the largest publicly available dataset of labelled clinical notes, and you can be one of the first to use it!

    NER: Named Entity Recognition: https://en.wikipedia.org/wiki/Named-entity_recognition

    awesome-medical-coding-nlp: https://github.com/acadTags/Awesome-medical-coding-NLP

    awesome-ehr-deep-learning: https://github.com/hurcy/awesome-ehr-deeplearning

    awesome-ner: https://github.com/smiyawaki0820/awesome-ner

    awesome-bioie > Research groups: https://github.com/caufieldjh/awesome-bioie#groups-active-in...

    SNOMED-CT as RDF: https://sphn-semantic-framework.readthedocs.io/en/latest/ext...

  • awesome-hungarian-nlp

    A curated list of NLP resources for Hungarian

  • converse

    Conversational text Analysis using various NLP techniques

  • textfeatures

    👷‍♂️ A simple package for extracting useful features from character objects 👷‍♀️

  • extractnet

    A fork of Dragnet that also extracts author, headline, date, and keywords from context, with built-in metadata extraction, all in one package

  • huspacy

    HuSpaCy: industrial-strength Hungarian natural language processing

  • R-text-data

    List of textual data sources to be used for text mining in R

  • trrex

    Efficient string matching with regular expressions (by mesejo)

  • orange3-text

    🍊 :page_facing_up: Text Mining add-on for Orange3

  • Project mention: Ask HN: What Underrated Open Source Project Deserves More Recognition? | news.ycombinator.com | 2024-03-07
  • intertext

    Detect and visualize text reuse

  • Project mention: Ask HN: Tool to find text reuse, similar paragraphs, fuzzy/near dupes in folder? | news.ycombinator.com | 2023-06-25

    Do you know of any tool that I can use to compare my own notes and documents vault in search of copied paragraphs or nearly identical phrases? Normal diffing/hashing wouldn't work, as we're talking about the contents of slightly modified documents and the comparison of each file against all others.

    I found the following tools that seem related yet not quite there, maybe I'm missing a particular term of art?

    https://github.com/YaleDHLab/intertext
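    One common way to approach the near-duplicate problem described in this post is to compare overlapping word n-grams ("shingles") with Jaccard similarity. The sketch below uses only the standard library and is not how intertext itself works; the function names, the shingle size `k`, and the `threshold` value are assumptions for illustration.

```python
import re
from itertools import combinations


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles in text (lowercased, punctuation ignored)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Short texts become a single shingle so they can still be compared.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; defined here as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_duplicates(paragraphs: dict, threshold: float = 0.5, k: int = 3):
    """Compare every paragraph against every other; return pairs scoring >= threshold."""
    sets = {name: shingles(text, k) for name, text in paragraphs.items()}
    pairs = []
    for (n1, s1), (n2, s2) in combinations(sets.items(), 2):
        score = jaccard(s1, s2)
        if score >= threshold:
            pairs.append((n1, n2, score))
    # Most similar pairs first.
    return sorted(pairs, key=lambda p: p[2], reverse=True)
```

    The all-pairs loop is quadratic in the number of paragraphs, so for a large vault an approximate scheme such as MinHash with locality-sensitive hashing would probably be needed instead.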

  • odinson

    Odinson is a powerful and highly optimized open-source framework for rule-based information extraction. Odinson couples a simple, yet powerful pattern language that can operate over multiple representations of text, with a runtime system that operates in near real time.

  • Project mention: [D] Finetuning for text extraction (e.g. scientific sources) | /r/MachineLearning | 2023-06-11

    Odinson

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Index

What are some of the best open-source text-mining projects? This list will help you:

Rank Project Stars
1 awesome-nlp 15,977
2 textract 3,778
3 texthero 2,864
4 trafilatura 2,778
5 scattertext 2,197
6 DataScienceR 1,959
7 tidytext 1,158
8 rake-nltk 1,034
9 open-semantic-search 905
10 LDAvis 551
11 awesome-sentiment-analysis 526
12 awesome-computational-social-science 458
13 awesome-bioie 300
14 awesome-hungarian-nlp 205
15 converse 176
16 textfeatures 167
17 extractnet 162
18 huspacy 147
19 R-text-data 140
20 trrex 134
21 orange3-text 124
22 intertext 110
23 odinson 66
