Top 15 HTML NLP Projects

unstructured

12 5,750 9.8 HTML

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

Project mention: LlamaCloud and LlamaParse | news.ycombinator.com | 2024-02-20

Be careful with unstructured:
https://github.com/Unstructured-IO/unstructured/blob/d11c70c...
from: https://github.com/open-webui/open-webui/issues/687

bootcamp

24 1,606 9.1 HTML

Dealing with all unstructured data, such as reverse image search, audio search, molecular search, video analysis, question and answer systems, NLP, etc. (by milvus-io)

Project mention: FLaNK AI - 01 April 2024 | dev.to | 2024-04-01

datefinder

3 625 0.0 HTML

Find dates inside text using Python and get back datetime objects
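
As a rough illustration of the idea (finding dates in free text and returning datetime objects), here is a minimal sketch that uses only the Python standard library. It does not use datefinder's own API, it only recognises ISO-style `YYYY-MM-DD` dates, and the function name `find_iso_dates` is hypothetical.

```python
import re
from datetime import datetime

# Only ISO-style dates (YYYY-MM-DD) are recognised in this sketch.
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def find_iso_dates(text):
    """Return a datetime for every valid YYYY-MM-DD date found in text."""
    found = []
    for match in ISO_DATE.finditer(text):
        year, month, day = (int(part) for part in match.groups())
        try:
            found.append(datetime(year, month, day))
        except ValueError:
            # Looks like a date but is not a real one (e.g. month 13); skip it.
            continue
    return found


dates = find_iso_dates("Released 2024-02-20, updated 2024-04-01, bogus 2024-13-45.")
```

A real-world date finder has to cope with many more formats (month names, ordinals, time zones), which is why the patterns involved tend to grow large.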

Project mention: Sneller Regex vs Ripgrep | news.ycombinator.com | 2023-05-18

That's with DFA minimization. Also, '\w' has 311 states while '(?-u)\w' has 5 states.
I don't have a precise definition of enormous or impractical. Does it matter? I suppose one obvious one is when DFA construction time starts having a significant impact on total search times.
> Additionally, the results are not the same: the number of matches is not equal to 7882. How could I make `\w` conform to other regex implementations in ripgrep?
By following UTS#18: https://unicode.org/reports/tr18/#word
Most regex engines make \w be ASCII-only by default. But most also have a way to opt into Unicode-aware mode. RE2, Go's regexp and ECMAScript are popular regex engines that have no way to change the interpretation of \w.
> Fair question how regex compilers handle nefarious regexes. Go does not handle NFA with more than 1000 states, and, as you observed, we added some more restrictions when processing the NFA. It can be an interesting academic exercise to find monstrous regexes, but we haven't encountered useful regexes that hit these limits. But I guess you know some...
It's definitely not academic. People use regexes for lexers. People use big regexes to recognize certain things like email addresses and dates. Here's a real regex used in real software to identify dates in unstructured text for example: https://github.com/akoumjian/datefinder/blob/5376ece0a522c44...
Otherwise, as I hinted at above, the thing that can make regexes very large very quickly is when you mix Unicode classes with counted repetitions. It doesn't take a lot to make them "big."
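
The difference between Unicode-aware and ASCII-only `\w` discussed above can be seen in a small sketch using Python's standard `re` module. This is an illustration only, not the engine used by ripgrep or Sneller. In Python 3, `\w` in a `str` pattern is Unicode-aware by default, and the `re.ASCII` flag restricts it to `[a-zA-Z0-9_]`, playing roughly the role that `(?-u)` plays in the comment. The function name `find_words` is hypothetical.

```python
import re


def find_words(text, ascii_only=False):
    """Return all runs of \\w characters, optionally with ASCII-only \\w."""
    flags = re.ASCII if ascii_only else 0
    return re.findall(r"\w+", text, flags=flags)


# Escapes are used so the sample is unambiguous: "naïve café 東京 abc_123".
sample = "na\u00efve caf\u00e9 \u6771\u4eac abc_123"

unicode_words = find_words(sample)                 # Unicode-aware \w
ascii_words = find_words(sample, ascii_only=True)  # ASCII-only \w
```

With Unicode-aware matching, the accented and Japanese words are kept whole. With `re.ASCII`, non-ASCII letters act as word boundaries, so the two modes return different match counts for the same input.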

Sherlock

2 517 2.0 HTML

Natural-language event parser for JavaScript (by neilgupta)
Giveme5W1H

1 500 0.0 HTML

Extraction of the journalistic five W and one H questions (5W1H) from news articles: who did what, when, where, why, and how?
awesome-python

5 232 8.4 HTML

🐍 Hand-picked awesome Python libraries and frameworks, organised by category (by dylanhogg)

Project mention: Discover Awesome Python projects | /r/Python | 2023-05-21

To be transparent, the actual scoring algorithm used on the site can be viewed here, and the data with all source features is also available if you want to play with it.

botfuel-dialog

1 101 0.0 HTML

Botfuel SDK to build highly conversational chatbots
rgpt3

1 94 5.7 HTML

Making requests from R to the GPT models

Project mention: How do I get GPT to look through hundreds of pages of PDFs? | /r/OpenAI | 2023-04-24

It has a community-maintained library.

stripnet

1 85 0.0 HTML

STriP Net: Semantic Similarity of Scientific Papers (S3P) Network
speaking_with_plato

2 8 10.0 HTML

Exploring Plato's philosophy with AI - A Data Spiral blog article
go-htmldate

1 4 5.6 HTML

CLI and Go package for extracting the publication date of web pages.
Conversations

1 3 0.0 HTML

A chat-bot that is community-driven and open source – powered by you! (WIP) (by MarketingPipeline)
datalabel

1 2 10.0 HTML

datalabel is a UI-based data editing tool that makes it easy to create labeled text data in a dataframe. With datalabel, you can quickly and effortlessly edit your data without having to write any code. Its intuitive interface makes it ideal for both experienced data professionals and those new to data editing.
hanakotoba

1 1 3.3 HTML

Exploring 花言葉 in Japanese and other literary corpora

Project mention: Evidence – Business Intelligence as Code | news.ycombinator.com | 2023-04-20

Thanks for sharing, that looks great! That Datapane example is more a hello world web app that runs Python code in the backend (so requires a backend server - Fly.io in this example).
An example of a standalone report would be something like this, from one of our users: https://cloud.datapane.com/reports/dkjbvwk/literature-in-blo... (code: https://github.com/ryancahildebrandt/hanakotoba) or https://cloud.datapane.com/reports/aAMaqoA/when-fact-is-fals...

nfl-prospects-nlp

1 1 3.7 HTML

Sentiment analysis and text generation of NFL prospect scouting reports 2014-2022

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2024-04-01.

HTML NLP related posts

Unstructured – OSS libraries and APIs to build custom preprocessing pipelines
1 project | news.ycombinator.com | 10 Jul 2023
More intelligent Pdf parsers
1 project | /r/LocalLLaMA | 15 Jun 2023
Help extracting data from multiple PDF's
1 project | /r/datascience | 6 Jun 2023
Any way to convert my handwritten diary to searchable PDFs?
2 projects | /r/linuxquestions | 27 May 2023
Pre-processing text documents such as PDFs, HTML and Word Documents for LLMs
1 project | news.ycombinator.com | 24 May 2023
How can I convert restaurant’s traditional menu in pdf file to well structured list of menu items with prices in Excel file? Thank you
1 project | /r/ChatGPTCoding | 23 Apr 2023
There is any other simple library for create agents // apps with LLM
2 projects | /r/LangChain | 21 Apr 2023

Index

What are some of the best open-source NLP projects in HTML? This list will help you:

Rank	Project	Stars
1	unstructured	5,750
2	bootcamp	1,606
3	datefinder	625
4	Sherlock	517
5	Giveme5W1H	500
6	awesome-python	232
7	botfuel-dialog	101
8	rgpt3	94
9	stripnet	85
10	speaking_with_plato	8
11	go-htmldate	4
12	Conversations	3
13	datalabel	2
14	hanakotoba	1
15	nfl-prospects-nlp	1