Show HN: I made a tool to clean and convert any webpage to Markdown


pandoc

420 32,396 9.8 Haskell

Universal markup converter

This is one of those things that the ever-amazing pandoc (https://pandoc.org/) does very well, on top of supporting virtually every other document format.
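As a rough illustration of the pandoc route, the sketch below pipes an HTML string through the pandoc command-line tool from Python. It assumes pandoc is installed and on PATH; the helper names `pandoc_args` and `html_to_markdown` are made up for this example. Note that pandoc converts the whole document, so navigation and other page chrome are not stripped.

```python
import shutil
import subprocess


def pandoc_args(source_format="html", target_format="gfm"):
    """Build the pandoc argument list for a format conversion."""
    # --wrap=none keeps pandoc from hard-wrapping long lines in the output.
    return ["pandoc", "--from", source_format, "--to", target_format, "--wrap=none"]


def html_to_markdown(html):
    """Convert an HTML string to Markdown by piping it through pandoc."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on PATH")
    result = subprocess.run(
        pandoc_args(),
        input=html,
        capture_output=True,
        text=True,
        check=True,  # raise if pandoc exits with an error
    )
    return result.stdout
```

Calling `html_to_markdown("<h1>Title</h1>")` would return pandoc's GitHub-flavored Markdown rendering of that fragment; passing `"commonmark"` as the target format to `pandoc_args` is another option.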

python-readability

5 2,563 3.4 Python

fast python port of arc90's readability tool, updated to match latest readability.js!

This is one of the cases where AI is not needed. There is an algorithm that works very well for extracting content from pages; one implementation: https://github.com/buriy/python-readability

tidy-html5

9 2,660 0.0 C

The granddaddy of HTML tools, with support for modern standards

markdown-clipper

2 1,045 10.0 JavaScript

Discontinued. A Firefox and Google Chrome extension to clip websites and download them into a readable markdown file. [Moved to: https://github.com/deathau/markdownload]

html2md

1 15 7.3 C++

Transform your HTML into clean, easy-to-read markdown with html2md. (by tim-gromeyer)

If anyone is looking for a C++ solution to convert HTML to Markdown, I'm using this repo https://github.com/tim-gromeyer/html2md in my app.

llm

23 2,903 9.5 Python

Access large language models from the command-line (by simonw)

That's a great use case. You might be able to do this if you've got copy and paste on the command line, with https://github.com/simonw/llm in between: an alias like pdfwtf translating to "paste | llm command | copy".
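A minimal sketch of that "paste | llm command | copy" idea, written as a small Python helper that feeds text through a chain of commands. The helper name `run_pipeline` is made up here, and the clipboard commands in the usage comment (`pbpaste` and `pbcopy`, the macOS tools) are assumptions; other systems need their own equivalents.

```python
import subprocess


def run_pipeline(stages, data=""):
    """Feed `data` through each command in `stages`, like a shell pipeline.

    Each stage is an argument list; its stdout becomes the next stage's stdin.
    """
    for argv in stages:
        result = subprocess.run(
            argv,
            input=data,
            capture_output=True,
            text=True,
            check=True,  # stop the chain if any stage fails
        )
        data = result.stdout
    return data


# Hypothetical usage on macOS (assumes llm is installed and configured):
# run_pipeline([
#     ["pbpaste"],
#     ["llm", "-s", "Convert this text to clean Markdown"],
#     ["pbcopy"],
# ])
```

Running the stages one after another is simpler than wiring up real pipes and is fine for clipboard-sized input, at the cost of holding each intermediate result in memory.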

webscrapbook

7 825 9.5 JavaScript

A browser extension that captures web pages to local device or backend server for future retrieval, organization, annotation, and edit. This project inherits from legacy Firefox add-on ScrapBook X.

scrapedown

3 77 5.7 JavaScript

A simple worker for extracting page content for a given URL

Here is an open source alternative to this tool: https://github.com/ozanmakes/scrapedown

KeenWrite

6 - -

I wrote a series of blog posts about typesetting Markdown using pandoc:
https://dave.autonoma.ca/blog
I found pandoc on its own to be a little limiting:
* Awkward to use interpolated variables within prose.
* No real-time preview prior to rendering the final document.
* Limited options for TeX support (e.g., SVG vs. inline; ConTeXt vs. LaTeX).
* Inconsistent syntax for captions and cross-references.
For my purposes, I wanted to convert variable-laden Markdown and R Markdown to text, XHTML, and PDF formats. Eventually I replaced my tool chain of yamlp + pandoc + knitr with an integrated FOSS cross-platform desktop editor.
https://keenwrite.com/
KeenWrite uses flexmark-java + Renjin to provide a solution that can replace pandoc + knitr.
Note how the captions and cross-reference syntax for images, tables, and equations is unified to use a double-colon sigil:
https://gitlab.com/DaveJarvis/KeenWrite/-/blob/main/docs/ref...

markdownload

35 2,471 5.2 JavaScript

A Firefox and Google Chrome extension to clip websites and download them into a readable markdown file.

This fork:
https://github.com/deathau/markdownload
With extensions available for Firefox, Google Chrome, Microsoft Edge and Safari.

omnivore

67 8,924 10.0 TypeScript

Omnivore is a complete, open source read-it-later solution for people who like reading.

to-markdown

5 7,902 4.8 HTML

🛏 An HTML to Markdown converter written in JavaScript

https://mixmark-io.github.io/turndown/
With some configuration you can get most of the way there.

easy-astro-blog-creator

2 7 9.0 TypeScript

An easy personal blog template for Github Pages.

I built a no-code GitHub blog deployer thingy that lets you deploy a blog to GitHub Pages from a codespace. https://github.com/ShelbyJenkins/easy-astro-blog-creator
Anyway, it uses Astro + Markdown.
It'd be really neat if I could scrape my Medium account and convert it to Markdown to save me the trouble.

parser

12 5,245 1.1 JavaScript

📜 Extract meaningful content from the chaos of a web page

Thorough scraping is challenging, especially in an environment where you don’t have (or want) a JavaScript runtime.
For content extraction, I found the approach the Postlight library takes quite neat. It scores individual HTML nodes based on some heuristics (text length, link density, CSS classes). It then selects the nodes with the highest score. [1] I ported it to Swift for a personal read-later app.
[1] https://github.com/postlight/parser
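The scoring idea described above can be sketched with Python's standard-library HTML parser. This is not Postlight's code: the hint lists, the weights, and the names `Scorer` and `best_container` are assumptions chosen for illustration. Each container is credited with the non-link text of its direct paragraphs, scaled down by link density, and adjusted by class or id hints; the sketch assumes reasonably well-formed markup.

```python
import re
from html.parser import HTMLParser

# Illustrative hint lists and weights; assumptions, not Postlight's values.
POSITIVE = re.compile(r"article|content|post|entry|main", re.I)
NEGATIVE = re.compile(r"nav|sidebar|footer|comment|promo|menu", re.I)
CONTAINERS = {"div", "article", "section", "main", "td"}
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
SKIP = {"script", "style"}


class Node:
    def __init__(self, tag, attrs):
        self.tag = tag
        self.hint = " ".join(v for k, v in attrs if k in ("class", "id") and v)
        self.text_len = 0    # all text characters in this subtree
        self.link_len = 0    # text characters that sit inside an <a>
        self.para_score = 0  # non-link text from direct <p> children


def final_score(node):
    link_density = node.link_len / node.text_len
    score = node.para_score * (1.0 - link_density)
    if POSITIVE.search(node.hint):
        score *= 1.5
    if NEGATIVE.search(node.hint):
        score *= 0.5
    return score


class Scorer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []    # currently open elements
        self.results = []  # (score, hint-or-tag) per scored container

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(Node(tag, attrs))

    def handle_data(self, data):
        n = len(data.strip())
        if n == 0 or any(node.tag in SKIP for node in self.stack):
            return
        in_link = any(node.tag == "a" for node in self.stack)
        for node in self.stack:
            node.text_len += n
            if in_link:
                node.link_len += n

    def handle_endtag(self, tag):
        if not any(node.tag == tag for node in self.stack):
            return  # stray closing tag
        while self.stack:
            node = self.stack.pop()
            self._close(node)
            if node.tag == tag:
                break

    def _close(self, node):
        if node.tag == "p" and self.stack:
            # Credit the paragraph's non-link text to its parent element.
            self.stack[-1].para_score += node.text_len - node.link_len
        elif node.tag in CONTAINERS and node.para_score > 0:
            self.results.append((final_score(node), node.hint or node.tag))


def best_container(html):
    """Return the class/id hint (or tag name) of the highest-scoring container."""
    scorer = Scorer()
    scorer.feed(html)
    scorer.close()
    return max(scorer.results, default=(0.0, None))[1]
```

A real extractor would return the winning node's content rather than its label and would tune the heuristics against many pages; the label is returned here only to keep the sketch short.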

gather-cli

1 113 5.5 Swift

I've been using gather-cli [0] for this, built by the venerable Brett Terpstra.
[0] https://github.com/ttscoff/gather-cli

metascraper

6 2,234 8.9 HTML

Get unified metadata from websites using Open Graph, Microdata, RDFa, Twitter Cards, JSON-LD, HTML, and more.

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.


This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
chrome-extension Markdown firefox-addon html-to-markdown firefox-extension
Post date: 14 Apr 2024
