Show HN: I made a tool to clean and convert any webpage to Markdown

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  1. pandoc

    Universal markup converter

    This is one of those things that the ever-amazing pandoc (https://pandoc.org/) does very well, on top of supporting virtually every other document format.

  2. Civic Auth

    Auth in Less Than 5 Minutes. Civic Auth comes with multiple SSO options, optional embedded wallets, and user management, all implemented with just a few lines of code. Start building today.

  3. python-readability

    fast python port of arc90's readability tool, updated to match latest readability.js!

    This is one of the cases where AI is not needed. There is a well-working algorithm for extracting content from pages; one implementation is https://github.com/buriy/python-readability

  4. tidy-html5

    The granddaddy of HTML tools, with support for modern standards

  5. markdown-clipper

    Discontinued. A Firefox and Google Chrome extension to clip websites and download them into a readable markdown file. [Moved to: https://github.com/deathau/markdownload]

  6. html2md

    Transform your HTML into clean, easy-to-read markdown with html2md. (by tim-gromeyer)

    If anyone is looking for a C++ solution to convert HTML to Markdown, I'm using this repo in my app: https://github.com/tim-gromeyer/html2md

  7. llm

    Access large language models from the command-line

    That's a great use case. You might be able to do this if you've got copy and paste on the command line, with https://github.com/simonw/llm in between: an alias like pdfwtf translating to "paste | llm command | copy".
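The "paste | llm command | copy" idea might be sketched in Python roughly as follows. This is only an illustration under assumptions: `run_pipeline` and `PDFWTF` are hypothetical names, the stage list assumes macOS (`pbpaste`/`pbcopy`) with the `llm` CLI installed and configured, and the system prompt text is invented for the example.

```python
import subprocess


def run_pipeline(stages, input_text=""):
    """Run each command in turn, feeding one stage's stdout to the next stage's stdin."""
    data = input_text
    for argv in stages:
        done = subprocess.run(
            argv, input=data, capture_output=True, text=True, check=True
        )
        data = done.stdout
    return data


# Hypothetical stage list (assumption: macOS clipboard tools and the llm CLI are on PATH).
PDFWTF = [
    ["pbpaste"],                                          # paste: read the clipboard
    ["llm", "-s", "Rewrite this text as clean Markdown"],  # llm with an example system prompt
    ["pbcopy"],                                           # copy: write the result back
]

# On a machine with those tools installed, this would run the whole chain:
# run_pipeline(PDFWTF)
```

On Linux the first and last stages would need a different clipboard tool; the chaining function itself does not depend on which commands are used.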

  8. webscrapbook

    A browser extension that captures web pages to local device or backend server for future retrieval, organization, annotation, and edit. This project inherits from legacy Firefox add-on ScrapBook X.

  9. CodeRabbit

    CodeRabbit: AI Code Reviews for Developers. Revolutionize your code reviews with AI. CodeRabbit offers PR summaries, code walkthroughs, 1-click suggestions, and AST-based analysis. Boost productivity and code quality across all major languages with each PR.

  10. scrapedown

    A simple worker for extracting page content for a given URL

    Here is an open source alternative to this tool: https://github.com/ozanmakes/scrapedown

  11. KeenWrite

    I wrote a series of blog posts about typesetting Markdown using pandoc:

    https://dave.autonoma.ca/blog

    I found pandoc on its own to be a little limiting:

    * Awkward to use interpolated variables within prose.

    * No real-time preview prior to rendering the final document.

    * Limited options for TeX support (e.g., SVG vs. inline; ConTeXt vs. LaTeX).

    * Inconsistent syntax for captions and cross-references.

    For my purposes, I wanted to convert variable-laden Markdown and R Markdown to text, XHTML, and PDF formats. Eventually I replaced my tool chain of yamlp + pandoc + knitr with an integrated FOSS cross-platform desktop editor.

    https://keenwrite.com/

    KeenWrite uses flexmark-java + Renjin to provide a solution that can replace pandoc + knitr.

    Note how the captions and cross-reference syntax for images, tables, and equations is unified to use a double-colon sigil:

    https://gitlab.com/DaveJarvis/KeenWrite/-/blob/main/docs/ref...

  12. markdownload

    A Firefox and Google Chrome extension to clip websites and download them into a readable markdown file.

    This fork:

    https://github.com/deathau/markdownload

    With extensions available for Firefox, Google Chrome, Microsoft Edge, and Safari.

  13. omnivore

    Omnivore is a complete, open source read-it-later solution for people who like reading.

  14. to-markdown

    🛏 An HTML to Markdown converter written in JavaScript

    https://mixmark-io.github.io/turndown/

    With some configuration you can get most of the way there.

  15. easy-astro-blog-creator

    An easy personal blog template for Github Pages.

    I built a no-code GitHub blog deployer thingy that lets you deploy a blog to GitHub Pages from a codespace. https://github.com/ShelbyJenkins/easy-astro-blog-creator

    Anyways, it uses astro + markdown.

    It'd be really neat if I could scrape my medium account to convert it to markdown to save me the trouble.

  16. parser

    📜 Extract meaningful content from the chaos of a web page

    Thorough scraping is challenging, especially in an environment where you don't have (or want) a JavaScript runtime.

    For content extraction, I found the approach the Postlight library takes quite neat. It scores individual HTML nodes based on some heuristics (text length, link density, CSS classes). It then selects the nodes with the highest score. [1] I ported it to Swift for a personal read-later app.

    [1] https://github.com/postlight/parser
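A minimal sketch of that kind of node scoring, using only the Python standard library. It is not Postlight's code: the word lists, the ±25 class weight, and the names `Scorer`, `score`, and `extract_main_text` are assumptions made for illustration. Each container is credited with the text of its direct `<p>` children, discounted by link density and nudged by class/id hints.

```python
import re
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
SKIP_TAGS = {"script", "style", "noscript"}
# Hypothetical hint lists; real extractors use longer, tuned ones.
POSITIVE = re.compile(r"article|body|content|entry|main|post|text", re.I)
NEGATIVE = re.compile(r"comment|footer|nav|promo|related|share|sidebar", re.I)


class Node:
    def __init__(self, tag, attrs, parent):
        self.tag = tag
        self.hint = " ".join(filter(None, (attrs.get("class"), attrs.get("id"))))
        self.parent = parent
        self.text_len = 0   # all text inside this node
        self.link_len = 0   # text that sits inside <a> elements
        self.para_len = 0   # text inside direct <p> children
        self.chunks = []    # text pieces in document order


class Scorer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements
        self.nodes = []   # every element seen

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never get a closing tag
        node = Node(tag, dict(attrs), self.stack[-1] if self.stack else None)
        self.stack.append(node)
        self.nodes.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray closers are ignored.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text or any(n.tag in SKIP_TAGS for n in self.stack):
            return
        in_link = any(n.tag == "a" for n in self.stack)
        for node in self.stack:
            node.text_len += len(text)
            node.chunks.append(text)
            if in_link:
                node.link_len += len(text)
        # Credit paragraph text to the container that holds the paragraph.
        for node in reversed(self.stack):
            if node.tag == "p":
                if node.parent is not None:
                    node.parent.para_len += len(text)
                break


def score(node):
    if node.text_len == 0:
        return 0.0
    link_density = node.link_len / node.text_len
    value = node.para_len * (1.0 - link_density)
    if POSITIVE.search(node.hint):
        value += 25
    if NEGATIVE.search(node.hint):
        value -= 25
    return value


def extract_main_text(html):
    parser = Scorer()
    parser.feed(html)
    parser.close()
    if not parser.nodes:
        return ""
    best = max(parser.nodes, key=score)
    return " ".join(best.chunks)


SAMPLE = """
<html><body>
<div class="sidebar"><p><a href="/a">Home</a> <a href="/b">About us</a></p></div>
<div class="post-content">
<p>Scoring nodes by text length works well for long articles.</p>
<p>Link-heavy blocks are penalized, so menus rarely win.</p>
</div>
<div class="footer"><p>Copyright notice</p></div>
</body></html>
"""

print(extract_main_text(SAMPLE))
```

With the sample above, the link-only sidebar and the short footer score below the `post-content` container, so only the two article paragraphs are returned. Malformed markup (for example unclosed `<p>` tags) is handled only crudely here.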

  17. gather-cli

    I've been using gather-cli [0] for this, built by the venerable Brett Terpstra.

    [0] https://github.com/ttscoff/gather-cli

  18. metascraper

    Get unified metadata from websites using Open Graph, Microdata, RDFa, Twitter Cards, JSON-LD, HTML, and more.

  19. InfluxDB

    InfluxDB high-performance time series database. Collect, organize, and act on massive volumes of high-resolution data to power real-time intelligent systems.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.
