Show HN: I made a tool to clean and convert any webpage to Markdown

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • pandoc

    Universal markup converter

  • This is one of those things that the ever-amazing pandoc (https://pandoc.org/) does very well, on top of supporting virtually every other document format.
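As a rough illustration of the pandoc route, the sketch below wraps a pandoc call in Python. It assumes the `pandoc` executable is installed and on PATH; the function names and the choice of GitHub-flavored Markdown (`gfm`) as the output format are illustrative assumptions, not something the comment specifies.

```python
import subprocess


def pandoc_html_to_markdown_cmd(source, output_path):
    """Build a pandoc command that reads HTML and writes GitHub-flavored Markdown.

    `source` is an HTML file path. The function name and the `gfm` target
    are assumptions made for this sketch.
    """
    return ["pandoc", "-f", "html", "-t", "gfm", "-o", output_path, source]


def convert(source, output_path):
    # Needs the pandoc executable on PATH; raises CalledProcessError on failure.
    subprocess.run(pandoc_html_to_markdown_cmd(source, output_path), check=True)
```

Note that pandoc converts the whole page, navigation and footers included, so it is usually paired with a content-extraction step when the goal is a clean article.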

  • python-readability

    fast python port of arc90's readability tool, updated to match latest readability.js!

  • This is one of the cases where AI is not needed. There is an algorithm that works very well for extracting content from pages; one implementation is https://github.com/buriy/python-readability

  • tidy-html5

    The granddaddy of HTML tools, with support for modern standards

  • markdown-clipper

    Discontinued. A Firefox and Google Chrome extension to clip websites and download them into a readable markdown file. [Moved to: https://github.com/deathau/markdownload]

  • html2md

    Transform your HTML into clean, easy-to-read markdown with html2md. (by tim-gromeyer)

  • If anyone is looking for a C++ solution to convert HTML to Markdown, I'm using this repo in my app: https://github.com/tim-gromeyer/html2md

  • llm

    Access large language models from the command-line (by simonw)

  • That's a great use case. You might be able to do this if you have copy and paste commands on the command line, with

    https://github.com/simonw/llm

    in between: an alias like pdfwtf translating to "paste | llm command | copy".
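The "paste | llm command | copy" idea can be sketched in Python as a small command chain. Everything specific here is an assumption: `pbpaste` and `pbcopy` are the macOS clipboard tools (on Linux, something like xclip would fill that role), the prompt text is invented, and `clipboard_to_markdown` is a hypothetical name standing in for the pdfwtf alias.

```python
import subprocess


def run_pipeline(commands, input_text=""):
    """Pass text through each command in turn, like `a | b | c` in a shell.

    Each command runs to completion and its stdout becomes the next
    command's stdin. Raises CalledProcessError if any step fails.
    """
    data = input_text
    for argv in commands:
        result = subprocess.run(
            argv, input=data, capture_output=True, text=True, check=True
        )
        data = result.stdout
    return data


def clipboard_to_markdown():
    # Assumed tools: pbpaste/pbcopy (macOS) and the llm CLI, which reads
    # piped stdin as input and takes the prompt as its argument.
    return run_pipeline([
        ["pbpaste"],
        ["llm", "Rewrite this text as clean Markdown"],
        ["pbcopy"],
    ])
```

Unlike a real shell pipe, this version does not stream: each step finishes before the next starts, which is fine for clipboard-sized input.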

  • webscrapbook

    A browser extension that captures web pages to local device or backend server for future retrieval, organization, annotation, and edit. This project inherits from legacy Firefox add-on ScrapBook X.

  • scrapedown

    A simple worker for extracting page content for a given URL

  • Here is an open source alternative to this tool: https://github.com/ozanmakes/scrapedown

  • KeenWrite

  • I wrote a series of blog posts about typesetting Markdown using pandoc:

    https://dave.autonoma.ca/blog

    I found pandoc on its own to be a little limiting:

    * Awkward to use interpolated variables within prose.

    * No real-time preview prior to rendering the final document.

    * Limited options for TeX support (e.g., SVG vs. inline; ConTeXt vs. LaTeX).

    * Inconsistent syntax for captions and cross-references.

    For my purposes, I wanted to convert variable-laden Markdown and R Markdown to text, XHTML, and PDF formats. Eventually I replaced my tool chain of yamlp + pandoc + knitr with an integrated FOSS cross-platform desktop editor.

    https://keenwrite.com/

    KeenWrite uses flexmark-java + Renjin to provide a solution that can replace pandoc + knitr.

    Note how the captions and cross-reference syntax for images, tables, and equations is unified to use a double-colon sigil:

    https://gitlab.com/DaveJarvis/KeenWrite/-/blob/main/docs/ref...

  • markdownload

    A Firefox and Google Chrome extension to clip websites and download them into a readable markdown file.

  • This fork:

    https://github.com/deathau/markdownload

    Extensions are available for Firefox, Google Chrome, Microsoft Edge, and Safari.

  • omnivore

    Omnivore is a complete, open source read-it-later solution for people who like reading.

  • to-markdown

    An HTML to Markdown converter written in JavaScript

  • https://mixmark-io.github.io/turndown/

    With some configuration you can get most of the way there.

  • easy-astro-blog-creator

    An easy personal blog template for Github Pages.

  • I built a no-code GitHub blog deployer thingy that lets you deploy a blog to GitHub Pages from a codespace. https://github.com/ShelbyJenkins/easy-astro-blog-creator

    Anyway, it uses Astro + Markdown.

    It'd be really neat if I could scrape my Medium account and convert it to Markdown to save me the trouble.

  • parser

    Extract meaningful content from the chaos of a web page

  • Thorough scraping is challenging, especially in an environment where you don't have (or want) a JavaScript runtime.

    For content extraction, I found the approach the Postlight library takes quite neat. It scores individual HTML nodes based on some heuristics (text length, link density, CSS classes). It then selects the nodes with the highest score. [1] I ported it to Swift for a personal read-later app.

    [1] https://github.com/postlight/parser
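The scoring idea described above (text length, link density, CSS classes) can be sketched with only the Python standard library. This is not Postlight's code: the class-name patterns, the 25-point bonus, the set of container tags, and all names below are assumptions chosen for illustration, and the sketch credits text only to the nearest enclosing container rather than propagating scores to ancestors.

```python
from html.parser import HTMLParser
import re

# Hypothetical class/id hints; real extractors use longer, tuned lists.
POSITIVE = re.compile(r"article|content|post|entry", re.I)
NEGATIVE = re.compile(r"comment|sidebar|footer|nav|promo", re.I)
CONTAINERS = {"div", "article", "section", "main", "td"}


class Candidate:
    def __init__(self, css):
        self.css = css        # class and id attribute values, joined
        self.chunks = []      # text credited to this container
        self.text_len = 0     # total characters of credited text
        self.link_len = 0     # characters of that text found inside <a>


class CandidateCollector(HTMLParser):
    """Credits each text run to its nearest enclosing container element."""

    def __init__(self):
        super().__init__()
        self.open = []        # stack of containers currently open
        self.done = []        # containers that have been closed
        self.link_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1
        elif tag in CONTAINERS:
            a = dict(attrs)
            self.open.append(Candidate(f"{a.get('class') or ''} {a.get('id') or ''}"))

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth:
            self.link_depth -= 1
        elif tag in CONTAINERS and self.open:
            self.done.append(self.open.pop())

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.open:
            return
        node = self.open[-1]
        node.chunks.append(text)
        node.text_len += len(text)
        if self.link_depth:
            node.link_len += len(text)


def score(node):
    if node.text_len == 0:
        return 0.0
    link_density = node.link_len / node.text_len
    points = node.text_len * (1.0 - link_density)  # long, link-poor text wins
    if POSITIVE.search(node.css):
        points += 25
    if NEGATIVE.search(node.css):
        points -= 25
    return points


def extract_main_text(html):
    collector = CandidateCollector()
    collector.feed(html)
    collector.close()
    candidates = collector.done + collector.open  # include unclosed containers
    if not candidates:
        return ""
    return " ".join(max(candidates, key=score).chunks)
```

A navigation bar made entirely of links has a link density of 1 and scores near zero, while a long block of plain prose in an element whose class mentions "article" or "content" scores highest. The sketch assumes reasonably well-formed HTML; it makes no attempt to recover from mismatched tags.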

  • gather-cli

  • I've been using gather-cli [0] for this, built by the venerable Brett Terpstra.

    [0] https://github.com/ttscoff/gather-cli

  • metascraper

    Get unified metadata from websites using Open Graph, Microdata, RDFa, Twitter Cards, JSON-LD, HTML, and more.

