Minifying HTML for GPT-4o: Just Remove All the HTML Tags

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • Jade

    Pug – robust, elegant, feature rich template engine for Node.js

    I wonder if this is because some template engines have a similarly minimalist syntax. Pug, maybe?

    https://github.com/pugjs/pug?tab=readme-ov-file#syntax

    It is whitespace-sensitive, but it essentially looks like that. I doubt it is the only template engine like this.

  • dom-to-semantic-markdown

    DOM to Semantic-Markdown for use with LLMs

    I found that reducing HTML down to Markdown using turndown or https://github.com/romansky/dom-to-semantic-markdown works well.

    If you want the AI to be able to select elements, give it cheerio or jQuery access to navigate through the HTML document.

    If you need to give tags, classes, and ids to the LLM, I use an HTML-to-Pug converter like https://www.npmjs.com/package/html2pug, which strips a lot of text and cuts costs. I don't think LLMs are particularly trained on Pug content, though, so take this with a grain of salt.

  • playwright-scrape-api

    A dead simple REST API to use Playwright to scrape the text contents from any URL.

    I came to roughly the same conclusion a few months back and wrote a simple, open-source, general-purpose scraper for use with GPT, using Playwright in C# and TypeScript, that's fairly easy to deploy and use with GPT function calling[0]. My observation was that using `document.body.innerText` was sufficient for GPT to "understand" the page.

    I use more or less this code as a starting point for a variety of use cases and it seems to work just fine. Some variations are to look for the `main` content and ignore `nav` and `footer` (or variants thereof whether using semantic tags or CSS selectors).

    [0] https://github.com/CharlieDigital/playwright-scrape-api

  • html2markdown

    Convert HTML to Markdown with Elixir (by agoodway)

    What I do is convert to Markdown; that way you still get some semantic structure. I even built an Elixir library for this: https://github.com/agoodway/html2markdown

  • strip-tags

    CLI tool for stripping tags from HTML

    I built a CLI tool (and Python library) for this a while ago called strip-tags: https://github.com/simonw/strip-tags

    By default it will strip all HTML tags and return just the text:

        curl 'https://simonwillison.net/' | strip-tags

  • simonwillisonblog

    The source code behind my blog

    I also often use the https://r.jina.ai/ proxy: append a URL to it and it extracts the key content (using Puppeteer) and returns it converted to Markdown, e.g. https://r.jina.ai/https://simonwillison.net/2024/Sep/2/anato...

  • openai-cookbook

    Examples and guides for using the OpenAI API

    Yep - one good option is to use Wikipedia pages from the recent Olympics, which GPT has no knowledge of: https://github.com/openai/openai-cookbook/blob/457f4310700f9...
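
The tag-stripping idea that runs through the comments above (strip-tags returning just the text, `document.body.innerText`, ignoring `nav` and `footer`) can be sketched with Python's standard-library `html.parser`. This is a rough illustration under stated assumptions, not the implementation of strip-tags or playwright-scrape-api: the names `TextExtractor`, `html_to_text`, `SKIPPED_TAGS`, and `BLOCK_TAGS` are hypothetical, and the choice of which elements to skip or treat as line breaks is an assumption.

```python
from html.parser import HTMLParser

# Assumption: elements whose contents are dropped entirely.
SKIPPED_TAGS = {"script", "style", "nav", "footer"}
# Assumption: elements that end a line of text when they close.
BLOCK_TAGS = {"p", "div", "li", "tr", "section", "article", "main",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring anything nested inside SKIPPED_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag == "br" and self.skip_depth == 0:
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            if self.skip_depth > 0:
                self.skip_depth -= 1
        elif tag in BLOCK_TAGS and self.skip_depth == 0:
            self.chunks.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html: str) -> str:
    """Hypothetical helper: return the text of a page with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.chunks)
    # Collapse runs of whitespace within each line and drop empty lines.
    lines = (" ".join(line.split()) for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    sample = (
        "<html><body><nav><a href='/'>Home</a></nav>"
        "<main><h1>Title</h1><p>Hello <b>world</b>.</p></main>"
        "<footer>Copyright</footer></body></html>"
    )
    print(html_to_text(sample))
```

Unlike `innerText` in a real browser, this sketch does not execute JavaScript or consult CSS, so text that a page renders dynamically or hides with styles is handled differently; for such pages a browser-based tool like the Playwright scraper mentioned above is the closer fit.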



