-
I wonder if this is because some template engines use a similarly minimalist syntax. Pug, maybe?
https://github.com/pugjs/pug?tab=readme-ov-file#syntax
Pug is whitespace-sensitive, but it essentially looks like that. I doubt it is the only template engine with this style, though.
-
I found that reducing HTML to Markdown using turndown or https://github.com/romansky/dom-to-semantic-markdown works well.
If you want the AI to be able to select elements, give it cheerio or jQuery access to navigate the HTML document.
If you need to pass tags, classes, and ids to the LLM, I use an HTML-to-Pug converter like https://www.npmjs.com/package/html2pug, which strips a lot of text and cuts costs. I don't think LLMs are particularly trained on Pug content, though, so take this with a grain of salt.
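To make the HTML-to-Markdown reduction above concrete, here is a minimal sketch using only the Python standard library. It is not how turndown or dom-to-semantic-markdown work internally; the class `MarkdownSketch` and the function `html_to_markdown` are hypothetical names, and only headings, paragraphs, links, and list items are handled.

```python
import re
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class MarkdownSketch(HTMLParser):
    """Hypothetical, deliberately tiny HTML-to-Markdown reducer."""

    SKIP = {"script", "style"}  # content that carries no readable text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in HEADINGS:
            # h1 -> "# ", h2 -> "## ", and so on
            self.out.append("\n\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.out.append("\n\n")
        elif tag in ("ul", "ol"):
            self.out.append("\n")
        elif tag == "li":
            self.out.append("\n- ")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "a":
            self.out.append(f"]({self.href or ''})")
            self.href = None
        elif tag in HEADINGS or tag in ("ul", "ol"):
            self.out.append("\n\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Collapse runs of whitespace but keep word boundaries
            self.out.append(re.sub(r"\s+", " ", data))


def html_to_markdown(html: str) -> str:
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    text = "".join(parser.out)
    text = re.sub(r"[ \t]*\n[ \t]*", "\n", text)  # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)        # at most one blank line
    return text.strip()
```

Calling `html_to_markdown(page_html)` on a fetched page gives a much shorter string to put in the prompt. Tables, images, nested lists, and malformed markup are not handled here, which is where the dedicated libraries earn their keep.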
-
playwright-scrape-api
A dead simple REST API to use Playwright to scrape the text contents from any URL.
I came to roughly the same conclusion a few months back and wrote a simple, open-source, general-purpose scraper for use with GPT, using Playwright in C# and TypeScript, that's fairly easy to deploy and use with GPT function calling[0]. My observation was that using `document.body.innerText` was sufficient for GPT to "understand" the page.
I use more or less this code as a starting point for a variety of use cases and it seems to work just fine. Some variations are to look for the `main` content and ignore `nav` and `footer` (or variants thereof whether using semantic tags or CSS selectors).
[0] https://github.com/CharlieDigital/playwright-scrape-api
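The `main`-over-`nav`/`footer` variation can be approximated without a browser. The sketch below is not the code from playwright-scrape-api, and it only roughly imitates `document.body.innerText` on static HTML (no JavaScript rendering, no CSS visibility rules). `VisibleText` and `extract_text` are hypothetical names.

```python
import re
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text, skipping nav/footer/script/style, and tracks <main>."""

    IGNORED = {"script", "style", "noscript", "nav", "footer"}
    BLOCK = {"p", "div", "br", "li", "tr", "section", "article",
             "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.ignored_depth = 0
        self.main_depth = 0
        self.all_text = []
        self.main_text = []

    def _emit(self, chunk):
        self.all_text.append(chunk)
        if self.main_depth:
            self.main_text.append(chunk)

    def handle_starttag(self, tag, attrs):
        if tag in self.IGNORED:
            self.ignored_depth += 1
        elif tag == "main":
            self.main_depth += 1
        if tag in self.BLOCK:
            self._emit("\n")  # block elements start a new line

    def handle_endtag(self, tag):
        if tag in self.IGNORED and self.ignored_depth:
            self.ignored_depth -= 1
        elif tag == "main" and self.main_depth:
            self.main_depth -= 1
        if tag in self.BLOCK:
            self._emit("\n")

    def handle_data(self, data):
        if not self.ignored_depth:
            self._emit(data)


def extract_text(html: str) -> str:
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    # Prefer the <main> content when the page has any; otherwise use everything
    chunks = parser.main_text if "".join(parser.main_text).strip() else parser.all_text
    lines = (re.sub(r"\s+", " ", line).strip() for line in "".join(chunks).splitlines())
    return "\n".join(line for line in lines if line)
```

For pages that render their content client-side, the HTML would still need to come from a real browser (for example via Playwright) before a function like this is useful.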
-
What I do is convert to Markdown; that way you still get some semantic structure. I even built an Elixir library for this: https://github.com/agoodway/html2markdown
-
I built a CLI tool (and Python library) for this a while ago called strip-tags: https://github.com/simonw/strip-tags
By default it will strip all HTML tags and return just the text:
curl 'https://simonwillison.net/' | strip-tags
-
I also often use the https://r.jina.ai/ proxy - add a URL to that and it extracts the key content (using Puppeteer) and returns it converted to Markdown, e.g. https://r.jina.ai/https://simonwillison.net/2024/Sep/2/anato...
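As a rough sketch of the "add a URL to that" step: the proxy address is just the prefix followed by the full target URL. The helper names `reader_url` and `fetch_markdown` are hypothetical, and the sketch assumes a plain GET returns the Markdown as the response body; rate limits or authentication requirements are not covered here.

```python
from urllib.request import Request, urlopen

READER_PREFIX = "https://r.jina.ai/"


def reader_url(target_url: str) -> str:
    """Build the proxy URL by prepending the reader prefix to the target URL."""
    return READER_PREFIX + target_url


def fetch_markdown(target_url: str, timeout: float = 30.0) -> str:
    """Fetch the proxy URL and return the body as text (needs network access)."""
    # The User-Agent value is an arbitrary placeholder, not something the service requires
    request = Request(reader_url(target_url), headers={"User-Agent": "html-to-llm-sketch"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

With that in place, `fetch_markdown("https://simonwillison.net/")` would request `https://r.jina.ai/https://simonwillison.net/` and hand back whatever text the proxy returns.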
-
Yep - one good option is to use Wikipedia pages from the recent Olympics, which GPT has no knowledge of: https://github.com/openai/openai-cookbook/blob/457f4310700f9...