-
python-readability
fast python port of arc90's readability tool, updated to match latest readability.js!
-
markdown-clipper
Discontinued. A Firefox and Google Chrome extension to clip websites and download them into a readable markdown file. [Moved to: https://github.com/deathau/markdownload]
-
webscrapbook
A browser extension that captures web pages to local device or backend server for future retrieval, organization, annotation, and edit. This project inherits from legacy Firefox add-on ScrapBook X.
-
markdownload
A Firefox and Google Chrome extension to clip websites and download them into a readable markdown file.
-
metascraper
Get unified metadata from websites using Open Graph, Microdata, RDFa, Twitter Cards, JSON-LD, HTML, and more.
This is one of those things that the ever-amazing pandoc (https://pandoc.org/) does very well, on top of supporting virtually every other document format.
This is one of the cases where AI is not needed. There is an algorithm that works very well for extracting content from pages; one implementation: https://github.com/buriy/python-readability
If anyone is looking for a C++ solution to convert HTML to Markdown, I'm using this repo in my app: https://github.com/tim-gromeyer/html2md
That's a great use case. You might be able to do this if you've got copy and paste on the command line, with
https://github.com/simonw/llm
in between: an alias like pdfwtf translating to "paste | llm command | copy".
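A rough Python sketch of the "paste | llm command | copy" idea, in case a plain shell alias is not convenient. The run_pipeline helper and the CLIPBOARD_TO_MARKDOWN stage list are hypothetical names; pbpaste/pbcopy are the macOS clipboard tools (an assumption, swap in xclip or similar elsewhere), and the llm system-prompt flag should be checked against the llm docs before relying on it.

```python
import subprocess


def run_pipeline(stages, input_text=""):
    """Feed text through each command in turn, like `a | b | c` in a shell.

    `stages` is a list of argv lists; each stage receives the previous
    stage's stdout on its stdin. Raises CalledProcessError if a stage fails.
    """
    data = input_text
    for argv in stages:
        result = subprocess.run(
            argv, input=data, capture_output=True, text=True, check=True
        )
        data = result.stdout
    return data


# Hypothetical stage list for the "pdfwtf"-style alias (not executed here).
# Assumes macOS clipboard tools and an installed, configured `llm` CLI.
CLIPBOARD_TO_MARKDOWN = [
    ["pbpaste"],                                       # paste
    ["llm", "-s", "Convert this text to Markdown."],   # llm command
    ["pbcopy"],                                        # copy
]

# To run it for real: run_pipeline(CLIPBOARD_TO_MARKDOWN)
```

Nothing runs at import time; the stage list is only data until it is passed to run_pipeline.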
Here is an open source alternative to this tool: https://github.com/ozanmakes/scrapedown
I wrote a series of blog posts about typesetting Markdown using pandoc:
https://dave.autonoma.ca/blog
I found pandoc on its own to be a little limiting:
* Awkward to use interpolated variables within prose.
* No real-time preview prior to rendering the final document.
* Limited options for TeX support (e.g., SVG vs. inline; ConTeXt vs. LaTeX).
* Inconsistent syntax for captions and cross-references.
For my purposes, I wanted to convert variable-laden Markdown and R Markdown to text, XHTML, and PDF formats. Eventually I replaced my tool chain of yamlp + pandoc + knitr with an integrated FOSS cross-platform desktop editor.
https://keenwrite.com/
KeenWrite uses flexmark-java + Renjin to provide a solution that can replace pandoc + knitr.
Note how the captions and cross-reference syntax for images, tables, and equations is unified to use a double-colon sigil:
https://gitlab.com/DaveJarvis/KeenWrite/-/blob/main/docs/ref...
This fork:
https://github.com/deathau/markdownload
With extensions available for Firefox, Google Chrome, Microsoft Edge, and Safari.
https://mixmark-io.github.io/turndown/
With some configuration you can get most of the way there.
I built a no-code GitHub blog deployer thingy that lets you deploy a blog to GitHub Pages from a codespace. https://github.com/ShelbyJenkins/easy-astro-blog-creator
Anyways, it uses astro + markdown.
It'd be really neat if I could scrape my medium account to convert it to markdown to save me the trouble.
Thorough scraping is challenging, especially in an environment where you don't have (or want) a JavaScript runtime.
For content extraction, I found the approach the Postlight library takes quite neat. It scores individual HTML nodes based on some heuristics (text length, link density, CSS classes). It then selects the nodes with the highest score. [1] I ported it to Swift for a personal read-later app.
[1] https://github.com/postlight/parser
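To make the scoring idea concrete, here is a minimal stdlib-only Python sketch of that kind of heuristic. It is not Postlight's code: the class/id word lists, the point values, and every name in it (Node, TreeBuilder, best_candidate) are illustrative assumptions, loosely in the spirit of readability-style extractors. Paragraphs earn points by text length, the points go to their parent (and half to the grandparent), class/id hints add or subtract a fixed weight, and the total is scaled down by link density.

```python
import re
from html.parser import HTMLParser

# Hypothetical class/id hints; real extractors ship much longer lists.
POSITIVE = re.compile(r"article|content|post|entry|main", re.I)
NEGATIVE = re.compile(r"nav|footer|sidebar|comment|menu|promo", re.I)
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    def __init__(self, tag, attrs=(), parent=None):
        self.tag = tag
        self.hint = " ".join(v or "" for k, v in attrs if k in ("class", "id"))
        self.parent = parent
        self.children = []
        self.text_len = 0  # characters of text in this subtree
        self.link_len = 0  # characters of that text that sit inside <a>


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest matching open tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        size = len(data.strip())
        if size == 0:
            return
        chain, in_link, node = [], False, self.current
        while node is not None:
            if node.tag in ("script", "style"):
                return  # not readable text
            in_link = in_link or node.tag == "a"
            chain.append(node)
            node = node.parent
        for node in chain:
            node.text_len += size
            if in_link:
                node.link_len += size


def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def class_weight(node):
    weight = 0.0
    if POSITIVE.search(node.hint):
        weight += 25.0
    if NEGATIVE.search(node.hint):
        weight -= 25.0
    return weight


def best_candidate(html):
    """Return the highest-scoring container Node, or None if nothing scores."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    scores = {}
    for node in iter_nodes(builder.root):
        if node.tag != "p" or node.text_len < 25:
            continue  # only reasonably long paragraphs earn points
        points = 1.0 + min(node.text_len / 100.0, 3.0)
        grandparent = node.parent.parent
        for ancestor, share in ((node.parent, 1.0), (grandparent, 0.5)):
            if ancestor is None or ancestor.tag == "root":
                continue
            scores.setdefault(ancestor, class_weight(ancestor))
            scores[ancestor] += points * share
    if not scores:
        return None
    # Scale by (1 - link density) so link-heavy blocks lose out.
    return max(scores, key=lambda n: scores[n] * (1 - n.link_len / n.text_len))
```

Calling best_candidate(page_html) returns a Node whose tag and hint (class/id string) identify the winning container. A real extractor would also keep the node's markup around so it can be converted to Markdown afterwards.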
I've been using gather-cli [0] for this, built by the venerable Brett Terpstra.
[0] https://github.com/ttscoff/gather-cli