-
This is one of those things that the ever-amazing pandoc (https://pandoc.org/) does very well, on top of supporting virtually every other document format.
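For anyone who wants to script this, here is a minimal sketch of calling pandoc from Python. It assumes pandoc is installed and on PATH; the helper names `build_pandoc_command` and `html_to_markdown` are made up for illustration, and `gfm` is just one of the Markdown flavors pandoc can write.

```python
import shutil
import subprocess


def build_pandoc_command(flavor: str = "gfm") -> list[str]:
    """Build a pandoc argument list that reads HTML on stdin and writes Markdown."""
    # --wrap=none asks pandoc not to hard-wrap long lines in the output.
    return ["pandoc", "--from", "html", "--to", flavor, "--wrap=none"]


def html_to_markdown(html: str, flavor: str = "gfm") -> str:
    """Pipe an HTML string through pandoc and return whatever it prints."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc was not found on PATH")
    result = subprocess.run(
        build_pandoc_command(flavor),
        input=html,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pandoc exits with an error
    )
    return result.stdout
```

Calling `html_to_markdown("<h1>Title</h1><p>Some <em>text</em>.</p>")` should hand back the Markdown pandoc produces for that fragment; passing `flavor="commonmark"` or `flavor="markdown"` switches the output dialect.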
-
python-readability
fast python port of arc90's readability tool, updated to match latest readability.js!
This is one of the cases where AI is not needed. There is a well-established algorithm for extracting the main content from a page; one implementation is https://github.com/buriy/python-readability
-
-
markdown-clipper
Discontinued. A Firefox and Google Chrome extension to clip websites and download them into a readable markdown file. [Moved to: https://github.com/deathau/markdownload]
-
If anyone is looking for a C++ solution to convert HTML to Markdown, I'm using this repo in my app: https://github.com/tim-gromeyer/html2md
-
That's a great use case. If you have copy and paste commands available on the command line, you might be able to do this by putting
https://github.com/simonw/llm
in between them: an alias like pdfwtf that expands to "paste | llm command | copy".
-
webscrapbook
A browser extension that captures web pages to local device or backend server for future retrieval, organization, annotation, and edit. This project inherits from legacy Firefox add-on ScrapBook X.
-
Here is an open source alternative to this tool: https://github.com/ozanmakes/scrapedown
-
I wrote a series of blog posts about typesetting Markdown using pandoc:
https://dave.autonoma.ca/blog
I found pandoc on its own to be a little limiting:
* Awkward to use interpolated variables within prose.
* No real-time preview prior to rendering the final document.
* Limited options for TeX support (e.g., SVG vs. inline; ConTeXt vs. LaTeX).
* Inconsistent syntax for captions and cross-references.
For my purposes, I wanted to convert variable-laden Markdown and R Markdown to text, XHTML, and PDF formats. Eventually I replaced my tool chain of yamlp + pandoc + knitr with an integrated FOSS cross-platform desktop editor.
https://keenwrite.com/
KeenWrite uses flexmark-java + Renjin to provide a solution that can replace pandoc + knitr.
Note how the captions and cross-reference syntax for images, tables, and equations is unified to use a double-colon sigil:
https://gitlab.com/DaveJarvis/KeenWrite/-/blob/main/docs/ref...
-
markdownload
A Firefox and Google Chrome extension to clip websites and download them into a readable markdown file.
This fork:
https://github.com/deathau/markdownload
Extensions are available for Firefox, Google Chrome, Microsoft Edge and Safari.
-
-
https://mixmark-io.github.io/turndown/
With some configuration you can get most of the way there.
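To illustrate what "configuration" buys you in a rule-based converter, here is a toy sketch in Python using only the standard library. It is not Turndown and does not use its API; the `heading_style` and `bullet` parameters are hypothetical stand-ins loosely modeled on the kind of options Turndown documents (such as `headingStyle` and `bulletListMarker`). It only handles headings, paragraphs, flat lists, links, and basic emphasis.

```python
import re
from html.parser import HTMLParser

INLINE_MARKS = {"em": "_", "i": "_", "strong": "**", "b": "**", "code": "`"}
HEADING_LEVELS = {f"h{n}": n for n in range(1, 7)}
BLOCK_TAGS = {"p", "li", *HEADING_LEVELS}


class MarkdownWriter(HTMLParser):
    def __init__(self, heading_style="atx", bullet="-"):
        super().__init__()
        self.heading_style = heading_style  # "atx" (# Title) or "setext" (underlined)
        self.bullet = bullet                # marker used for list items
        self.blocks = []                    # finished (tag, markdown) pairs
        self.inline = []                    # pieces of the block being built
        self.href = ""

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush("p")  # loose text before a block becomes its own paragraph
        elif tag in INLINE_MARKS:
            self.inline.append(INLINE_MARKS[tag])
        elif tag == "a":
            self.href = dict(attrs).get("href") or ""
            self.inline.append("[")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush(tag)
        elif tag in INLINE_MARKS:
            self.inline.append(INLINE_MARKS[tag])
        elif tag == "a":
            self.inline.append(f"]({self.href})")

    def handle_data(self, data):
        # Collapse runs of whitespace but keep the spaces around inline tags.
        self.inline.append(re.sub(r"\s+", " ", data))

    def flush(self, tag):
        text = "".join(self.inline).strip()
        self.inline = []
        if not text:
            return
        if tag in HEADING_LEVELS:
            level = HEADING_LEVELS[tag]
            if self.heading_style == "setext" and level <= 2:
                text = f"{text}\n{('=' if level == 1 else '-') * len(text)}"
            else:
                text = f"{'#' * level} {text}"
        elif tag == "li":
            text = f"{self.bullet} {text}"
        self.blocks.append((tag, text))


def html_to_markdown(html, heading_style="atx", bullet="-"):
    writer = MarkdownWriter(heading_style, bullet)
    writer.feed(html)
    writer.close()
    writer.flush("p")  # emit trailing text that was not inside a block tag
    parts, previous = [], None
    for tag, text in writer.blocks:
        if previous is not None:
            # Consecutive list items sit on adjacent lines; other blocks get a blank line.
            parts.append("\n" if previous == tag == "li" else "\n\n")
        parts.append(text)
        previous = tag
    return "".join(parts)
```

Real pages need far more rules than this (nested lists, tables, code blocks, escaping), which is where a maintained library earns its keep.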
-
I built a no-code GitHub blog deployer thingy that lets you deploy a blog to GitHub Pages from a codespace. https://github.com/ShelbyJenkins/easy-astro-blog-creator
Anyways, it uses astro + markdown.
It'd be really neat if I could scrape my Medium account and convert the posts to Markdown to save me the trouble.
-
Thorough scraping is challenging, especially in an environment where you don't have (or want) a JavaScript runtime.
For content extraction, I found the approach the Postlight library takes quite neat. It scores individual HTML nodes based on some heuristics (text length, link density, CSS classes). It then selects the nodes with the highest score. [1] I ported it to Swift for a personal read-later app.
[1] https://github.com/postlight/parser
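A rough sketch of that kind of scoring, in Python with only the standard library. This is not Postlight's code: the keyword lists, the weights (25 points per class hint, one point per 100 characters), and the rule that text is credited to its nearest container are all assumptions chosen to keep the example short.

```python
import re
from html.parser import HTMLParser

# Hypothetical keyword lists; real extractors use longer, tuned ones.
POSITIVE = re.compile(r"article|body|content|entry|main|post|text", re.I)
NEGATIVE = re.compile(r"comment|footer|header|menu|nav|promo|sidebar|sponsor", re.I)
CANDIDATE_TAGS = {"article", "div", "main", "section", "td"}
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Block:
    """A candidate container and the text statistics collected for it."""
    def __init__(self, hints):
        self.hints = hints    # class and id values, used for keyword matching
        self.text = []        # text fragments credited to this block
        self.link_chars = 0   # characters that sit inside <a> elements


class BlockCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []       # open elements as (tag, Block or None)
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        block = None
        if tag in CANDIDATE_TAGS:
            attr = dict(attrs)
            block = Block(f"{attr.get('class') or ''} {attr.get('id') or ''}")
            self.blocks.append(block)
        self.stack.append((tag, block))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray closing tags are ignored.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        tags = [tag for tag, _ in self.stack]
        if not text or "script" in tags or "style" in tags:
            return
        # Credit the text to the nearest enclosing candidate container.
        owner = next((b for _, b in reversed(self.stack) if b is not None), None)
        if owner is None:
            return
        owner.text.append(text)
        if "a" in tags:
            owner.link_chars += len(text)


def score(block):
    chars = sum(len(fragment) for fragment in block.text)
    if chars == 0:
        return 0.0
    link_density = block.link_chars / chars
    points = chars / 100 * (1 - link_density)  # long, link-poor text scores high
    if POSITIVE.search(block.hints):
        points += 25
    if NEGATIVE.search(block.hints):
        points -= 25
    return points


def extract_main_text(html):
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    if not collector.blocks:
        return ""
    return " ".join(max(collector.blocks, key=score).text)
```

Crediting text only to the nearest container keeps an outer wrapper div from winning just because it contains everything; the trade-off is that an article split across many sibling divs would be undercounted, which real implementations address by propagating scores to parent nodes.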
-
I've been using gather-cli [0] for this, built by the venerable Brett Terpstra.
[0] https://github.com/ttscoff/gather-cli
-
metascraper
Get unified metadata from websites using Open Graph, Microdata, RDFa, Twitter Cards, JSON-LD, HTML, and more.