| | selectolax | lambdasoup |
|---|---|---|
| Mentions | 6 | 3 |
| Stars | 970 | 376 |
| Growth | - | - |
| Activity | 7.7 | 2.4 |
| Last commit | about 2 months ago | 16 days ago |
| Language | Cython | OCaml |
| License | MIT License | MIT License |
Stars - the number of stars that a project has on GitHub. Growth - month-over-month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
selectolax
- GitHub – GSA/code-gov: An informative repo for all Code.gov repos
https://github.com/rushter/selectolax#simple-benchmark
Apache Nutch is a Java-based web crawler which supports e.g. CommonCrawl (which backs various foundational LLMs): https://en.wikipedia.org/wiki/Apache_Nutch#Search_engines_bu... But extruct extracts more types of metadata and data than Nutch, AFAIU: https://github.com/scrapinghub/extruct
datasette-graphql adds a GraphQL HTTP API to a SQLite database:
- 8 Most Popular Python HTML Web Scraping Packages with Benchmarks
selectolax
- High performance code in Python
- Web Scraping with Python: Everything you need to know to get started (2022)
try this... https://github.com/rushter/selectolax
- The State of Web Scraping in 2021
Lazyweb link: https://github.com/rushter/selectolax
although I don't follow the need to have what appears to be two completely separate HTML parsing C libraries as dependencies; seeing this in the readme for Modest gives me the shivers because lxml has _seen some shit_
> Modest is a fast HTML renderer implemented as a pure C99 library with no outside dependencies.
although its other dep seems much more cognizant about the HTML5 standard, for whatever that's worth: https://github.com/lexbor/lexbor#lexbor
---
> It looks like the author of the article just googled some libraries for each language and didn't research the topic
Heh, oh, new to the Internet, are you? :-D
- Show HN: Fast HTML5 parser for Python with multiple backends
lambdasoup
- Soupault: A static website management tool
I'm using soupault right now to make a simple company wiki (under a dozen pages). I like how it's HTML-first and easily customizable, compared to other static site generators that come with too many bells and whistles. Although now I have to make my own image compression script...
Also, fun fact: soupault is written in OCaml, which apparently has a really nice library for HTML manipulation: https://github.com/aantron/lambdasoup
- The State of Web Scraping in 2021
OCaml’s Lambda Soup (https://aantron.github.io/lambdasoup/) is an amazing library, especially for those who prefer functional programming.
- Soupault (soup-oh) is a tool that helps you create and manage static websites
It's used for sorting "widgets" (page processing steps) according to dependency lists that users can specify in the config (like `after = ["foo", "bar"]`).
Other than that, one thing I really like about OCaml is that the compiler team and most library maintainers are considerate towards downstream users with respect to compatibility.
The Lua interpreter [3] that soupault uses for its plugin API is a revived 20-year-old research project. It only needed minor modifications to build with recent compiler versions.
[1] https://github.com/aantron/lambdasoup
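The dependency ordering described in the comment above (widgets declaring `after = ["foo", "bar"]`) is a topological sort. The following is a minimal Python sketch of that idea using the standard library's `graphlib` (Python 3.9+). It is not soupault's implementation, which is written in OCaml; the widget names and the `widget_order` helper are hypothetical and exist only for illustration.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical widget config: each widget lists the widgets
# that must run before it, mirroring `after = ["foo", "bar"]`.
widgets = {
    "toc": {"after": ["insert-header", "footnotes"]},
    "footnotes": {"after": ["insert-header"]},
    "insert-header": {"after": []},
}


def widget_order(widgets):
    """Return widget names so that each one follows all of its `after` entries."""
    # graphlib expects a mapping of node -> set of predecessors.
    graph = {name: set(cfg.get("after", [])) for name, cfg in widgets.items()}

    # Reject dependencies on widgets that are not defined at all.
    unknown = {dep for deps in graph.values() for dep in deps} - set(graph)
    if unknown:
        raise ValueError(f"unknown widgets in `after`: {sorted(unknown)}")

    # static_order() raises CycleError if the dependencies are circular.
    return list(TopologicalSorter(graph).static_order())


order = widget_order(widgets)
print(order)  # ['insert-header', 'footnotes', 'toc']
```

A circular configuration (for example, `a` after `b` and `b` after `a`) makes `static_order()` raise `CycleError`, which a tool could surface to the user as a config error instead of silently choosing an order.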
What are some alternatives?
lxml - The lxml XML toolkit for Python
soupault - Static website generator based on HTML element tree rewriting
lexbor - Lexbor is development of an open source HTML Renderer library. https://lexbor.com
otoml - TOML parsing, manipulation, and pretty-printing library for OCaml (fully 1.0.0-compliant)
html5lib - Standards-compliant library for parsing and serializing HTML documents and fragments in Python
ocaml-tsort - Easy to use and user-friendly topological sort module for OCaml
pyppeteer - Headless chrome/chromium automation library (unofficial port of puppeteer)
soupault.app - The source code of the soupault.app website
pyquery - A jquery-like library for python
gazpacho - 🥫 The simple, fast, and modern web scraping library
utls - Fork of the Go standard TLS library, providing low-level access to the ClientHello for mimicry purposes.