|about 1 month ago||about 2 months ago|
|MIT License||MIT License|
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
GitHub – GSA/code-gov: An informative repo for all Code.gov repos
12 projects | news.ycombinator.com | 9 Sep 2023
(Apache Nutch is a Java-based web crawler which supports e.g. CommonCrawl (which backs various foundational LLMs)) https://en.wikipedia.org/wiki/Apache_Nutch#Search_engines_bu... . But extruct extracts more types of metadata and data than Nutch, AFAIU: https://github.com/scrapinghub/extruct
datasette-graphql adds a GraphQL HTTP API to a SQLite database:
8 Most Popular Python HTML Web Scraping Packages with Benchmarks
4 projects | dev.to | 1 Feb 2023
The State of Web Scraping in 2021
9 projects | news.ycombinator.com | 11 Oct 2021
Lazyweb link: https://github.com/rushter/selectolax
although I don't follow the need to have what appears to be two completely separate HTML parsing C libraries as dependencies; seeing this in the readme for Modest gives me the shivers because lxml has _seen some shit_
> Modest is a fast HTML renderer implemented as a pure C99 library with no outside dependencies.
although its other dep seems much more cognizant about the HTML5 standard, for whatever that's worth: https://github.com/lexbor/lexbor#lexbor
> It looks like the author of the article just googled some libraries for each language and didn't research the topic
Heh, oh, new to the Internet, are you? :-D
2 projects | /r/Python | 25 Nov 2021
Neither did html5lib.
Why are circular dependencies even a thing?
3 projects | /r/linuxquestions | 25 Sep 2021
Easier example... Sphinx is a documentation generator for Python programs (creating docs for a program's API from source-code comments, for example). Sphinx depends on html5lib, which in turn depends on six... want to guess what six uses to generate its API docs? ;) So if you want the API docs of six, you first have to install six without them to get a working Sphinx install, then redo the six install, this time including the API docs build.
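The bootstrap problem in that comment can be sketched with Python's standard-library `graphlib` (Python 3.9+). The graph below is a hypothetical simplification of the sphinx / html5lib / six relationship as the commenter describes it, not real package metadata, and the names `runtime_deps`, `docs_deps`, `merged`, and `install_order` are made up for illustration. It shows that the full graph has no valid install order, while dropping the optional docs-build edge yields one.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical, simplified graph: package -> packages it needs at runtime.
runtime_deps = {
    "sphinx": {"html5lib"},
    "html5lib": {"six"},
    "six": set(),
}

# Extra edge that only applies when six's API docs are built (assumption
# taken from the comment above, not from real packaging metadata).
docs_deps = {"six": {"sphinx"}}


def merged(runtime, extra):
    """Combine two dependency maps without mutating either one."""
    graph = {name: set(deps) for name, deps in runtime.items()}
    for name, deps in extra.items():
        graph.setdefault(name, set()).update(deps)
    return graph


def install_order(graph):
    """Return a dependencies-first order, or None if the graph has a cycle."""
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError:
        return None


# With the docs edge included, six -> sphinx -> html5lib -> six is a cycle.
full_order = install_order(merged(runtime_deps, docs_deps))

# Without it, a plain install order exists; the docs build is a second pass.
bootstrap_order = install_order(runtime_deps)

print("with docs edge:", full_order)
print("bootstrap pass:", bootstrap_order)
```

Under these assumptions, the usual way out is the two-pass approach the comment describes: install along `bootstrap_order` first, then rebuild six with its docs once Sphinx is available.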
What are some alternatives?
lxml - The lxml XML toolkit for Python
bleach - Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes
xhtml2pdf - A library for converting HTML into PDFs using ReportLab
lexbor - Lexbor is an open source HTML renderer library under development. https://lexbor.com
pyppeteer - Headless chrome/chromium automation library (unofficial port of puppeteer)
pyquery - A jquery-like library for python
gazpacho - 🥫 The simple, fast, and modern web scraping library
xmltodict - Python module that makes working with XML feel like you are working with JSON