selectolax
datasette-lite
Metric | selectolax | datasette-lite
---|---|---
Mentions | 6 | 10
Stars | 970 | 309
Growth | - | -
Activity | 7.7 | 5.4
Last commit | about 2 months ago | about 1 month ago
Language | Cython | HTML
License | MIT License | Apache License 2.0
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
selectolax
-
GitHub – GSA/code-gov: An informative repo for all Code.gov repos
https://github.com/rushter/selectolax#simple-benchmark )
(Apache Nutch is a Java-based web crawler which supports e.g. CommonCrawl (which backs various foundational LLMs)) https://en.wikipedia.org/wiki/Apache_Nutch#Search_engines_bu... . But extruct extracts more types of metadata and data than Nutch AFAIU: https://github.com/scrapinghub/extruct )
datasette-graphql adds a GraphQL HTTP API to a SQLite database:
-
8 Most Popular Python HTML Web Scraping Packages with Benchmarks
selectolax
- High performance code in Python
-
Web Scraping with Python: Everything you need to know to get started (2022)
try this... https://github.com/rushter/selectolax
-
The State of Web Scraping in 2021
Lazyweb link: https://github.com/rushter/selectolax
although I don't follow the need to have what appears to be two completely separate HTML parsing C libraries as dependencies; seeing this in the readme for Modest gives me the shivers because lxml has _seen some shit_
> Modest is a fast HTML renderer implemented as a pure C99 library with no outside dependencies.
although its other dep seems much more cognizant about the HTML5 standard, for whatever that's worth: https://github.com/lexbor/lexbor#lexbor
---
> It looks like the author of the article just googled some libraries for each language and didn't research the topic
Heh, oh, new to the Internet, are you? :-D
- Show HN: Fast HTML5 parser for Python with multiple backends
datasette-lite
-
Sqlime: Online SQLite Playground
Also see: https://github.com/simonw/datasette-lite
- Use SQL Without Databases
-
GitHub – GSA/code-gov: An informative repo for all Code.gov repos
https://github.com/simonw/datasette-lite :
> You can use this tool to open any SQLite database file that is hosted online and served with a `access-control-allow-origin: *` CORS header. Files served by GitHub Pages automatically include this header, as do database files that have been published online using `datasette publish`.
> [...] You can paste in the "raw" URL to a file, but Datasette Lite also has a shortcut: if you paste in the URL to a page on GitHub or a Gist it will automatically convert it to the "raw" URL for you
> To load a Parquet file, pass a URL to `?parquet=`
> [...] https://lite.datasette.io/?parquet=https://github.com/Terada...
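The GitHub-to-"raw" shortcut and the `?parquet=` parameter quoted above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not Datasette Lite's own code: the helper names `github_to_raw` and `lite_link` are hypothetical, the repository URL is made up, and the `/blob/` URL pattern is an assumption about what a GitHub file page usually looks like.

```python
from urllib.parse import urlencode, urlsplit

LITE_BASE = "https://lite.datasette.io/"


def github_to_raw(url: str) -> str:
    """Rewrite a github.com 'blob' page URL to its raw.githubusercontent.com form.

    Illustrative only; URLs that do not match the pattern are returned unchanged.
    """
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # Expected shape: <owner>/<repo>/blob/<ref>/<path...>
    if parts.netloc == "github.com" and len(segments) >= 5 and segments[2] == "blob":
        owner, repo, _, ref, *path = segments
        return f"https://raw.githubusercontent.com/{owner}/{repo}/{ref}/{'/'.join(path)}"
    return url


def lite_link(param: str, file_url: str) -> str:
    """Build a Datasette Lite link such as ?url=... or ?parquet=..."""
    return LITE_BASE + "?" + urlencode({param: github_to_raw(file_url)})


# Hypothetical repository and file names, for illustration only.
page = "https://github.com/example-user/example-repo/blob/main/data/sample.parquet"
print(github_to_raw(page))
print(lite_link("parquet", page))
```

Percent-encoding the inner URL is a design choice here; it keeps any query string in the file URL from being confused with Datasette Lite's own parameters.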
There are various *-to-sqlite utilities that load data into a SQLite database for use with e.g. datasette. E.g. pandas with `dtype_backend='pyarrow'` uses Arrow-backed dtypes and can save to Parquet with `DataFrame.to_parquet`.
datasette plugins are written in Python and/or JS w/ pluggy:
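As a rough idea of what a *-to-sqlite utility does, here is a minimal sketch that loads CSV text into a SQLite table using only Python's standard library. The function name `csv_to_sqlite`, the table name, and the sample data are all made up for illustration; real tools handle types, encodings, and large files far more carefully.

```python
import csv
import io
import sqlite3


def csv_to_sqlite(csv_text: str, conn: sqlite3.Connection, table: str) -> int:
    """Load CSV text (with a header row) into a new table; return rows inserted."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Quote identifiers so unusual column names do not break the SQL.
    cols = ", ".join('"' + name.replace('"', '""') + '"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    quoted_table = '"' + table.replace('"', '""') + '"'
    conn.execute(f"CREATE TABLE {quoted_table} ({cols})")
    rows = list(reader)
    conn.executemany(f"INSERT INTO {quoted_table} VALUES ({placeholders})", rows)
    conn.commit()
    return len(rows)


# Made-up sample data, for illustration only.
sample = "name,stars\nalpha,12\nbeta,7\n"
conn = sqlite3.connect(":memory:")
inserted = csv_to_sqlite(sample, conn, "projects")
names = conn.execute('SELECT name FROM "projects" ORDER BY name').fetchall()
```

Columns are created without declared types, so every value is stored as text; a fuller tool would infer or accept column types.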
-
[SQLlite] Is there any online SQL editor I can host on my website? Maybe something in JS or php
Datasette Lite might be even better for this - you can construct URLs that link directly to examples: https://github.com/simonw/datasette-lite
-
SQLite WASM Official
There are some amazing things for SQLite in the browser especially if you're looking for ways to host queryable data for cheap.
I have a hacked-up, experimental proof-of-concept version of datasette-lite that can look at multi-GB databases at https://github.com/simonw/datasette-lite/pull/49. It uses a hacked-up chunked lazyFile implementation from Emscripten and others to grab pages from Cloudflare R2.
It's a test with California's unclaimed property records (https://www.sco.ca.gov/upd_download_property_records.html): a 28GB database, here searching for that guy who owns Twitter: https://datasette-lite-lab.mindflakes.com/index.html?url=htt...
I think there may be a space for super-large multi-GB files served from static storage being accessible from SQLite as well. Another example is this full-text search over a 43GB SQLite database of Wikipedia: http://static.wiki/ . Hearing there's official support for this is awesome, and I hope they might also add some provisions for those sticking with POSIX/Emscripten.
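The chunked lazy-file idea described above can be sketched as simple arithmetic: map each page read to the fixed-size chunks that cover it, then request only those chunks with HTTP Range headers. This is an illustration of the general technique, not the code in the linked PR; the function names and the 4 KiB page and 1 MiB chunk sizes are assumptions.

```python
def chunk_ranges(offset: int, length: int, chunk_size: int) -> list[tuple[int, int]]:
    """Return inclusive (start, end) byte ranges of the chunks covering a read.

    A lazy file only needs to request these chunks (e.g. via HTTP Range
    headers) instead of downloading the whole database.
    """
    if length <= 0:
        return []
    first = offset // chunk_size
    last = (offset + length - 1) // chunk_size
    return [(i * chunk_size, (i + 1) * chunk_size - 1) for i in range(first, last + 1)]


def range_header(start: int, end: int) -> str:
    """Format an HTTP Range header value for an inclusive byte range."""
    return f"bytes={start}-{end}"


# Assumed sizes for illustration: 4 KiB SQLite pages, 1 MiB fetch chunks.
PAGE_SIZE = 4096
CHUNK_SIZE = 1024 * 1024

# Reading page number 300 (1-based) touches a single chunk.
page_offset = (300 - 1) * PAGE_SIZE
ranges = chunk_ranges(page_offset, PAGE_SIZE, CHUNK_SIZE)
headers = [range_header(s, e) for s, e in ranges]
```

The sketch does not clamp the last range to the file size. It also shows why a query that touches every page ends up fetching every chunk, however the chunks are sized.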
-
Hosting SQLite Databases on GitHub Pages
I grafted the enhanced lazyFile implementation of this onto datasette-lite relatively recently. Threw in an 18GB CSV from
https://www.sco.ca.gov/upd_download_property_records.html
into an FTS5 SQLite database, which came out to about 28GB after processing:
POC, non-merging Draft PR for the hack:
https://github.com/simonw/datasette-lite/pull/49
You can run queries through it if you URL-hack your way to the query dialog; browsing is kind of a dud at the moment, since datasette runs a count(*), which downloads everything.
- Learn Postgres at the Playground
-
A SQLite extension for reading large files line-by-line
Oh wow! I wonder how hard it would be to load that module into https://github.com/simonw/datasette-lite
-
This Week in Python
datasette-lite – Datasette running in your browser using WebAssembly and Pyodide
-
Datasette Lite: a server-side Python web application running in a browser
I have an open issue for that here: https://github.com/simonw/datasette-lite/issues/28
My initial hunch is that this will be really difficult: it would probably require a fork of something like https://github.com/coleifer/pysqlite3, then compiled for WebAssembly.
I'm confident it's feasible, but I don't have the skills to figure it out myself.
What are some alternatives?
lxml - The lxml XML toolkit for Python
pyscript - Try PyScript: https://pyscript.com Examples: https://tinyurl.com/pyscript-examples Community: https://discord.gg/HxvBtukrg2
lexbor - Lexbor is development of an open source HTML Renderer library. https://lexbor.com
sqlite-plus - The ultimate set of SQLite extensions
html5lib - Standards-compliant library for parsing and serializing HTML documents and fragments in Python
file-system-access - Expose the file system on the user’s device, so Web apps can interoperate with the user’s native applications.
pyppeteer - Headless chrome/chromium automation library (unofficial port of puppeteer)
datastation - App to easily query, script, and visualize data from every database, file, and API.
pyquery - A jquery-like library for python
pyodide - Pyodide is a Python distribution for the browser and Node.js based on WebAssembly
gazpacho - 🥫 The simple, fast, and modern web scraping library
mergestat-lite - Query git repositories with SQL. Generate reports, perform status checks, analyze codebases. 🔍 📊