Metric | extruct | datasette-lite
---|---|---
Mentions | 3 | 10
Stars | 821 | 309
Growth | 1.3% | -
Activity | 3.8 | 5.4
Last commit | 12 days ago | about 1 month ago
Language | Python | HTML
License | BSD 3-clause "New" or "Revised" License | Apache License 2.0
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
extruct
-
GitHub – GSA/code-gov: An informative repo for all Code.gov repos
https://github.com/rushter/selectolax#simple-benchmark
Apache Nutch is a Java-based web crawler which supports e.g. CommonCrawl (which backs various foundational LLMs): https://en.wikipedia.org/wiki/Apache_Nutch#Search_engines_bu... . But extruct extracts more types of metadata and data than Nutch, AFAIU: https://github.com/scrapinghub/extruct
datasette-graphql adds a GraphQL HTTP API to a SQLite database:
-
Alternative to extruct python library ? (scraping schema.org, jsonld, twitter and fb card)
Is there an alternative for extruct python library in golang ?
-
Scraping MMA fighter stats from a list of names
Seems like sherdog.com supports schema.org data markup, which is really easy to scrape! There's a brilliant Python parser for it: https://github.com/scrapinghub/extruct.
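To make "really easy to scrape" concrete, here is a minimal sketch of what pulling schema.org JSON-LD out of a page involves. It uses only the Python standard library and is not extruct's implementation; extruct's README documents an `extruct.extract()` entry point that covers more syntaxes (microdata, RDFa, Open Graph and so on). The class name, helper name and sample HTML below are all hypothetical, invented for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script bodies may arrive in several chunks, so buffer them.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks instead of failing the whole page


def extract_jsonld(html):
    """Return a list with one parsed object per JSON-LD script block."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    return collector.items


# Hypothetical page fragment, not taken from any real site.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Person", "name": "Example Fighter"}
</script>
</head><body></body></html>
"""

people = extract_jsonld(SAMPLE_HTML)
print(people[0]["name"])
```

For a list of names, the same function could be applied to each fetched profile page; fetching is left out here so the sketch stays self-contained.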
datasette-lite
-
Sqlime: Online SQLite Playground
Also see: https://github.com/simonw/datasette-lite
-
GitHub – GSA/code-gov: An informative repo for all Code.gov repos
https://github.com/simonw/datasette-lite :
> You can use this tool to open any SQLite database file that is hosted online and served with a `access-control-allow-origin: *` CORS header. Files served by GitHub Pages automatically include this header, as do database files that have been published online using `datasette publish`.
> [...] You can paste in the "raw" URL to a file, but Datasette Lite also has a shortcut: if you paste in the URL to a page on GitHub or a Gist it will automatically convert it to the "raw" URL for you
> To load a Parquet file, pass a URL to `?parquet=`
> [...] https://lite.datasette.io/?parquet=https://github.com/Terada...
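As a rough illustration of the quoted behaviour, the sketch below builds Datasette Lite links in Python. The `parquet` parameter comes from the quote above and the `url` parameter mirrors the `?url=` link quoted elsewhere on this page. The helper names and example repository are hypothetical, and the GitHub rewrite is only an approximation of the shortcut the quote describes (Datasette Lite does this conversion itself, so the helper exists purely to show the mapping). It assumes the branch name contains no slashes.

```python
from urllib.parse import urlencode, urlsplit

LITE_BASE = "https://lite.datasette.io/"


def github_blob_to_raw(url):
    """Rewrite github.com/<user>/<repo>/blob/<ref>/<path> to its raw URL.

    Any other URL is returned unchanged. Assumes <ref> has no slashes.
    """
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    if parts.netloc == "github.com" and len(segments) >= 5 and segments[2] == "blob":
        user, repo, _, ref, *path = segments
        return "https://raw.githubusercontent.com/" + "/".join([user, repo, ref, *path])
    return url


def lite_link(file_url, kind="url"):
    """Build a Datasette Lite link; kind is 'url' (SQLite file) or 'parquet'."""
    if kind not in ("url", "parquet"):
        raise ValueError("kind must be 'url' or 'parquet'")
    return LITE_BASE + "?" + urlencode({kind: github_blob_to_raw(file_url)})


# Hypothetical repository and file, used only to show the shape of the result.
link = lite_link(
    "https://github.com/example-user/example-repo/blob/main/data/sample.parquet",
    kind="parquet",
)
print(link)
```

The file still has to be served with the CORS header mentioned in the quote; building the link does not change that.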
There are various *-to-sqlite utilities that load data into a SQLite database for use with e.g. datasette. E.g. pandas can read data with `dtype_backend='pyarrow'` and save it to Parquet with `DataFrame.to_parquet()`.
datasette plugins are written in Python and/or JS w/ pluggy:
-
[SQLlite] Is there any online SQL editor I can host on my website? Maybe something in JS or php
Datasette Lite might be even better for this - you can construct URLs that link directly to examples: https://github.com/simonw/datasette-lite
-
SQLite WASM Official
There are some amazing things for SQLite in the browser especially if you're looking for ways to host queryable data for cheap.
I have a hacked-up, experimental POC version of datasette-lite that can look at multi-GB databases at https://github.com/simonw/datasette-lite/pull/49. It uses a hacked-up chunked lazyFile implementation from Emscripten and others to grab pages from Cloudflare R2.
It's a test with California's unclaimed property records (https://www.sco.ca.gov/upd_download_property_records.html), a 28GB database, searching for that guy who owns Twitter: https://datasette-lite-lab.mindflakes.com/index.html?url=htt...
I think there may be a space for super-large multi-GB files served from static storage being accessible from SQLite as well. Another one would be this full-text search over a 43GB SQLite database of Wikipedia: http://static.wiki/ . Hearing there's official support for this is awesome, and I hope they might also add some provisions for those sticking with POSIX/Emscripten.
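The idea behind a chunked lazy file can be sketched in a few lines of Python. This is not the Emscripten lazyFile code or the code in the PR, only an illustration of the technique: fetch fixed-size chunks on demand (over HTTP this would be a `Range: bytes=start-end` request, with an inclusive end) and cache them, so a query that touches a few pages downloads only a few chunks. The class, the `fetch_range` callback and the in-memory "remote" are all hypothetical.

```python
class ChunkedLazyFile:
    """Read-only view that fetches fixed-size chunks on demand and caches them."""

    def __init__(self, fetch_range, size, chunk_size=4096):
        self._fetch_range = fetch_range  # callable(start, end_inclusive) -> bytes
        self.size = size
        self.chunk_size = chunk_size
        self._cache = {}
        self.fetch_count = 0  # number of simulated network round trips

    def _chunk(self, index):
        if index not in self._cache:
            start = index * self.chunk_size
            end = min(start + self.chunk_size, self.size) - 1
            self._cache[index] = self._fetch_range(start, end)
            self.fetch_count += 1
        return self._cache[index]

    def read_at(self, offset, length):
        """Return up to `length` bytes starting at `offset`."""
        if offset < 0 or length < 0:
            raise ValueError("offset and length must be non-negative")
        end = min(offset + length, self.size)
        if offset >= end:
            return b""
        first = offset // self.chunk_size
        last = (end - 1) // self.chunk_size
        data = b"".join(self._chunk(i) for i in range(first, last + 1))
        skip = offset - first * self.chunk_size
        return data[skip:skip + (end - offset)]


# In-memory stand-in for a file on static storage (16 KiB of dummy bytes).
REMOTE = bytes(range(256)) * 64


def fake_fetch(start, end):
    """Pretend to be an HTTP range request; `end` is inclusive."""
    return REMOTE[start:end + 1]


lazy = ChunkedLazyFile(fake_fetch, size=len(REMOTE), chunk_size=4096)
header = lazy.read_at(0, 100)           # touches only the first chunk
straddle = lazy.read_at(4090, 20)       # crosses into the second chunk
print(lazy.fetch_count)
```

Reading the whole file through `read_at` would fetch every chunk, which is why any operation that scans the full database defeats the purpose of lazy loading.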
-
Hosting SQLite Databases on GitHub Pages
I grafted the enhanced lazyFile implementation of this onto datasette-lite relatively recently. Threw in an 18GB CSV from
https://www.sco.ca.gov/upd_download_property_records.html
into an FTS5 SQLite database, which came out to about 28GB after processing:
POC, non-merging draft PR for the hack:
https://github.com/simonw/datasette-lite/pull/49
You can run queries through it if you URL-hack your way to the query dialog. Browsing is kind of a dud at the moment, since datasette runs a count(*) which downloads everything.
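For readers unfamiliar with the CSV-to-FTS5 step, here is a minimal sketch using Python's built-in `sqlite3` module. It is not the poster's actual pipeline. The table name, column names and rows are invented for illustration, and it assumes the SQLite library bundled with your Python was compiled with FTS5 (common, but not guaranteed).

```python
import csv
import io
import sqlite3

# Invented sample rows; the real CSV has different columns and far more data.
CSV_TEXT = (
    "owner_name,property_type,city\n"
    "Example Owner One,Savings account,Sacramento\n"
    "Example Owner Two,Uncashed check,Fresno\n"
    "Sample Holder,Savings account,Oakland\n"
)


def build_fts_index(conn, csv_file):
    """Load CSV rows into an FTS5 virtual table named `records`."""
    conn.execute(
        "CREATE VIRTUAL TABLE records USING fts5(owner_name, property_type, city)"
    )
    reader = csv.DictReader(csv_file)
    conn.executemany(
        "INSERT INTO records (owner_name, property_type, city) VALUES (?, ?, ?)",
        ((row["owner_name"], row["property_type"], row["city"]) for row in reader),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
build_fts_index(conn, io.StringIO(CSV_TEXT))

# Full-text query: MATCH uses the FTS index, and `rank` orders by relevance.
matches = conn.execute(
    "SELECT owner_name, city FROM records WHERE records MATCH ? ORDER BY rank",
    ("savings",),
).fetchall()
print(matches)
```

Writing to a file path instead of `:memory:` would produce a database file that could then be hosted on static storage.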
-
A SQLite extension for reading large files line-by-line
Oh wow! I wonder how hard it would be to load that module into https://github.com/simonw/datasette-lite
-
This Week in Python
datasette-lite – Datasette running in your browser using WebAssembly and Pyodide
-
Datasette Lite: a server-side Python web application running in a browser
I have an open issue for that here: https://github.com/simonw/datasette-lite/issues/28
My initial hunch is that this will be really difficult: it would probably require a fork of something like https://github.com/coleifer/pysqlite3 that is then compiled for WebAssembly.
I'm confident it's feasible, but I don't have the skills to figure it out myself.
What are some alternatives?
rdflib - RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information.
pyscript - Try PyScript: https://pyscript.com Examples: https://tinyurl.com/pyscript-examples Community: https://discord.gg/HxvBtukrg2
PyLD - JSON-LD processor written in Python
sqlite-plus - The ultimate set of SQLite extensions
contextualise - Contextualise is an effective tool particularly suited for organising information-heavy projects and activities consisting of unstructured and widely diverse data and information resources
file-system-access - Expose the file system on the user’s device, so Web apps can interoperate with the user’s native applications.
code-gov - An informative repo for all Code.gov repos
datastation - App to easily query, script, and visualize data from every database, file, and API.
kylo - Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.
pyodide - Pyodide is a Python distribution for the browser and Node.js based on WebAssembly
metatron - A Python 3.x HTML Meta tag parser, with emphasis on OpenGraph and complex meta tag schemes
mergestat-lite - Query git repositories with SQL. Generate reports, perform status checks, analyze codebases. 🔍 📊