extruct vs kylo
| | extruct | kylo |
|---|---|---|
| Mentions | 3 | 1 |
| Stars | 819 | 1,091 |
| Growth | 2.3% | 0.5% |
| Activity | 3.8 | 10.0 |
| Latest commit | 6 days ago | over 1 year ago |
| Language | Python | Java |
| License | BSD 3-clause "New" or "Revised" License | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month-over-month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
extruct
- GitHub – GSA/code-gov: An informative repo for all Code.gov repos
https://github.com/rushter/selectolax#simple-benchmark
Apache Nutch is a Java-based web crawler which supports e.g. CommonCrawl (which backs various foundational LLMs): https://en.wikipedia.org/wiki/Apache_Nutch#Search_engines_bu... But extruct extracts more types of metadata and data than Nutch, AFAIU: https://github.com/scrapinghub/extruct
datasette-graphql adds a GraphQL HTTP API to a SQLite database.
- Alternative to the extruct Python library? (scraping schema.org, JSON-LD, Twitter and FB cards)
Is there an alternative to the extruct Python library in Golang?
- Scraping MMA fighter stats from a list of names
Seems like sherdog.com supports schema.org data markup, which is really easy to scrape! There's a brilliant Python parser for it: https://github.com/scrapinghub/extruct.
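extruct's own API is roughly `extruct.extract(html, base_url=...)` and covers JSON-LD, microdata, RDFa, and Open Graph in one call. To illustrate just the JSON-LD part of that job, here is a standard-library-only sketch (the sample HTML and field values are invented for illustration):

```python
import json
from html.parser import HTMLParser

# Minimal JSON-LD extractor: the kind of work extruct automates (it also
# handles microdata, RDFa, and Open Graph). Sample HTML is made up.
class JsonLdParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_jsonld = False
        self.buf = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and ("type", "application/ld+json") in attrs:
            self.in_jsonld = True

    def handle_data(self, data):
        if self.in_jsonld:
            self.buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self.in_jsonld:
            self.items.append(json.loads("".join(self.buf)))
            self.buf = []
            self.in_jsonld = False

html = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Person", "name": "Jon Jones"}
</script>
</head><body></body></html>"""

parser = JsonLdParser()
parser.feed(html)
print(parser.items[0]["name"])  # -> Jon Jones
```

A real scraper should prefer extruct here: it normalizes the different syntaxes into one structure and handles malformed markup far more robustly than this sketch.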
kylo
- GitHub – GSA/code-gov: An informative repo for all Code.gov repos
https://github.com/simonw/datasette-lite:
> You can use this tool to open any SQLite database file that is hosted online and served with an `access-control-allow-origin: *` CORS header. Files served by GitHub Pages automatically include this header, as do database files that have been published online using `datasette publish`.
> [...] You can paste in the "raw" URL to a file, but Datasette Lite also has a shortcut: if you paste in the URL to a page on GitHub or a Gist it will automatically convert it to the "raw" URL for you
> To load a Parquet file, pass a URL to `?parquet=`
> [...] https://lite.datasette.io/?parquet=https://github.com/Terada...
There are various *-to-sqlite utilities that load data into a SQLite database for use with e.g. Datasette. Pandas can be used similarly: read data with `dtype_backend='pyarrow'` and save it to Parquet with `DataFrame.to_parquet`.
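The *-to-sqlite pattern mentioned above is essentially "rows from some source in, queryable table out". A standard-library-only sketch of that shape (the table and column names are invented for illustration; real tools like sqlite-utils add schema inference and upserts):

```python
import sqlite3

# Sketch of what a *-to-sqlite utility does: take rows from some source
# and load them into a SQLite table that Datasette can then serve.
rows = [
    ("extruct", "Python", 819),
    ("kylo", "Java", 1091),
]

conn = sqlite3.connect(":memory:")  # a real tool would write a .db file
conn.execute(
    "CREATE TABLE repos (name TEXT PRIMARY KEY, language TEXT, stars INTEGER)"
)
conn.executemany("INSERT INTO repos VALUES (?, ?, ?)", rows)
conn.commit()

print(conn.execute("SELECT name FROM repos ORDER BY stars DESC").fetchall())
# -> [('kylo',), ('extruct',)]
```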
Datasette plugins are written in Python (using pluggy hooks) and/or JavaScript.
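A Python plugin is just a module exposing pluggy hook implementations that Datasette discovers. A minimal sketch of the real `prepare_connection` hook (the SQL function name is invented; the `try/except` fallback lets the sketch run even where datasette isn't installed):

```python
import sqlite3

try:
    from datasette import hookimpl  # real plugins use Datasette's marker
except ImportError:
    def hookimpl(func):  # fallback so this sketch runs standalone
        return func

@hookimpl
def prepare_connection(conn):
    # Datasette calls this hook for each SQLite connection it opens;
    # here we register a custom SQL function usable in any query.
    conn.create_function("reverse_text", 1, lambda s: s[::-1])

# Standalone demo: apply the hook to a plain sqlite3 connection.
conn = sqlite3.connect(":memory:")
prepare_connection(conn)
print(conn.execute("SELECT reverse_text('datasette')").fetchone()[0])
# -> ettesatad
```

Installed as a package with a `datasette` entry point, the same function would make `reverse_text()` available in every SQL query Datasette runs.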
What are some alternatives?
rdflib - RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information.
selectolax - Python binding to Modest and Lexbor engines (fast HTML5 parser with CSS selectors).
PyLD - JSON-LD processor written in Python
code-gov - An informative repo for all Code.gov repos
contextualise - Contextualise is an effective tool particularly suited for organising information-heavy projects and activities consisting of unstructured and widely diverse data and information resources
hugo-obsidian - simple GitHub action to parse Markdown Links into a .json file for Hugo
nifi-djl-processor - Apache NiFi 1.10 processor for DJL (Deep Java Library)
metatron - A Python 3.x HTML Meta tag parser, with emphasis on OpenGraph and complex meta tag schemes
awesome-semantic-web - A curated list of various semantic web and linked data resources.
PheKnowLator - PheKnowLator: Heterogeneous Biomedical Knowledge Graphs and Benchmarks Constructed Under Alternative Semantic Models
datasette-ripgrep - Web interface for searching your code using ripgrep, built as a Datasette plugin