GitHub – GSA/code-gov: An informative repo for all Code.gov repos

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • code-gov

    An informative repo for all Code.gov repos

  • code-json-generator

    Automation that scrapes USEPA github and provides that metadata for code.gov

At EPA we use this to keep our code.json up to date, but it only scrapes our GitHub:

    https://github.com/USEPA/code-json-generator

This code.gov initiative comes from an Obama-era push to use and release open source, but the attention now seems to be on data (data.gov) and AI (ai.gov)
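A scraper like code-json-generator essentially maps GitHub API repository objects onto code.json `releases` entries. Here is a minimal sketch of that mapping; the field names follow the code.gov metadata schema (v2.0), but the exact fields and the `usageType` logic here are illustrative assumptions, not EPA's actual implementation:

```python
import json

def repo_to_release(repo: dict) -> dict:
    """Map one GitHub API repo object to a code.json 'releases' entry.

    Field names follow code.gov's metadata schema; adjust to your agency's needs.
    """
    return {
        "name": repo["name"],
        "description": repo.get("description") or "",
        "repositoryURL": repo["html_url"],
        # Assumption: public repos are "openSource"; real policy logic will differ.
        "permissions": {
            "usageType": "openSource" if not repo.get("private") else "exemptByPolicy"
        },
        "tags": repo.get("topics", []),
        "vcs": "git",
    }

def build_code_json(agency: str, repos: list) -> str:
    """Assemble the top-level code.json document as a JSON string."""
    return json.dumps(
        {
            "agency": agency,
            "version": "2.0.0",
            "measurementType": {"method": "projects"},
            "releases": [repo_to_release(r) for r in repos],
        },
        indent=2,
    )
```

In practice the repo objects would come from paging through the GitHub REST API (e.g. the org repositories endpoint) on a schedule.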

  • hugo-obsidian

Discontinued: a simple GitHub Action that parses Markdown links into a .json file for Hugo

  • Here's a way to scrape URLs to JSON/YAML and then build static HTML with Hugo in a GitHub Action: https://github.com/jackyzha0/hugo-obsidian

datasette is a webapp and CLI built on SQLite and Python. datasette-lite is the Pyodide + WebAssembly build of datasette, which can be served as static HTML, JS, and WASM SQLite.

    datasette:

  • datasette

    An open source multi-tool for exploring and publishing data

  • https://github.com/simonw/datasette-lite :

    > You can use this tool to open any SQLite database file that is hosted online and served with an `access-control-allow-origin: *` CORS header. Files served by GitHub Pages automatically include this header, as do database files that have been published online using `datasette publish`.

    > [...] You can paste in the "raw" URL to a file, but Datasette Lite also has a shortcut: if you paste in the URL to a page on GitHub or a Gist it will automatically convert it to the "raw" URL for you

    > To load a Parquet file, pass a URL to `?parquet=`

    > [...] https://lite.datasette.io/?parquet=https://github.com/Terada...

    There are various *-to-sqlite utilities that load data into a SQLite database for use with e.g. datasette; pandas can also read Parquet (with `dtype_backend='pyarrow'`) and write it with `to_parquet()`.

    datasette plugins are written in Python and/or JS with pluggy:
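The core of a *-to-sqlite utility can be sketched with the standard library alone; the real tools (e.g. sqlite-utils) add schema inference, upserts, and CLI ergonomics on top. Table and column names below are illustrative:

```python
import sqlite3

def rows_to_sqlite(con: sqlite3.Connection, table: str, rows: list) -> None:
    """Create `table` from the first row's keys and insert every row.

    A toy version of what *-to-sqlite utilities do before publishing
    the database with datasette.
    """
    cols = list(rows[0])
    col_sql = ", ".join('"%s"' % c for c in cols)
    placeholders = ", ".join("?" for _ in cols)
    with con:  # commits on success, rolls back on error
        con.execute('CREATE TABLE IF NOT EXISTS "%s" (%s)' % (table, col_sql))
        con.executemany(
            'INSERT INTO "%s" VALUES (%s)' % (table, placeholders),
            [tuple(r[c] for c in cols) for r in rows],
        )
```

The resulting .db file, served from GitHub Pages (which sets the CORS header mentioned above), can then be opened directly in datasette-lite.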

  • datasette-lite

    Datasette running in your browser using WebAssembly and Pyodide


  • kylo

    Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.


  • datasette-scraper

    Add website scraping abilities to Datasette

  • https://github.com/cldellow/datasette-scraper/#architecture

(TIL that datasette-scraper parses HTML with selectolax; selectolax with the Modest or Lexbor engine is ~25x faster at HTML parsing than BeautifulSoup in the selectolax benchmark:

  • selectolax

    Python binding to Modest and Lexbor engines (fast HTML5 parser with CSS selectors).

  • https://github.com/rushter/selectolax#simple-benchmark )
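The extraction step these parsers perform is simple to illustrate. This sketch uses the stdlib `html.parser` so it runs anywhere; selectolax does the same job with CSS selectors (e.g. `tree.css("a")` on an `HTMLParser` tree) and, per its benchmark, much faster:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags as the document streams through."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html: str) -> list:
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links
```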

(Apache Nutch is a Java-based web crawler which supports e.g. Common Crawl (which backs various foundation LLMs): https://en.wikipedia.org/wiki/Apache_Nutch#Search_engines_bu... But extruct extracts more types of metadata and data than Nutch, AFAIU: https://github.com/scrapinghub/extruct )

    datasette-graphql adds a GraphQL HTTP API to a SQLite database:

  • extruct

    Extract embedded metadata from HTML markup


  • datasette-ripgrep

    Web interface for searching your code using ripgrep, built as a Datasette plugin

  • https://github.com/simonw/datasette-ripgrep

Seeing as there's already a JSON-LD @context (schema) for code.json, CSVW as JSON-LD and/or YAML-LD would be an easy way to merge Linked Data graphs of tabular data:

  • awesome-semantic-web

    A curated list of various semantic web and linked data resources.

  • https://github.com/semantalytics/awesome-semantic-web#csvw

A GitHub Action could run regularly, fetch each agency's code.json, save each to a git repo, and then upsert each into a SQLite database to be published with e.g. datasette or datasette-lite.
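The fetch-and-upsert step of such an Action could be sketched as below. The endpoint URLs and table layout are assumptions for illustration; the real set of agency code.json locations would be configured, and SQLite's `ON CONFLICT ... DO UPDATE` (3.24+) handles the upsert:

```python
import json
import sqlite3
import urllib.request

# Hypothetical agency endpoints; the real list would come from configuration.
CODE_JSON_URLS = ["https://www.epa.gov/code.json"]

def fetch_code_json(url: str) -> dict:
    """Download and parse one agency's code.json."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def upsert_releases(con: sqlite3.Connection, agency: str, releases: list) -> None:
    """Insert releases, updating rows that already exist for (agency, name)."""
    con.execute(
        "CREATE TABLE IF NOT EXISTS releases ("
        "agency TEXT, name TEXT, repositoryURL TEXT, description TEXT, "
        "PRIMARY KEY (agency, name))"
    )
    con.executemany(
        "INSERT INTO releases VALUES (?, ?, ?, ?) "
        "ON CONFLICT (agency, name) DO UPDATE SET "
        "repositoryURL = excluded.repositoryURL, "
        "description = excluded.description",
        [
            (agency, r.get("name"), r.get("repositoryURL"), r.get("description"))
            for r in releases
        ],
    )
    con.commit()
```

On each scheduled run the Action would loop over `CODE_JSON_URLS`, call `fetch_code_json`, and feed each document's `releases` into `upsert_releases` before committing the .db file for publication.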
