Show HN: I created a first-of-its-kind open corpus of Australian law

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com


    Hey HN, today I'm sharing my latest project, the Open Australian Legal Corpus, a first-of-its-kind multijurisdictional open corpus of Australian legislative and judicial documents. The idea behind this dataset was born a few months ago when, while attempting to pretrain a BERT model for the Australian legal domain, I discovered that there was no freely accessible, openly licensed text corpus of Australian laws and cases that I could use. This was in contrast to the US, UK and EU, which all had multiple large open legal corpora available. Thus, I set out to fill the gap in Australian legal AI research by compiling a dataset of as many in-force Australian laws, regulations, bills and decisions as I could find. The end product was a corpus of 97,750 texts totalling over forty million lines and half a billion tokens, spanning five states, one external territory and the Commonwealth.

    You can view the corpus on [Hugging Face](https://huggingface.co/datasets/umarbutler/open-australian-l...) and the code used to create it on [GitHub](https://github.com/umarbutler/open-australian-legal-corpus-c...).
