-
open-australian-legal-corpus-creator
The code used to create and update the Open Australian Legal Corpus, the first and only multijurisdictional open corpus of Australian legislative and judicial documents.
Hey HN, today I'm sharing my latest project, the Open Australian Legal Corpus, a first-of-its-kind multijurisdictional open corpus of Australian legislative and judicial documents. The idea behind this dataset was born a few months ago when, while attempting to pretrain a BERT model for the Australian legal domain, I discovered that there was no freely accessible, openly licensed text corpus of Australian laws and cases that I could use. This was in contrast to the US, UK and EU, which all had multiple large open legal corpora available. Thus, I set out to fill the gap in Australian legal AI research by compiling a dataset of as many in-force Australian laws, regulations, bills and decisions as I could find. The end product was a corpus of 97,750 texts totalling over forty million lines and half a billion tokens, spanning five states, one external territory and the Commonwealth.
You can view the corpus on [Hugging Face](https://huggingface.co/datasets/umarbutler/open-australian-l...) and the code used to create it on [GitHub](https://github.com/umarbutler/open-australian-legal-corpus-c...).