emubert-creator
The training code behind EmuBert, the largest open-source masked language model for Australian law.
Not only that but, despite only being trained to guess missing words, EmuBert appears to have picked up facts: that Norfolk Island is an Australian territory (try the prompt 'Norfolk Island is an Australian <mask>.'); that it is Section 51 of the Constitution that grants Parliament the power to make laws for the peace, order, and good government of the Commonwealth ('Section <mask> of the Constitution grants the Australian Parliament the power to make laws for the peace, order, and good government of the Commonwealth.'); and that the representative of the monarch of Australia is the Governor-General ('The representative of the monarch of Australia is the <mask>-General.').
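You can try these prompts yourself. A minimal sketch using the Hugging Face `transformers` fill-mask pipeline (this assumes `transformers` and a backend such as PyTorch are installed, and that EmuBert uses a RoBERTa-style '<mask>' token):

```python
from transformers import pipeline

# Load EmuBert into a fill-mask pipeline (downloads the model on first run).
fill = pipeline('fill-mask', model='umarbutler/emubert')

# Ask EmuBert to fill in the masked word and show its top guesses.
for pred in fill('Norfolk Island is an Australian <mask>.', top_k=3):
    print(pred['token_str'], round(pred['score'], 3))
```

Each prediction includes the candidate token and the model's probability for it, so you can see how confident EmuBert is in each guess.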
Finally, EmuBert achieves a perplexity of 2.05 on the Open Australian Legal QA, the first open dataset of Australian legal questions and answers, outperforming all known state-of-the-art masked language models, including RoBERTa, BERT and Legal-BERT.
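For readers unfamiliar with the metric, perplexity is simply the exponential of the mean per-token cross-entropy loss, so a perplexity of 2.05 corresponds to a mean loss of ln(2.05) ≈ 0.72 nats. A small illustration (the losses below are made up for the example, not EmuBert's actual values):

```python
import math

def perplexity(token_losses: list[float]) -> float:
    """Perplexity is exp of the mean per-token cross-entropy (in nats)."""
    return math.exp(sum(token_losses) / len(token_losses))

# Illustrative per-token losses only, not EmuBert's real values.
losses = [0.65, 0.70, 0.80]
print(round(perplexity(losses), 3))
```

Lower is better: a perplexity of 1.0 would mean the model is perfectly certain of every masked token, while a model guessing uniformly among k candidates would score a perplexity of k.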
You can check out EmuBert on Hugging Face here: https://huggingface.co/umarbutler/emubert
The code I used to create EmuBert is also openly available on GitHub: https://github.com/umarbutler/emubert-creator