Hi HN community. We are excited to open source Dataherald’s natural-language-to-SQL engine today (https://github.com/Dataherald/dataherald). This engine allows you to set up an API from your structured database that can answer questions in plain English.
GPT-4 class LLMs have gotten remarkably good at writing SQL. However, out-of-the-box LLMs and existing frameworks did not work with our own structured data at the necessary quality level. For example, given the question “what was the average rent in Los Angeles in May 2023?” a reasonable human would either assume the question is about Los Angeles, CA or would confirm the state with the question asker in a follow-up. An LLM, however, translates this to:
SELECT price FROM rent_prices WHERE city = 'Los Angeles' AND month = '05' AND year = '2023'
This pulls data for Los Angeles, CA and Los Angeles, TX without getting columns to differentiate between the two. You can read more about the challenges of enterprise-level text-to-SQL in this blog post I wrote on the topic: https://medium.com/dataherald/why-enterprise-natural-languag...
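The ambiguity can be seen concretely in a minimal sketch (using Python's sqlite3 and an invented `rent_prices` schema; the table and values are illustrative, not Dataherald's):

```python
import sqlite3

# Toy table with an invented schema: two different cities named "Los Angeles".
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rent_prices (city TEXT, state TEXT, month TEXT, year TEXT, price REAL)"
)
conn.executemany(
    "INSERT INTO rent_prices VALUES (?, ?, ?, ?, ?)",
    [
        ("Los Angeles", "CA", "05", "2023", 2800.0),
        ("Los Angeles", "TX", "05", "2023", 1100.0),
    ],
)

# The naive LLM translation: no state filter, so CA and TX rows are blended.
ambiguous = conn.execute(
    "SELECT price FROM rent_prices WHERE city='Los Angeles' AND month='05' AND year='2023'"
).fetchall()
print(len(ambiguous))  # 2 rows, mixing two different markets

# What a careful human (or a context-aware engine) would produce instead.
disambiguated = conn.execute(
    "SELECT AVG(price) FROM rent_prices WHERE city='Los Angeles' AND state='CA' "
    "AND month='05' AND year='2023'"
).fetchone()[0]
print(disambiguated)  # 2800.0
```

The naive query silently averages across two unrelated housing markets; the second query is only possible once the engine has enough schema and business context to know a state filter is needed.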
Dataherald comes with “batteries included”: best-in-class implementations of the core components, including a state-of-the-art NL-to-SQL agent and an LLM-based SQL-accuracy evaluator. The architecture is modular, so these components can be easily swapped out, and the engine is easy to set up with the major data warehouses.
There is a “Context Store” whose contents (NL-to-SQL examples, schemas, and table descriptions) are injected into the LLM prompts, so the engine gets better with usage. And we even made it fast!
This version allows you to easily connect to PostgreSQL, Databricks, BigQuery, or Snowflake and set up an API for semantic interactions with your structured data. You can then add business and data context that is used for few-shot prompting by the engine.
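A hedged sketch of how few-shot prompting from such a context store might work (the store layout and prompt format here are invented for illustration, not Dataherald's actual internals):

```python
# Illustrative only: a tiny in-memory "context store" holding table
# descriptions and verified NL-to-SQL pairs, assembled into a few-shot prompt.
context_store = {
    "table_descriptions": [
        "rent_prices(city, state, month, year, price): monthly median rents",
    ],
    "golden_examples": [
        (
            "What was the average rent in Austin, TX in May 2023?",
            "SELECT AVG(price) FROM rent_prices WHERE city='Austin' AND state='TX' "
            "AND month='05' AND year='2023'",
        ),
    ],
}

def build_prompt(question: str) -> str:
    """Assemble schema context and golden examples ahead of the new question."""
    lines = ["You translate questions to SQL.", "Schema:"]
    lines += context_store["table_descriptions"]
    for q, sql in context_store["golden_examples"]:
        lines += [f"Q: {q}", f"SQL: {sql}"]
    lines += [f"Q: {question}", "SQL:"]
    return "\n".join(lines)

prompt = build_prompt("What was the average rent in Los Angeles, CA in May 2023?")
print(prompt)
```

The idea is that each verified question/SQL pair added to the store becomes a few-shot example for future prompts, which is how usage improves accuracy over time.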
The NL-to-SQL agent in this open source release was developed by our own Mohammadreza Pourreza, whose DIN-SQL algorithm is currently at the top of the Spider (https://yale-lily.github.io/spider) and Bird (https://bird-bench.github.io/) NL-to-SQL benchmarks. In our own internal benchmarking, this agent outperformed the Langchain SQLAgent by anywhere from 12% to 250% (depending on the provided context) while being only ~15s slower on average.
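Benchmarks like Spider score text-to-SQL systems largely on execution accuracy: run the predicted and gold queries against the same database and compare results. A simplified sketch (the official evaluators are more careful about ordering, duplicates, and partial matches):

```python
import sqlite3

def execution_match(db: sqlite3.Connection, predicted: str, gold: str) -> bool:
    """Simplified execution-accuracy check: do both queries return the same
    multiset of rows? (Order-insensitive; un-executable SQL counts as wrong.)"""
    try:
        pred_rows = sorted(db.execute(predicted).fetchall())
    except sqlite3.Error:
        return False
    gold_rows = sorted(db.execute(gold).fetchall())
    return pred_rows == gold_rows

# Tiny throwaway database for the check.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (x INTEGER)")
db.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])

# Syntactically different but semantically equivalent queries match...
print(execution_match(db, "SELECT x FROM t WHERE x > 1", "SELECT x FROM t WHERE x >= 2"))
# ...while queries returning different rows do not.
print(execution_match(db, "SELECT x FROM t", "SELECT x FROM t WHERE x > 1"))
```

Execution-based scoring is why two very different SQL strings can both be "correct": only the returned data matters.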
Needless to say, this is an early release and the codebase is under swift development. We would love for you to try it out and give us your feedback! And if you are interested in contributing, we’d love to hear from you!
We were using OpenAI's Codex model with a bunch of few-shot examples for natural-language queries over blockchain data, coupled with a fine-tuned completions model for plotting the results, when we were working on makerdojo.io.
Without the ability to fine-tune Codex, we had to employ a number of techniques to pick the right set of examples to provide in the context to get good SQL. We have come very far since then, and it has been great to see projects like sqlcoder and this one.
Pieces of the pipeline powering makerdojo eventually became LLMStack (https://github.com/trypromptly/LLMStack). We are looking to integrate one of these SQL generators as processors in LLMStack so we can build higher level applications.
There are pretty large NL-to-SQL datasets (e.g. this ~80K-sample dataset: https://huggingface.co/datasets/b-mc2/sql-create-context). An OSS model based on StarCoder was also recently published whose performance is roughly between GPT-3.5 and GPT-4: https://github.com/defog-ai/sqlcoder
I switched to a chat version, named [Ada](https://github.com/BenderV/ada), and IMHO, it's better. The AI explores the database, its connections, the data formats, etc. Plus, it helps with ambiguity and feels more "natural".