In-Database Machine Learning
Adam and Jorge here, and today we’re very excited to share MindsDB with you (http://github.com/mindsdb/mindsdb). MindsDB AutoML Server is an open-source platform designed to accelerate machine learning workflows for people with data inside databases by introducing virtual AI tables. We allow you to create and consume machine learning models as regular database tables.
Jorge and I have been friends for many years, having first met at college. We previously founded another startup that failed, but we stuck together as a team to start MindsDB. Initially a passion project, MindsDB began as an idea to help those who could not afford to hire a team of data scientists, which at the time was (and still is) very expensive. It has since grown into a thriving open-source community with contributors and users all over the globe.
Despite the plethora of data available in databases today, predictive modeling can often be a pain, especially if you need to write complex applications for ingesting data, training encoders and embedders, writing sampling algorithms, training models, optimizing, scheduling, versioning, and moving models into production environments, and then have to maintain them and explain their predictions and degree of confidence. We knew there had to be a better way!
We aim to steer you away from constantly reinventing the wheel by abstracting most of the unnecessary complexities around building, training, and deploying machine learning models. MindsDB provides you with two techniques for this: build and train models as simply as you would write an SQL query, and seamlessly “publish” and manage machine learning models as virtual tables inside your databases. We support ClickHouse, MariaDB, MySQL, PostgreSQL, and MSSQL; MongoDB is coming soon. We also support getting data from other sources, such as Snowflake, S3, SQLite, and any Excel, JSON, or CSV file.
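To make the "model you can query from SQL" idea concrete, here is a minimal sketch using only Python's standard `sqlite3` module. It is a loose analogy, not MindsDB's API or syntax: the `rentals` table, the `predict_price` function, and the per-group-mean "model" are all hypothetical stand-ins chosen for illustration.

```python
import sqlite3
from collections import defaultdict

# Made-up training data: (neighborhood, rental_price).
rows = [("north", 1000.0), ("north", 1200.0), ("south", 800.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rentals (neighborhood TEXT, price REAL)")
conn.executemany("INSERT INTO rentals VALUES (?, ?)", rows)


def train_mean_model(connection):
    """'Train' a trivial model: the mean price per neighborhood."""
    totals = defaultdict(lambda: [0.0, 0])  # neighborhood -> [sum, count]
    query = "SELECT neighborhood, price FROM rentals"
    for hood, price in connection.execute(query):
        totals[hood][0] += price
        totals[hood][1] += 1
    return {hood: total / count for hood, (total, count) in totals.items()}


model = train_mean_model(conn)

# Register the model as a SQL function (hypothetical name), so that
# predictions are fetched with a plain SELECT instead of application code.
# Unknown neighborhoods yield NULL.
conn.create_function("predict_price", 1, lambda hood: model.get(hood))

predictions = dict(conn.execute(
    "SELECT neighborhood, predict_price(neighborhood) "
    "FROM (SELECT DISTINCT neighborhood FROM rentals)"
))
print(predictions)
```

The point of the sketch is only the consumption pattern: once the model is reachable from SQL, anything that can already run a query against the database can get predictions without a separate serving layer.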
When we talk to our growing community, we find that they are using MindsDB for anything ranging from reducing financial risk in the payments sector to predicting in-app usage statistics. One user is even trying to predict the price of Bitcoin using sentiment analysis (we wish them luck). No matter the use case, what we hear most often is that the two most painful parts of the whole process are model generation (R&D) and/or moving the model into production.
For those who already have models (i.e. who have already done the R&D part), we are launching the ability to bring your own models from frameworks like PyTorch, TensorFlow, scikit-learn, Keras, XGBoost, CatBoost, LightGBM, etc. directly into your database. If you’d like to try this experimental feature, you can sign up here: https://mindsdb.com/bring-your-own-ml-models
We currently have a handful of customers who pay us for support. However, we will soon be launching a cloud version of MindsDB for those who do not want to worry about DevOps, scalability, and managing GPU clusters. Nevertheless, MindsDB will always remain free and open-source, because democratizing machine learning is at the core of every decision we make.
We’re making good progress thanks to our open-source community and are also grateful to have the backing of the founders of MySQL & MariaDB. We would love your feedback and invite you to try it out.
Thanks in advance,
Lightwood is Legos for Machine Learning.
3. A decoder that is trained to generate images takes that representation and generates an image.
Note: the above is a good illustrative example. In practice, we're good at outputting dates, numbers, categories, tags, and time series (i.e. predicting 20 steps ahead). We haven't put much work into image/text/audio/video outputs.
You should be able to find more details about how we do this in the docs. Most of the heavy lifting happens in the lightwood repo, and I hope the code there is fairly readable: https://github.com/mindsdb/lightwood
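For readers who want a feel for the encode-then-decode flow described above, here is a toy sketch in plain Python. It is not Lightwood's code or API: every function name is hypothetical, the column names and data are made up, and the decoder weights are hand-set stand-ins for parameters that would normally be learned in training.

```python
from typing import Callable, Dict, List, Sequence

Encoder = Callable[[object], List[float]]


def encode_number(value: object) -> List[float]:
    """Encode a numeric column as a one-element vector."""
    return [float(value)]


def make_category_encoder(categories: Sequence[str]) -> Encoder:
    """Build a one-hot encoder for a categorical column."""
    def encode(value: object) -> List[float]:
        return [1.0 if value == category else 0.0 for category in categories]
    return encode


def encode_row(row: Dict[str, object], encoders: Dict[str, Encoder]) -> List[float]:
    """Concatenate per-column encodings into a single representation."""
    representation: List[float] = []
    for column, encoder in encoders.items():
        representation.extend(encoder(row[column]))
    return representation


def make_category_decoder(labels: Sequence[str],
                          weights: Sequence[Sequence[float]]) -> Callable[[List[float]], str]:
    """Build a decoder that scores each label linearly and returns the best one."""
    def decode(representation: List[float]) -> str:
        scores = [sum(w * x for w, x in zip(label_weights, representation))
                  for label_weights in weights]
        return labels[scores.index(max(scores))]
    return decode


# Hypothetical columns and hand-set weights, purely for illustration.
encoders = {"rooms": encode_number,
            "city": make_category_encoder(["berlin", "lisbon"])}
decoder = make_category_decoder(
    labels=["cheap", "expensive"],
    weights=[[-1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
)

representation = encode_row({"rooms": 3, "city": "lisbon"}, encoders)
prediction = decoder(representation)
print(representation, prediction)
```

The design idea being illustrated is the separation of concerns: each column type gets its own encoder, and the output type decides which decoder is plugged in at the end, so supporting a new input or output type means adding one piece rather than rewriting the pipeline.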
Public dataset benchmarks used for measuring the performance of MindsDB. (by mindsdb)
Regarding benchmarks, we have three main dataset collections we focus on currently:
1. Datasets from customers, but obviously those can’t be made public.
2. The OpenML benchmark, which is fairly limited because it’s mainly binary categories, but which is good because it’s a third party, so unbiased. We have some intermediary results here (https://docs.google.com/spreadsheets/d/1oAgzzDyBqgmSNC6g9CFO...); they are middle-of-the-road. However, I think the benchmark is pretty limited, i.e. it doesn’t cover most of the kinds of inputs and almost none of the outputs we support.
3. An internal benchmark suite which currently has 59 datasets, mainly focused on classification and regression tasks with many inputs, time-series problems, and text. Part of it is public, but opening up the rest is a bit difficult due to licensing issues. I’m hoping that in the next year it will grow and 90%+ of it can be made public. We benchmark against older versions of MindsDB, against handmade models we try to adapt to the task, against the state-of-the-art accuracy for the dataset (if we can find it), and against a few other AutoML frameworks (well, one, but I hope to extend that list). See this repo for the ones we made public, though I'm afraid it's a bit outdated: https://github.com/mindsdb/benchmarks
That being said, benchmarking for us is still a WIP, since as far as I can tell nobody is trying to build open-source models that are as broad as what we're currently doing (for better or worse), and the closed-source services offered by various IaaS providers don't really come with public benchmark results outside of marketing.
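The comparison described in point 3 (a candidate scored against several baselines on each dataset) can be sketched roughly as below. This is a minimal illustration, not the code in the benchmarks repo: the function names, dataset name, baseline names, and scores are all hypothetical.

```python
from typing import Dict, Sequence


def accuracy(predictions: Sequence[object], targets: Sequence[object]) -> float:
    """Fraction of predictions that exactly match their target."""
    if not targets or len(predictions) != len(targets):
        raise ValueError("predictions and targets must be non-empty and equal in length")
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)


def compare_to_baselines(candidate: Dict[str, float],
                         baselines: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """For each dataset, return the candidate score minus the best baseline score.

    Baselines that have no score for a dataset are skipped; datasets with
    no baseline score at all are left out of the result.
    """
    deltas: Dict[str, float] = {}
    for dataset, score in candidate.items():
        available = [scores[dataset] for scores in baselines.values()
                     if dataset in scores]
        if available:
            deltas[dataset] = score - max(available)
    return deltas


# Made-up numbers, purely for illustration.
candidate_scores = {
    "toy_dataset": accuracy(["a", "b", "a", "b"], ["a", "b", "b", "b"]),
}
baseline_scores = {
    "previous_release": {"toy_dataset": 0.5},
    "handmade_model": {"toy_dataset": 0.875},
}

deltas = compare_to_baselines(candidate_scores, baseline_scores)
print(deltas)
```

A negative delta means the candidate is behind the strongest baseline on that dataset, which is the kind of per-dataset regression signal that comparing against older versions is meant to surface.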
NitroML is a modular, portable, and scalable model-quality benchmarking framework for Machine Learning and Automated Machine Learning (AutoML) pipelines.
The benchmarking challenges you are facing are pretty common in the AutoML community. My colleagues and I at Google Research are trying to solve this with https://github.com/google/nitroml. It's still super early days (no CI yet), but I think it could help your team benchmark on a set of open standard benchmark tasks as we open source more of the system.
[D] What would a good ML take home test look like for you?
1 project | reddit.com/r/MachineLearning | 9 Aug 2021
- Machine Learning that is built-in SQL and databases
- Machine Learning via SQL
- Awesome GitHub project - MindsDB: In-Database Machine Learning
GitHub/MindsDB: In-Database Machine Learning
1 project | reddit.com/r/coolgithubprojects | 26 May 2022