Financial Sentiment Analysis with BERT
In this article, we’ll tackle a batch-jobs-with-containers scenario. To make this concrete, let’s say that every morning a vendor gives us a relatively large dump of everything everyone has said on the internet overnight (Twitter, Reddit, Seeking Alpha, etc.) about the companies in the S&P 500. We want to feed these pieces of text into FinBERT, a version of BERT that has been fine-tuned for financial sentiment analysis. BERT is a language model from Google that was state of the art when it was published in 2018.
The Python Package Index
If you think of an ML model as a library, then it might seem more natural to publish it as a package, either on PyPI for use with pip or Anaconda.org for use with conda, rather than a container. Hugging Face’s transformers is a good example — you run pip install transformers, then your Python interpreter can do things like:
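For example, a minimal sketch using the transformers pipeline API (the hub ID "ProsusAI/finbert" and the input sentence are illustrative; the first call downloads the model weights):

```python
# Load FinBERT through the transformers sentiment-analysis pipeline.
# "ProsusAI/finbert" is FinBERT's model ID on the Hugging Face hub.
from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="ProsusAI/finbert")

# FinBERT labels each text as positive, negative, or neutral.
result = classifier("Shares plunged after the company missed earnings estimates.")
print(result)
```

Each result is a list of dicts with a `label` and a confidence `score`, which makes it easy to aggregate sentiment across a morning's dump of text.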
This works great for playing around with our model locally, but at some point we’ll probably want to run this on the cloud, either to access additional compute (CPU or GPU) or to run it as a scheduled job with e.g. Airflow. The usual thing to do is package up our code (finbert_local_example.py and its dependencies) as a container, which means we now have two containers — one containing our glue code, and the FinBERT container — that we need to launch together and coordinate (i.e. our glue code container needs to know the address/name of the FinBERT container to access it). We might start reaching for Docker Compose, which works great for long-running services, but in the context of an ad-hoc distributed batch job or a scheduled job it will be tricky to work with.
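To make the two-container setup concrete, a Compose file might look something like the following sketch (the service names, image name, port, and environment variable are all hypothetical, not from any published FinBERT image):

```yaml
# Hypothetical docker-compose.yml for the glue-code + FinBERT pair.
services:
  finbert:
    image: example/finbert-serving:latest   # assumed image serving the model over HTTP
    ports:
      - "8000:8000"
  glue:
    build: .                                # container holding finbert_local_example.py
    depends_on:
      - finbert
    environment:
      FINBERT_URL: http://finbert:8000      # Compose service name doubles as the hostname
```

This is exactly the coordination the article describes: Compose wires up the network so the glue container can find FinBERT by name, but the model container keeps running as a service even when the batch job is long finished.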
As a bit of an aside, you could imagine a way to get the best of both worlds with an extension to Docker that would allow you to publish a container that exposes a Python API, so that someone could call sentiment = call_container_api(image="huggingface/transformers", "my input text") directly from their Python code. This would effectively be a remote procedure call into a container that is not running as a service but is instead spun up just for the purpose of executing a function on demand. This feels like a really heavyweight approach to solving dependency hell, but if your libraries are using a cross-platform memory format (hello Apache Arrow!) under the covers, you could imagine doing some fun tricks like giving the container a read-only view into the caller’s memory space to reduce the overhead. It’s a bit implausible, but sometimes it’s helpful to sketch out these ideas to clarify the tradeoffs we’re making with the more practical bits of technology we have available.