When we look at machine learning through a software developer’s lens, building a successful ML product or, more often, an ML-assisted product feature takes five disciplines coming together: product, data, ML, dev, and ops.
Product Design & Management covers identifying user and business needs, designing the user experience (including collecting implicit or explicit feedback from the use of ML-assisted features), defining business success metrics, and guiding the whole journey from conception to delivery.
It is one of the hardest parts and key to the success of the ML-assisted feature/product. Sadly, it is often ignored.
Data Engineering takes care of collecting, curating, storing, and managing the needed data at scale (aka Big Data). As Monica Rogati explained in the Data Science Hierarchy of Needs, data engineering covers the first 2 layers (out of 6) of the pyramid. The Anaconda State of Data Science 2021 report says that a good 39% of the effort goes into data cleaning and data preparation (page 14).
Without good quality data, there is no ML. And without a solid Data Engineering foundation, there is no ML product.
The spectrum of data analytics, data science, and machine learning is about designing statistical/probabilistic models, as opposed to traditional deterministic algorithms/programs. In ML, data is logic: some of the product features are implemented using statistical models rather than hand-written rules.
Models can vary from simple linear regression to deep neural networks. Typically, several models are trained, their hyper-parameters tuned, and the best model selected. Bigger and better deep learning models make news regularly. In production, though, simplicity quite often trumps cleverness, and better data wins over state-of-the-art (SOTA) models.
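As a rough illustration of the train, tune, and select loop described above, here is a minimal sketch in plain Python. Everything in it is an assumption made for the example: the synthetic dataset, the three candidate model families (a mean baseline, simple linear regression, and k-nearest-neighbours regression with `k` as the tuned hyper-parameter), and the helper names such as `select_best`. A real project would more likely rely on an ML library than on hand-rolled models.

```python
import random
from statistics import mean


def make_data(n, seed=0):
    """Synthetic, roughly linear data (an assumption for this sketch)."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [3.0 * x + 2.0 + rng.gauss(0, 0.5) for x in xs]
    return xs, ys


def fit_mean(xs, ys):
    """Baseline: always predict the training mean."""
    m = mean(ys)
    return lambda x: m


def fit_linear(xs, ys):
    """Simple linear regression via the closed-form least-squares solution."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def fit_knn(xs, ys, k):
    """k-nearest-neighbours regression; k is the hyper-parameter to tune."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return mean(y for _, y in nearest)

    return predict


def mse(model, xs, ys):
    """Mean squared error of a fitted model on a dataset."""
    return mean((model(x) - y) ** 2 for x, y in zip(xs, ys))


def select_best(train, valid, ks=(1, 5, 15)):
    """Fit every candidate on train, score on valid, return the winner."""
    candidates = {
        "mean_baseline": fit_mean(*train),
        "linear": fit_linear(*train),
    }
    for k in ks:
        candidates[f"knn_k={k}"] = fit_knn(*train, k=k)
    scores = {name: mse(model, *valid) for name, model in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores


xs, ys = make_data(200)
train = (xs[:140], ys[:140])   # data used for fitting
valid = (xs[140:], ys[140:])   # held-out data used for selection
best_name, scores = select_best(train, valid)
print("selected:", best_name)
```

Keeping a trivial baseline among the candidates is deliberate: if a sophisticated model cannot clearly beat it on held-out data, the simpler option is usually the safer one to ship.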
Developers knit an ML model seamlessly into the rest of the product, and continuously develop-test-deploy code to achieve business goals. They apply the rigor of software engineering principles to design, develop, test, evaluate, and maintain software systems. ML Engineers are responsible for scaling a model for mass consumption.
At present, data scientists develop a model and “toss it over the wall” to engineers to productionize it. This waterfall approach is very frustrating and often fails. Cross-functional teams with end-to-end responsibility are more likely to deliver results faster.
Operations (DevOps or DevSecOps) is the discipline of continuous integration and continuous delivery/deployment (CI/CD). In the case of MLOps, it becomes CT/CI/CD: continuous model training, integration, and delivery/deployment.
The aim is to automate the process of training models, integrating and packaging them into software services (typically Docker containers), deploying them on the cloud, monitoring their performance in production (e.g., catching concept/data drift), firing alerts in case of issues, and triggering rollbacks or retraining as and when needed.
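To make the monitoring step a little more concrete, here is a minimal sketch of one common way to catch data drift: comparing the distribution of a feature in production against its training-time distribution using the Population Stability Index (PSI). This is only one of many possible drift checks, not a prescribed method. The function names `psi` and `decide_action` are hypothetical, and the 0.1 and 0.25 thresholds are a widely used rule of thumb taken here as an assumption, not a universal standard.

```python
import math
import random


def psi(reference, live, bins=10, eps=1e-6):
    """Population Stability Index between a reference sample and a live sample.

    Bins are equal-width over the range of the reference sample; live values
    outside that range are clamped into the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)
            counts[idx] += 1
        # eps avoids log(0) and division by zero for empty bins
        return [max(c / len(values), eps) for c in counts]

    ref_frac, live_frac = fractions(reference), fractions(live)
    return sum((l - r) * math.log(l / r) for r, l in zip(ref_frac, live_frac))


def decide_action(psi_value, warn=0.1, retrain=0.25):
    """Map a PSI value to a pipeline action (thresholds are assumptions)."""
    if psi_value >= retrain:
        return "retrain"
    if psi_value >= warn:
        return "alert"
    return "ok"


# Simulated feature values: training-time data vs. two production windows.
rng = random.Random(42)
training_sample = [rng.gauss(0, 1) for _ in range(5000)]
stable_window = [rng.gauss(0, 1) for _ in range(5000)]
drifted_window = [rng.gauss(1.5, 1) for _ in range(5000)]

for name, window in [("stable", stable_window), ("drifted", drifted_window)]:
    value = psi(training_sample, window)
    print(f"{name}: PSI={value:.3f} -> {decide_action(value)}")
```

In an automated CT/CI/CD setup, the string returned by `decide_action` would typically be wired to whatever alerting or retraining trigger the pipeline provides; that wiring is platform-specific and is left out of this sketch.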