To be agile, or not to be
Newsletter Issue 3: Should data science embrace Agile? Why and why not?
I guess the answer depends on whom you ask.
I have seen many Data Scientists bitterly oppose Agile and Scrum.
The crux of the argument is that Data Science is science and not engineering. Therefore:
Estimating the time requirement is very difficult.
Its nature is not iterative: unlike software, you can’t build a piece that partly works, and then fill in more pieces to make it more complete.
Its nature is waterfall: when an idea doesn’t work well, you might have to go all the way back to tweaking the problem formulation and collecting a different kind of data.
Agile means more meetings (stand up, sprint planning, retrospective, etc.) and less work.
Agile means constant change of priorities (as a consequence of constantly evolving understanding of requirements and business needs).
Agile Methodology makes you mechanical and hinders creativity.
In some sense, and to some extent, all of it is true.
Déjà vu for “old” enough Software Engineers.
Interestingly, software engineers who are old enough will feel déjà vu. Programmers had the same arguments in the late 90s:
Programming is part art and part science. It is a highly creative process.
Estimating software development efforts is a notoriously hard problem.
When you discover a problem in the software design, often you have to go back to the very beginning (i.e. it’s waterfall-ish).
Do you want me to sit in so many meetings for requirement review, design, estimate, integration plan, test plan, or do you want me to code and finish the stuff?
And here we are! Now most developers follow some kind of iterative process, and data scientists often think that engineers and managers don’t get “science and research”.
Just as software was (and still is) only a means to an end, data science and machine learning are now a means to business goals.
So, what can we do?
I believe that with time, we will figure out how to manage the unpredictabilities of data science better, just as we figured that out for software development.
First, let’s step back and revisit the Agile manifesto:
Individuals and interactions over processes and tools
Working software over comprehensive documentation
Customer collaboration over contract negotiation
Responding to change over following a plan
Instead of following the rituals of Agile, we need to go back to its essence and adapt that to machine learning.
In my experience, the following practices improve the probability of successfully deploying a machine learning project and making a business impact:
Consolidate Ownership: Cross-functional team of product, developers, and data scientists responsible for the end-to-end project.
Integrate Early: Implement a simple (maybe even a rule-based) model and develop product features around it.
Iterate Often: Build better models and replace the simple model, monitor, and repeat.
Consolidating into a single team exposes data scientists and developers to each other’s requirements early on.
Counterintuitively, integrating early actually decouples model development from software development (that great software engineering principle: high cohesion and low coupling). Once the product is built around a simple model, each side can follow its own cadence while staying in the same rhythm.
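To make "integrate early, iterate often" concrete, here is a minimal Python sketch. Everything in it is hypothetical and chosen only for illustration: the spam-flagging feature, the SpamModel interface, and the class and function names are assumptions, not something from a real project. The idea is that product code depends only on a small interface, so a rule-based model can ship first and a better model can replace it later without touching the feature.

```python
from typing import Callable, Protocol


class SpamModel(Protocol):
    """Hypothetical contract that the product code depends on."""

    def predict(self, text: str) -> bool: ...


class RuleBasedModel:
    """Day-one baseline: a plain keyword rule, no training data needed."""

    def __init__(self, keywords: set[str]) -> None:
        self.keywords = {k.lower() for k in keywords}

    def predict(self, text: str) -> bool:
        # Flag the text if any whitespace-separated word is a known keyword.
        return bool(set(text.lower().split()) & self.keywords)


class ThresholdModel:
    """Stand-in for a later learned model: wraps any scoring function."""

    def __init__(self, score_fn: Callable[[str], float], threshold: float = 0.5) -> None:
        self.score_fn = score_fn
        self.threshold = threshold

    def predict(self, text: str) -> bool:
        return self.score_fn(text) >= self.threshold


def flag_messages(model: SpamModel, messages: list[str]) -> list[str]:
    """Product feature, written once against the interface."""
    return [m for m in messages if model.predict(m)]


messages = ["win a free prize now", "lunch at noon?"]

# Sprint 1: ship the feature with the rule-based baseline.
baseline = RuleBasedModel({"free", "prize"})
flagged_v1 = flag_messages(baseline, messages)

# Later sprint: swap in a (pretend) learned scorer; flag_messages is unchanged.
learned = ThresholdModel(lambda t: 0.9 if "prize" in t else 0.1)
flagged_v2 = flag_messages(learned, messages)
```

The feature code never changes between the two versions, which is what lets developers and data scientists iterate at different speeds.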
It has started.
Some of it is already happening:
Eugene Yan wrote a 3 part series on Data Science and Agile: what works and what doesn’t, frameworks for effectiveness, and what he loves about Scrum in data science.
Agile Data Science 2.0 by Russell Jurney (chapter of a book published by O’Reilly)
Unpopular Opinion: Agile is not only suitable for Data Science projects, but it is the only way to run one by Laszlo Sragner
Can Data Science Be Agile? Implementing Best Agile Practices to Your Data Science Process by Jerzy Kowalski
So, what do you think? What parts of Agile philosophy and process are suitable to adopt in data science and for taking machine learning to production? Please share your thoughts in the comments.
ML4Devs Newsletter - Issue 03, published on 11 Feb 2022.