“CI/CD for Machine Learning” might be a culture conflict

Over the past two years, there have been quite a few discussions about how to employ continuous integration and continuous delivery (CI/CD) tools in machine learning. As practitioners who apply software engineering principles to data science projects, we are more than happy to see this trend towards a safer and smoother workflow.

However, CI/CD is not just a tool: it embodies a specific mindset and a team culture.

We feel that to get the benefits of automated builds, you should consider the following points carefully. Simply setting up a CI/CD pipeline and starting to ship ML services may not deliver the value you are looking for. You might even break your current data science practices, and your model accuracy might suffer as a result.

Not only a tool but a culture

The term “Continuous Integration” originated from Kent Beck’s Extreme Programming methodology. Under the practice, every team member should closely follow a shared codebase. The code repository could be updated several times a day, but everyone should be at most a few hours behind the latest state.

The idea behind continuous integration is that this helps minimize integration problems. Since everyone is no more than a few hours away from the latest codebase, problems can be resolved quickly. In a team adopting CI practice, integration should be a non-event.

Culture clash between engineering and science

When adopting CI practice in machine learning or data science projects, chances are you will be caught in a culture clash between engineering and science.

A big difference between the two disciplines is that the end result of machine learning is not a deterministic model. There are many random factors in training and validating a machine learning model. Even if the training process is codified in detail, the resulting models will often still show minor variations in their inference results.
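To make the point about random factors concrete, here is a minimal sketch of a toy training run. Everything in it (the function name `train_toy_model`, the data, the learning rate) is a hypothetical illustration rather than a real pipeline. The same code and the same data give slightly different weights depending on the random seed, which controls the starting weight and the order of the samples.

```python
import random


def train_toy_model(seed, epochs=3, learning_rate=0.1):
    """Fit y = w * x with SGD; `seed` controls initialization and sample order."""
    rng = random.Random(seed)
    # Fixed, noise-free toy data generated from the true weight 2.0.
    data = [(i / 10, 2.0 * i / 10) for i in range(1, 11)]
    weight = rng.uniform(-1.0, 1.0)  # random starting point
    for _ in range(epochs):
        rng.shuffle(data)  # random sample order in each epoch
        for x, y in data:
            gradient = 2.0 * (weight * x - y) * x  # derivative of squared error
            weight -= learning_rate * gradient
    return weight


if __name__ == "__main__":
    # Same code, same data, different seeds: the learned weights differ slightly.
    for seed in (0, 1, 2):
        print(seed, train_toy_model(seed))
```

Fixing the seed makes a single run repeatable, but in real projects other sources of randomness (hardware, parallelism, library versions) are often harder to pin down.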

Therefore, it is hard for engineers to test machine learning models. You can specify a fixed validation/testing set and set up acceptance tests in terms of some metrics, but the tests will need a tolerance interval around the expected score. The threshold will also seem arbitrary in many cases.

Worst of all, a misstep in training a new version of the model could manifest as only a slight decrease in its accuracy. Such a decrease might fall within the tolerance and slip past the acceptance tests, because that is the nature of non-deterministic tests.
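A tolerance-based acceptance test of this kind could be sketched as follows. The function name `passes_acceptance` and the numbers are assumptions chosen for illustration; they are not recommended values.

```python
def passes_acceptance(accuracy, baseline, tolerance):
    """Accept the model if its accuracy is at most `tolerance` below `baseline`."""
    return accuracy >= baseline - tolerance


# Hypothetical numbers: the last release scored 0.90 and we allow a 0.02 drop.
BASELINE = 0.90
TOLERANCE = 0.02

if __name__ == "__main__":
    print(passes_acceptance(0.91, BASELINE, TOLERANCE))   # healthy model
    print(passes_acceptance(0.885, BASELINE, TOLERANCE))  # slightly worse model
    print(passes_acceptance(0.85, BASELINE, TOLERANCE))   # clearly worse model
```

The second call is the troubling case. A model that lost 1.5 points of accuracy because of a training mistake still passes, since the test cannot tell a real regression from ordinary run-to-run variation.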

Automated tests are critical to CI/CD practice. However, a set of reliable tests for machine learning models is hard to come by.

Testing for bugs vs testing for overfitting

To test machine learning models, it is also tempting to run what data scientists call the “testing score” at the CI/CD stage as an acceptance test. This might do more harm than good. First of all, as we mentioned, these tests are not deterministic. More importantly, it could break the main tool data scientists use to guard against overfitting.

Most machine learning models are probably developed using a training, validation, and testing set split. This means that when a data scientist acquires some volume of data, after cleaning it up and building a dataset out of it, she divides the dataset into three separate sets for different purposes.

The training set is used to train the model. The validation set is used to measure the model’s performance immediately after training it. The model might perform well on the training set, but that does not mean the model is good for general use. If the model is accurate on the training set but not on the validation set, it performed well only because it has memorized too much about the training set.

In other words, the model only works well on data it has been learning from, but its knowledge cannot generalize to data it has not seen before. That is what we call “overfitting” in data science. It is something that every data scientist battles with.
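The three-way split and the overfitting check described above can be sketched in a few lines. The helper names `three_way_split` and `looks_overfit`, the 60/20/20 proportions, and the 0.05 gap are assumptions for illustration only.

```python
import random


def three_way_split(examples, val_fraction=0.2, test_fraction=0.2, seed=42):
    """Shuffle once, then cut the data into training, validation and testing sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    n_test = round(len(shuffled) * test_fraction)
    validation = shuffled[:n_val]
    testing = shuffled[n_val:n_val + n_test]
    training = shuffled[n_val + n_test:]
    return training, validation, testing


def looks_overfit(training_score, validation_score, max_gap=0.05):
    """Flag a model whose training score is far above its validation score."""
    return training_score - validation_score > max_gap


if __name__ == "__main__":
    training, validation, testing = three_way_split(range(100))
    print(len(training), len(validation), len(testing))
    print(looks_overfit(training_score=0.99, validation_score=0.80))
```

The three sets never share an example, which is what lets the validation and testing scores say something about data the model has not seen.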

We could summarize the different views towards “good software” by saying that software engineering seeks reproducibility, and data science seeks generalizability. That is why the word “test” has different meanings in the two disciplines.

The role of the testing set

So what is the testing set for? In the process of building a model, we would try many different things and retrain the model a lot. The validation set is used to measure its generalizability every time we have a new model. Hopefully, at some point we get good scores on the validation set. However, good performance on the validation set might come only from having tried so many different things that we happened to bump into something that works well on the validation set. In other words, we might be overfitting to the validation set, too.

That’s why we need a testing set. At the end of developing the model, when we have finished our work and are confident about the end result, we test the final model, one time, on the testing set. Since the testing set is the only piece of data that the model has never seen before in the entire cycle of development until now, this tells us if the model generalizes to unknown data.

If the model performs well on the testing set, we are then fairly confident that it should perform reasonably well on unknown, real-world data. If, on the other hand, it performs badly, we know we have made serious mistakes and that the model has overfitted to the validation set, too. This is the last safeguard of model quality before putting it in production.

But it only works if the model has never seen the testing set before. That’s why testing sets should be used only when you are certain that all work has been finished and that the model is good for release. Only at that point should the model be put to the test on the testing set.
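One way to enforce the "one time only" rule in code is a small guard around the testing set. The class `HeldOutTestSet` and its method `evaluate_once` are hypothetical names for this sketch; a real project would also need to protect the underlying data files.

```python
class HeldOutTestSet:
    """Wrap a testing set so that it can be scored exactly once."""

    def __init__(self, examples):
        self._examples = list(examples)
        self._used = False

    def evaluate_once(self, score_fn):
        """Run `score_fn` on the examples, refusing any second attempt."""
        if self._used:
            raise RuntimeError("testing set already used; collect a fresh one")
        self._used = True
        return score_fn(self._examples)


if __name__ == "__main__":
    held_out = HeldOutTestSet([1, 2, 3])
    print(held_out.evaluate_once(len))  # stand-in for a real scoring function
    try:
        held_out.evaluate_once(len)
    except RuntimeError as error:
        print("refused:", error)
```

The guard does not make the score more accurate. It only makes it harder to peek at the testing set repeatedly without noticing.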

The effects of a CI/CD workflow on data scientists

With a CI/CD pipeline, a model will be tested every time it is updated. In this typical workflow, the data scientist training the model will see test results every time she builds a new model.

Therefore, you shouldn’t run data scientists’ testing scores as acceptance tests. Evaluating models on the testing set at this stage lets trainers check the model’s performance every time a new one is built. You would risk overfitting to the testing set, and there would be no other safeguard left.

However, you should run testing scores before releasing the final model, to see if it generalizes well.

So to properly incorporate a CI/CD workflow into a data science project, you should designate a point in time at which to do this final check. The choice of that point, and the frequency of running testing scores, depend on when you count a model as “finalized”, and on the availability of new, up-to-date testing sets.
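As a rough sketch of such a designated check point, a pipeline could decide which checks to run from the stage it is in. The stage names ("commit", "release") and the check names below are assumptions for illustration, not the conventions of any particular CI system.

```python
def checks_for_stage(stage):
    """Return the checks to run at a given pipeline stage."""
    # Assumption: deterministic unit tests and the validation score run on every build.
    checks = ["unit_tests", "validation_score"]
    if stage == "release":
        # The testing score runs only when a model is considered finalized.
        checks.append("testing_score")
    return checks


if __name__ == "__main__":
    print(checks_for_stage("commit"))
    print(checks_for_stage("release"))
```

How often the "release" stage may run then depends on how often a fresh testing set is available.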

This is another reason why the release cycle of machine learning models is often out of sync with the release cycle of other parts of the codebase, aside from the time-consuming nature of retraining models.

Therefore, it is often difficult for data science projects to follow a “continuous” pace in integrating with other parts of the project. This is something you will need to plan for before introducing a CI/CD pipeline to the project.


In this post we reviewed the idea underlying continuous integration and continuous delivery. We explained why it works well in software development, and how a culture clash may arise when it is introduced into data science. In particular, treating a CI/CD workflow as merely a software development tool instead of a team culture may put you at risk of overfitting your model and not knowing it until you put it in production.

We are very excited to see more and more people caring about the software quality of machine learning models and trying out new approaches to address it. In the future, we hope to share more of our experience in this direction, and of the things you should be careful about when applying software engineering practices to build safer and more robust machine learning products.
