ML Model Testing (ML4Devs Newsletter, Issue 2)

Is the model's performance score enough?
If your machine learning model has a high correctness score on the holdout test data set, is it safe to deploy it in production?
All models are wrong, but some are useful.
– George E. P. Box (famous British statistician)
But the question I am asking is: Are more correct models more useful?
Recently, we trained a speech recognition model for a customer, and its accuracy exceeded the given goal. On closer examination of the errors, we found that the model did particularly poorly at transcribing numbers. Clearly, evaluating model accuracy alone is not sufficient for deciding whether a model is good enough to deploy.
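The kind of error analysis described above amounts to computing error rates per slice of the data, not just in aggregate. Here is a minimal sketch in plain Python; the slice names ("plain_speech", "numbers") and the `error_rate_by_slice` helper are hypothetical illustrations, not part of any specific toolkit:

```python
from collections import defaultdict

def error_rate_by_slice(examples):
    """Group results by slice and compute per-slice error rates.

    `examples` is a list of (slice_name, is_correct) pairs.
    """
    totals, errors = defaultdict(int), defaultdict(int)
    for slice_name, is_correct in examples:
        totals[slice_name] += 1
        if not is_correct:
            errors[slice_name] += 1
    return {s: errors[s] / totals[s] for s in totals}

# Aggregate accuracy can look fine while one slice is badly broken:
# 95 correct plain-speech utterances, 4 of 5 number utterances wrong.
results = (
    [("plain_speech", True)] * 95
    + [("numbers", False)] * 4
    + [("numbers", True)]
)
rates = error_rate_by_slice(results)
# Overall error rate is 4/100 = 4%, yet the "numbers" slice fails 4/5 = 80%.
```

This is exactly how a high headline accuracy can hide a slice (numbers, accents, rare classes) where the model is unusable.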
Model accuracy tests are just the tip of the iceberg. Image source: Unsplash https://unsplash.com/photos/M-EwSRl8BK8
Model Evaluation vs. Model Testing
In machine learning, we mostly focus on model evaluation: metrics and plots summarizing the correctness of a model on an unseen holdout test data set.
Model testing, on the other hand, checks that the model’s learned behavior matches what we expect. It is not as rigorously defined as model evaluation. Combing through model errors and characterizing them (as I did for speech recognition, which surfaced the problem with numbers) is just one kind of testing.
For a rundown on pre-train and post-train tests, see Effective Testing for Machine Learning Systems by Jeremy Jordan. For some example test cases for the same, see How to Test Machine Learning Code and Systems by Eugene Yan.
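To make the pre-train vs. post-train distinction concrete, here is a toy sketch: a pre-train test runs before training (e.g., checking the holdout set does not leak into training data), and a post-train test probes learned behavior (e.g., an invariance test). The `predict` function and check names below are stand-ins invented for illustration, not any library's API:

```python
def pretrain_check_no_leakage(train_texts, test_texts):
    """Pre-train test: the holdout set must not overlap the training set."""
    return len(set(train_texts) & set(test_texts)) == 0

def posttrain_check_invariance(predict, text, variants):
    """Post-train test: predictions should not change under edits
    (e.g., swapping a person's name) that shouldn't affect the label."""
    baseline = predict(text)
    return all(predict(v) == baseline for v in variants)

# Toy "sentiment model": positive iff the text contains "good".
predict = lambda t: "pos" if "good" in t else "neg"

assert pretrain_check_no_leakage(["Alice had a good day"], ["a bad day"])
assert posttrain_check_invariance(
    predict, "Alice had a good day", ["Bob had a good day"]
)
```

In practice these checks would live in a test suite (e.g., pytest) and run in CI alongside ordinary software tests.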
Model Explainability or Model Interpretability
Explainability is the degree to which a model’s outcome can be understood by humans, and its decision-making “logic” can be explained. At least for non-DNN models, this is a very important part of testing.
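One common model-agnostic explainability technique is permutation importance: shuffle one feature column and see how much the model’s accuracy drops. A large drop means the model relies on that feature. Below is a minimal from-scratch sketch (the function name and thresholds are illustrative; libraries such as scikit-learn offer a production version):

```python
import random

def permutation_importance(predict, X, y, feature_idx, trials=10, seed=0):
    """Average accuracy drop when one feature column is shuffled.

    `predict` maps a list of feature rows to a list of labels.
    """
    rng = random.Random(seed)
    base_acc = sum(p == t for p, t in zip(predict(X), y)) / len(y)
    drops = []
    for _ in range(trials):
        col = [row[feature_idx] for row in X]
        rng.shuffle(col)
        X_perm = [row[:feature_idx] + [v] + row[feature_idx + 1:]
                  for row, v in zip(X, col)]
        acc = sum(p == t for p, t in zip(predict(X_perm), y)) / len(y)
        drops.append(base_acc - acc)
    return sum(drops) / trials

# Toy model that only ever looks at feature 0:
predict = lambda rows: [1 if r[0] > 0 else 0 for r in rows]
X = [[1, 5], [-1, 5], [1, -5], [-1, -5]]
y = [1, 0, 1, 0]
# Shuffling feature 1 never changes predictions, so its importance is 0;
# feature 0 carries all the signal.
```

Because it treats the model as a black box, this works for any classifier, which is what makes it a model-agnostic test in the taxonomy below.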
10 Types of ML Tests
Dr. Srinivas Padmanabhuni lists 10 types of tests covering model evaluation, model testing, inference latency, etc., in 10 Tests for your AI/ML/DL model:
  1. Randomized Testing with Train-Validation-Test Split: Typical test to measure model accuracy on unseen data.
  2. Cross-Validation Techniques: Measure performance over several iterations of the splits of the data, e.g., K-Fold, LOOCV, Bootstrap.
  3. Explainability Tests: Useful when models (like DNNs) are not interpretable. Mainly of two types: model-agnostic and model-specific tests.
  4. Security Tests: To guard against adversarial attacks with poisoned data to fool the models. Again, two varieties: white-box (with knowledge of model parameters) and black-box.
  5. Coverage Tests: A systematic approach to ensure that unseen data is diverse enough to cover broad varieties of input scenarios.
  6. Bias / Fairness Tests: To ensure a model does not discriminate against any demographic group.
  7. Privacy Tests: To prevent privacy attacks/breaches. Model inference should not make it possible to figure out the training data, and the inferred data should not have PII embedded in it.
  8. Performance Tests: Whether the model inference happens within the latency SLAs of the use case.
  9. Drift Tests: To guard against concept/data drift.
  10. Tests for Agency: The closeness of model outcome to human behavior.
Detailed Examples with Code Samples
Similar to Eugene Yan’s article, but longer, with a different emphasis and more detailed code examples.
Summary
Machine learning testing is quite different from software testing, and it is not yet as mature and well understood as traditional testing.
Before deploying to production, do not focus solely on model evaluation; also test models for data slices, runtime performance, bias, security, etc.
We are all trying to figure out model testing. Image Source: Pixabay https://pixabay.com/vectors/ancient-blind-boys-brain-cartoon-2026111/
If you enjoyed this issue, please share it with your team members. Please connect on Twitter or Linkedin, and send your feedback, experiences, and suggestions.
Satish Chandra Gupta @scgupta

ML4Devs is a biweekly newsletter for software developers.

The aim is to curate resources for practitioners to design, develop, deploy, and maintain ML applications at scale to drive measurable positive business impact.

Each issue discusses a topic from a developer’s viewpoint.
