Testing machine learning systems is quite different from testing traditional software, and it is not yet as mature or as well understood.
Before deploying to production, you should not focus solely on overall model evaluation. You should also test the model on individual data slices and check its runtime performance, bias, security, and so on.
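As a rough illustration of two of these checks, the sketch below computes accuracy per data slice and measures worst-case prediction latency. Everything in it is hypothetical: `toy_model`, the `region` slice key, the `EXAMPLES` data, and the thresholds are invented for illustration, and a real project would substitute its own model, evaluation set, and limits.

```python
import time
from collections import defaultdict


def toy_model(features):
    """Hypothetical stand-in for a trained classifier: predicts 1 when score > 0.5."""
    return 1 if features["score"] > 0.5 else 0


def accuracy_by_slice(model, examples, slice_key):
    """Return a dict mapping each value of slice_key to the accuracy on that slice."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for features, label in examples:
        group = features[slice_key]
        total[group] += 1
        correct[group] += int(model(features) == label)
    return {group: correct[group] / total[group] for group in total}


def max_latency_seconds(model, examples):
    """Return the slowest single-prediction time observed over the examples."""
    worst = 0.0
    for features, _ in examples:
        start = time.perf_counter()
        model(features)
        worst = max(worst, time.perf_counter() - start)
    return worst


# Made-up evaluation data: (features, expected label).
EXAMPLES = [
    ({"score": 0.9, "region": "eu"}, 1),
    ({"score": 0.2, "region": "eu"}, 0),
    ({"score": 0.7, "region": "us"}, 1),
    ({"score": 0.4, "region": "us"}, 1),  # the toy model gets this one wrong
]


def failing_slices(model, examples, slice_key, min_accuracy):
    """Return the slices whose accuracy falls below min_accuracy."""
    per_slice = accuracy_by_slice(model, examples, slice_key)
    return {g: acc for g, acc in per_slice.items() if acc < min_accuracy}


if __name__ == "__main__":
    # Overall accuracy is 0.75, which hides that the "us" slice is only at 0.5.
    print(accuracy_by_slice(toy_model, EXAMPLES, "region"))
    print(failing_slices(toy_model, EXAMPLES, "region", min_accuracy=0.8))
    print(max_latency_seconds(toy_model, EXAMPLES) < 0.1)
```

The point of the per-slice view is that an aggregate metric can look acceptable while one subgroup performs much worse. The same pattern can be wrapped in a test runner such as pytest so that a regression on any slice, or a latency budget overrun, fails the build.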