Test Data Generation

Quality Of Test Data

I’ve been working on machine learning recently, and I keep coming across situations where I need specific data points to test whether a model is performing as expected. Software testing is tricky in scenarios like this, where probability is involved and the model's conclusions are only guesses or recommendations that cannot be fully guaranteed. This is the challenge of test data generation and test data quality.
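One way I think about testing a probabilistic model is to assert on an aggregate score over known data points rather than on every single prediction. Here is a minimal sketch of that idea. The toy model, the examples, and the 0.75 threshold are all made up for illustration and would need to be replaced with your own.

```python
def accuracy(predict, examples):
    """Fraction of (input, expected_label) pairs the model gets right."""
    correct = sum(1 for features, label in examples if predict(features) == label)
    return correct / len(examples)


def toy_spam_model(text):
    # Hypothetical stand-in for a real model: flags any text containing "free".
    return "spam" if "free" in text.lower() else "ham"


# Hand-picked data points, including one the toy model is expected to get wrong.
examples = [
    ("Free tickets inside", "spam"),
    ("Meeting moved to 3pm", "ham"),
    ("Claim your FREE prize", "spam"),
    ("Feel free to call me", "ham"),  # misclassified by the toy model
    ("Lunch tomorrow?", "ham"),
]

score = accuracy(toy_spam_model, examples)
# The test tolerates individual misses but fails if quality drops too far.
assert score >= 0.75, f"accuracy too low: {score:.2f}"
print(f"accuracy = {score:.2f}")
```

The point of the threshold is that a single wrong guess does not fail the test, but a model that gets much worse does.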

Even when testing regular software (and not a machine learning model), we need to be careful to generate appropriate data sets based on ranges, boundaries, and so on. When multiple types of data interact with each other, it gets even more difficult, and more interesting.
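To make the point about ranges and boundaries concrete, here is a small sketch of boundary-value generation for an inclusive numeric range. The function names and the 18 to 65 age rule are hypothetical examples, not taken from any particular system.

```python
def boundary_values(low, high, step=1):
    """Return boundary-value test inputs for the inclusive range [low, high]."""
    if low > high:
        raise ValueError("low must not exceed high")
    candidates = [low - step, low, low + step, high - step, high, high + step]
    # Remove duplicates while keeping order (very narrow ranges produce repeats).
    unique = []
    for value in candidates:
        if value not in unique:
            unique.append(value)
    return unique


def is_valid_age(age):
    # Hypothetical function under test: accepts ages 18 to 65 inclusive.
    return 18 <= age <= 65


for value in boundary_values(18, 65):
    print(value, is_valid_age(value))
```

The values just outside the range should be rejected and the values on and just inside the limits should be accepted, which is where off-by-one mistakes tend to show up.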

The suggestions I find on various websites for generating test data are:

  • Manually generate test data
  • Test data generation through SQL statements run against the database that needs to be tested
  • Third-party tools (model-based tools)
  • Automated test data generation (based on code scanning and other methods)
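As an illustration of the SQL option in the list above, here is a minimal sketch that seeds a database with known rows before a test runs. It uses Python's built-in sqlite3 module with an in-memory database. The customers table and its rows are assumptions made up for this example.

```python
import sqlite3


def seed_customers(conn, rows):
    """Create a hypothetical customers table and insert known test rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL, age INTEGER)"
    )
    # Parameterised inserts keep the test data separate from the SQL text.
    conn.executemany("INSERT INTO customers (name, age) VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
seed_customers(conn, [("Asha", 17), ("Ben", 18), ("Chen", 65), ("Dee", 66)])

# The query under test: only customers aged 18 to 65 should come back.
adults = conn.execute(
    "SELECT name FROM customers WHERE age BETWEEN 18 AND 65 ORDER BY id"
).fetchall()
print(adults)
```

Because the seeded rows sit exactly on and around the age limits, the expected result of the query is known in advance.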

Model-based tools are very helpful for building libraries of test data and using the right data set at the right point in testing. But I do see shortcomings in this approach. For one, if the model given to the tool as input is specified incorrectly, the resulting test data and test strategy could be totally wrong. Even so, this approach is very useful for front-end design and testing.

Automated test data generation can work for some scenarios, such as systems programming, where you can generate data for the back end. This is done with tools that work much like code generators: they analyze another piece of code and generate data based on that analysis.
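A very simplified sketch of this idea is below. It inspects a function's type annotations and generates random arguments to match. Real tools analyze code far more deeply. The GENERATORS mapping, the generate_args helper, and the apply_discount function are all hypothetical names invented for this example.

```python
import inspect
import random

# Hypothetical mapping from an annotated type to a random value generator.
GENERATORS = {
    int: lambda rng: rng.randint(-1000, 1000),
    float: lambda rng: rng.uniform(-1000.0, 1000.0),
    str: lambda rng: "".join(rng.choice("abcxyz") for _ in range(rng.randint(0, 8))),
    bool: lambda rng: rng.choice([True, False]),
}


def generate_args(func, count=5, seed=0):
    """Build `count` keyword-argument dicts based on func's type annotations."""
    rng = random.Random(seed)  # seeded so the generated data is reproducible
    params = inspect.signature(func).parameters.values()
    cases = []
    for _ in range(count):
        case = {}
        for param in params:
            generator = GENERATORS.get(param.annotation)
            if generator is None:
                raise TypeError(f"no generator for parameter {param.name!r}")
            case[param.name] = generator(rng)
        cases.append(case)
    return cases


def apply_discount(price: float, percent: int, member: bool) -> float:
    # Hypothetical function under test.
    return price * (1 - percent / 100) * (0.95 if member else 1.0)


for kwargs in generate_args(apply_discount, count=3, seed=1):
    print(kwargs, apply_discount(**kwargs))
```

Note that this only generates inputs. Deciding what the correct output should be for each input is still up to the tester.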

Manually generated test data can be exhaustive and comprehensive, but the effort soon leads to combinatorial explosion and fatigue. It is useful for small scenarios, but gets out of control for large and complex systems.
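To show how quickly the combinations grow, here is a small sketch that counts every combination of a handful of input fields. The field names and values are invented for illustration.

```python
from itertools import product
from math import prod

# Hypothetical input fields, each with only a few possible values.
fields = {
    "browser": ["chrome", "firefox", "safari", "edge"],
    "os": ["windows", "macos", "linux"],
    "locale": ["en", "de", "ja", "hi", "pt"],
    "role": ["guest", "member", "admin"],
    "payment": ["card", "wallet", "invoice", "none"],
}

# The number of exhaustive test cases is the product of the value counts.
total = prod(len(values) for values in fields.values())
print(total)  # 4 * 3 * 5 * 3 * 4 = 720

# Enumerating them is easy for a machine, but not for a person writing by hand.
all_cases = [dict(zip(fields, combo)) for combo in product(*fields.values())]
print(all_cases[0])
```

Five small fields already give 720 cases, and each new field multiplies that number again.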

In short, no single method fully takes care of test data generation. It might take a combination of approaches for a specific piece of software. It takes experience and expertise to figure out the right approach to test data generation. Please feel free to contact me if you need input on this for your organisation.
