Production Testing Challenges - Venkat Ramakrishnan

There are discussions on LinkedIn about calamities in real-time situations, and about how software testing needs to plan for those too and not just for controlled paths. In this blog, let us look at production testing challenges and how to emulate production environments, the steps, and the aids required.

Most situations in production are not foreseen by the test strategies of internal test planning, at least not as a sequence of test steps that could catch a failure before it occurs. Reasons include:

‘Drift’ of software from normal operating conditions
Poor reproducibility of production failures
Real-time circumstances that won’t typically happen in a test environment
Unavailability of infrastructure that matches real-world conditions

Long-term usage can cause data to clog up over time, system state to degenerate through repeated and varied types of access, databases to become corrupt, hardware to become faulty, and internal system state to get corrupted. These things happen in the real world, and they are very difficult, if not impossible, to reproduce in internal testing. Many times, organisations also don’t have infrastructure comparable to real-world conditions with which to emulate that complexity.
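One partial way to approximate long-term build-up internally is a soak test: drive the system with many repeated operations and sample its internal state at intervals to see whether it keeps growing. The sketch below is only an illustration under stated assumptions. `SessionStore`, its deliberate eviction bug, and the helper names `soak` and `grows_unbounded` are all hypothetical and stand in for whatever component and health metric apply in your system.

```python
import random


class SessionStore:
    """Hypothetical component under test; it never evicts old sessions."""

    def __init__(self):
        self.sessions = {}

    def login(self, user_id, tick):
        # Deliberate bug for illustration: the key includes the tick,
        # so every login adds a new entry instead of refreshing one.
        self.sessions[(user_id, tick)] = tick

    def active_count(self):
        return len(self.sessions)


def soak(store, iterations, sample_every, users=50, seed=0):
    """Run many logins and sample the store size at fixed intervals."""
    rng = random.Random(seed)
    samples = []
    for tick in range(1, iterations + 1):
        store.login(rng.randrange(users), tick)
        if tick % sample_every == 0:
            samples.append(store.active_count())
    return samples


def grows_unbounded(samples, limit):
    """Flag build-up: the last sample exceeds what the workload justifies."""
    return samples[-1] > limit


samples = soak(SessionStore(), iterations=10_000, sample_every=1_000)
print(samples)
print("build-up detected:", grows_unbounded(samples, limit=50))
```

With only 50 distinct users, a healthy store should never hold more than 50 sessions, so steadily rising samples point to accumulating state. A soak test like this covers only the build-up you thought to measure; it does not replace observing the real system in production.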

While controlled calamities can be somewhat emulated, real-time situations involving microservices are complex. I have been writing about architecture models for the past few days, and I find it valuable to have a layered approach to preventing degradation of both the system and the data. Also, observability tools are a must-have in production environments.

Speaking of data, you of course can’t keep storing, endlessly for years, all the data that might provide clues about the build-up that leads to a system failure. There should be a strategy for what data to keep and what data to let go. Data that is no longer relevant, or that has become obsolete with respect to the application, can be discarded. The key thing to realize is that there will be multiple layers of data, and those layers need to be protected through security mechanisms.
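A keep-or-let-go strategy can be written down as an explicit retention policy per kind of data. The following is a minimal sketch, not a recommendation: the data kinds (`audit`, `metrics`, `debug`), the retention periods, and the names `Record` and `partition` are assumptions made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical retention periods per kind of data.
RETENTION = {
    "audit": timedelta(days=365),
    "metrics": timedelta(days=90),
    "debug": timedelta(days=7),
}


@dataclass
class Record:
    kind: str
    created: datetime


def partition(records, now):
    """Split records into those to keep and those past their retention period."""
    keep, drop = [], []
    for record in records:
        max_age = RETENTION.get(record.kind)
        # Kinds without a policy are kept, so nothing is discarded by accident.
        if max_age is None or now - record.created <= max_age:
            keep.append(record)
        else:
            drop.append(record)
    return keep, drop


now = datetime(2024, 1, 1)
records = [
    Record("debug", now - timedelta(days=30)),   # older than 7 days
    Record("audit", now - timedelta(days=30)),   # within 365 days
    Record("legacy", now - timedelta(days=999)), # no policy defined
]
keep, drop = partition(records, now)
print([r.kind for r in keep], [r.kind for r in drop])
```

Keeping data of unknown kind by default is a deliberate choice here: it is safer to review and classify such data than to lose a clue about a slow build-up.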

I talked about code coverage some time back; it gives some clue about areas of code that are NOT executed during testing and that need a closer look. Code that is not covered could become a potential cause of systems failing in production when that code gets executed unexpectedly and its failures are exposed.
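The idea behind coverage can be shown with a toy example that records which branches ran during the tests and lists the ones that never did. This is a hand-instrumented sketch and not how a real coverage tool works internally; the `handle` function, its branch names, and the `hits` set are hypothetical. In practice you would use a coverage tool for your language rather than instrument code by hand.

```python
# Names of all branches we instrumented by hand (hypothetical).
BRANCHES = {"normal", "retry", "corrupt_state"}
hits = set()


def handle(request):
    """Hypothetical request handler with a rarely taken recovery path."""
    if request.get("corrupt"):
        hits.add("corrupt_state")
        return "recovered"
    if request.get("attempt", 1) > 1:
        hits.add("retry")
        return "retried"
    hits.add("normal")
    return "ok"


def run_tests():
    # The internal tests exercise only the controlled paths.
    assert handle({}) == "ok"
    assert handle({"attempt": 2}) == "retried"


run_tests()
uncovered = sorted(BRANCHES - hits)
print("never executed in tests:", uncovered)
```

Here the recovery path for corrupted state is never exercised by the tests, which is exactly the kind of code that may run for the first time in production.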

Overall, mitigating the possibility of a production failure needs a thorough production testing strategy involving architecture, design, code, and data. I hope that you find this area interesting and are willing to invest time and effort in exploring it further. For detailed production testing challenges and strategy in your organisation, feel free to contact me.
