Running a modern IT platform is rarely an easy or isolated task. Most platforms consist of a fairly large number of components, ranging from the OS level to third-party libraries and components added in the user-facing layers, and numerous integrations on top of that make it an interesting challenge to quickly identify and correct bugs and errors.
While the system complexity does pose a challenge, it is surely not an impossible task. Tools exist for most - if not all - platforms that let you instrument the platform and use that instrumentation to manage it and identify issues quickly.
Instrumentation provides insight…
Instrumentation tools are generic tools that allow you to identify errors in your production environment and provide sufficient context and debug information for developers to diagnose, understand and fix the issues identified. Examples of such tools include AppDynamics, New Relic, Stackify and many others. If you’re the Do-It-Yourself type, it’s not infeasible to build a tool yourself by hooking into error handlers and other hooks exposed by the specific platform to be instrumented.
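To make the Do-It-Yourself route a bit more concrete, here is a minimal sketch of hooking into an error handler, assuming a Python platform. It uses the standard `sys.excepthook` hook to capture uncaught exceptions along with some context. The names `record_exception`, `captured_errors`, `error_counts` and `install` are made up for illustration, and the in-memory list merely stands in for whatever backend a real tool would report to.

```python
import sys
import time
import traceback
from collections import Counter

# Hypothetical in-memory store; a real tool would ship records to a backend.
captured_errors = []
error_counts = Counter()


def record_exception(exc_type, exc_value, exc_tb):
    """Store an exception together with context useful for diagnosis."""
    frames = traceback.extract_tb(exc_tb)
    # Use the innermost frame as the location where the error was raised.
    location = f"{frames[-1].filename}:{frames[-1].lineno}" if frames else "unknown"
    # Group identical errors by type and location so they can be counted.
    fingerprint = f"{exc_type.__name__}@{location}"
    error_counts[fingerprint] += 1
    captured_errors.append({
        "timestamp": time.time(),
        "type": exc_type.__name__,
        "message": str(exc_value),
        "fingerprint": fingerprint,
        "stack": "".join(traceback.format_exception(exc_type, exc_value, exc_tb)),
    })


def instrumented_excepthook(exc_type, exc_value, exc_tb):
    """Record the uncaught exception, then fall back to the default handler."""
    record_exception(exc_type, exc_value, exc_tb)
    sys.__excepthook__(exc_type, exc_value, exc_tb)


def install():
    """Replace the interpreter's hook for uncaught exceptions."""
    sys.excepthook = instrumented_excepthook
```

Calling `install()` once at startup is enough for uncaught exceptions to be recorded, and `error_counts.most_common()` then shows which errors occur most often. Handled exceptions, background threads and web frameworks typically expose their own hooks, so a real tool would need to cover those as well.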
Having worked with various degrees of instrumentation for 10+ years, using both home-built and purchased tools, I can certainly confirm that such tools work and allow you to mature a complex IT platform much quicker, as the insights provided from a live production environment allow you to attack the most frequently occurring errors experienced by real users of the system.
Test suites are great for minimizing risk during development, but they are based on assumptions about how users and data behave in your platform. While the errors identified over time certainly help minimize risk in new development, a test suite remains “theory”, as opposed to instrumentation, which is much more “practice”.
Transparency not needed
While the tools to do instrumentation for most platforms may be readily available, their “natural use” - even in an enterprise setting - seems surprisingly low, and I suspect numerous reasons exist.
“We do not need it” is often the most common. As set procedures exist and they seem to work, why would we need to introduce a new tool to provide data we already have? Error logs, end-user descriptions and screenshots have been used for decades, so why should there be a better method?
“It introduces risk” is another often cited concern. As instrumentation tools are not considered a necessary part of the IT platform, operations may oppose adding one to the already complicated stack - especially if the value of the instrumentation is not known or recognized.
“It is expensive” is another misconception. Instrumentation often doesn’t provide any direct business value (assuming your IT platform isn’t burning and the users aren’t leaving rapidly). Most of the value offered by instrumentation tools lies in fixing issues faster and keeping the scope of issues smaller, and as such it’s often hard to prove the value of issues not occurring.
Transparency not desired
Apparently many people firmly believe that issues neither seen nor reported are not real issues and do not exist. Gaining insights into one instrumented platform while running a black-box platform next to it may cause the false belief that the black-box system is more stable and has fewer issues than the transparent system.
The reason is simply that on black-box systems (that is, systems without any instrumentation tools to monitor their actual performance) it is rare for anyone to proactively examine log files and other places where the black box might emit issues. Only when an issue is reported are developers assigned to examine these sources to resolve it.
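Proactively examining a log file does not have to be a big undertaking. As a rough sketch in Python, the function below tallies the most frequent error messages in a set of log lines. The line layout (`<date> <time> <LEVEL> <message>`), the function name `summarize_errors` and the sample lines are all assumptions made up for illustration; a real log would need its own pattern.

```python
import re
from collections import Counter

# Assumed log line layout: "<date> <time> <LEVEL> <message>"
LINE_PATTERN = re.compile(r"^\S+ \S+ (?P<level>[A-Z]+) (?P<message>.*)$")


def summarize_errors(lines, top=3):
    """Count ERROR messages, masking digits so similar errors group together."""
    counts = Counter()
    for line in lines:
        match = LINE_PATTERN.match(line.strip())
        if match and match.group("level") == "ERROR":
            # Replace numbers (ids, durations) so variants collapse into one entry.
            normalized = re.sub(r"\d+", "<n>", match.group("message"))
            counts[normalized] += 1
    return counts.most_common(top)


# Made-up sample lines, only to show the expected input shape.
sample = [
    "2024-01-01 10:00:00 INFO Started",
    "2024-01-01 10:00:01 ERROR Timeout calling order 1234",
    "2024-01-01 10:00:02 ERROR Timeout calling order 99",
    "2024-01-01 10:00:03 ERROR Disk full",
]

for message, count in summarize_errors(sample):
    print(count, message)
```

Even a simple tally like this, run on a schedule, surfaces recurring issues that would otherwise stay hidden until a user reports them.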
Gaining insights into an IT platform through instrumentation and being able to resolve “real” issues as experienced by your users should be a fantastic thing, but beware that many people implicitly seem to believe that if you don’t monitor for errors and issues, they probably don’t exist - however false that is.