Top Down vs. Bottom Up in Application Monitoring – Making The Birds-Eye View Work for You


End to End Monitoring

Monitoring complex application health is hard. Millions have been spent and many have tried over dozens of years, but few have succeeded. Complex applications are built on a slew of moving parts that have evolved over time into a spider web of event servers, workflow engines, message queues, packaged applications, adapters, ETL servers, and thousands of Java nodes, VMs and physical hosts – all spread across a network of global data centers. To date, no single solution has been able to solve the triple threat of total visibility across multiple tiers and multiple vendor stacks while complying with very strict SLAs.

Forests and Trees: The Downside of Bottom-Up
Lots of monitoring solutions claim to be a fix for this thorny problem. One class of such solutions that organizations have traditionally relied upon looks at healthstate from the bottom up. You can see this approach in commonly used code profiling and transaction tracing technologies. Newer entrants, such as log mining solutions like Splunk, take a similarly granular approach. These tools do a great job of seeing what’s going on at the most basic computing level. Through lexical pattern matching they find anomalies that indicate risks to application healthstate or, looking backwards, show where and why an outage may have occurred.

One major problem with this bottom-up approach is that it’s expensive. It is expensive because it requires agents or code injection in order to work, and expensive because of the massive amounts of storage required to capture every bit of information from every log or function call over time.

These technologies have been implemented in thousands of shops worldwide – but rather than being a complete solution for enterprise application monitoring, these tools have proven to be but a single weapon in a well-stocked monitoring arsenal.

To figure out why, let’s look at this problem in a bit more detail. To track code execution, one would need to inject a marker into production code, which is a very risky procedure. Altering production code can have implications for both performance and stability, making code execution analysis tools far less attractive for production environments. The same goes for transaction tracing, log mining and the like. The expense of continuously collecting and analyzing excessive diagnostics illustrates the downside of using a bottom-up approach to healthstate monitoring.

The Aerial Perspective – Gaining the High Ground
Consider the opposite approach. It’s natural to stake out the high ground. Predators in the wild as well as modern search-and-rescue operations take advantage of aerial perspective to save time and energy in directing a hunt. Large swaths of forest or ocean can be covered far more quickly and easily from above. Before gaining any clues as to where to look, it makes little sense to start combing the ground inch-by-inch right below our feet. Getting the lay of the land from a certain distance can make all the difference in finding tonight’s dinner or locating lost hikers.

For many IT organizations, that type of perspective is often elusive. Modern businesses depend upon a portfolio of applications – many highly customized and unique – that in turn depend upon a number of traditionally siloed technologies (hosts, VMs, RDBMSs, middleware) to ensure overall application health. Yet very few of those organizations have implemented solutions that depict application health from the 30,000-foot level.

For example, if I’m the VP of eCommerce, I’d want to visualize my entire application portfolio first so I could see at a glance where any problems are brewing. Then I’d want the ability to drill down into the supporting architectural elements for the affected application or service so that I know which individual or ops team to contact. Without a top-down representation of these applications and underlying technologies, this type of action would be nearly impossible to achieve.

For portfolios of bet-your-business applications that can’t degrade or fail, it makes sense to begin with a top-down rather than bottom-up approach. Only by looking at those portfolios in the context of their importance to the business can we prioritize attention on those apps that matter most, and those conditions that are the most severe. Even with the best CMDB and alert management system, viewing application health as merely the aggregate of the compute state metrics sacrifices critical application context – and thus prevents us from organizing the potentially thousands of related and interdependent configuration items in a typical complex application environment. There must be some way to easily normalize, query and visualize very large data sets that cross computing tiers and vendor stacks. And there must be a framework that allows for modular evolution over time, as the mix of commercial and open source components continues to stretch and morph in nearly every shop.
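To make the idea concrete, here is a minimal Python sketch of a top-down rollup. It is an illustration only, not how RTView Enterprise Monitor or any other product works: the `Severity`, `Component`, and `Application` types, the `business_priority` field, and all sample names are hypothetical. Each application's health is taken as the worst severity among its supporting components, the portfolio is ordered by severity and then by business priority, and a drill-down returns the unhealthy components with the team that owns them.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List


class Severity(IntEnum):
    OK = 0
    WARNING = 1
    CRITICAL = 2


@dataclass
class Component:
    name: str
    tier: str    # hypothetical tiers: "host", "vm", "rdbms", "middleware"
    owner: str   # team to contact when this component is unhealthy
    severity: Severity = Severity.OK


@dataclass
class Application:
    name: str
    business_priority: int  # assumption: 1 = most important to the business
    components: List[Component] = field(default_factory=list)

    def health(self) -> Severity:
        # Application health is the worst severity of any supporting component.
        return max((c.severity for c in self.components), default=Severity.OK)

    def drill_down(self) -> List[Component]:
        # Only the unhealthy components, most severe first.
        unhealthy = [c for c in self.components if c.severity > Severity.OK]
        return sorted(unhealthy, key=lambda c: -c.severity)


def portfolio_view(apps: List[Application]) -> List[Application]:
    # Most severe applications first; ties broken by business priority.
    return sorted(apps, key=lambda a: (-a.health(), a.business_priority))


# Sample data (entirely made up for illustration).
db = Component("orders-db", "rdbms", "dba-team", Severity.CRITICAL)
mq = Component("order-queue", "middleware", "integration-team", Severity.WARNING)
web = Component("web-vm-01", "vm", "infra-team")
storefront = Application("storefront", 1, [web, mq, db])
reporting = Application(
    "reporting", 3, [Component("etl-01", "host", "infra-team", Severity.CRITICAL)]
)
intranet = Application("intranet", 2, [Component("wiki-vm", "vm", "infra-team")])

for app in portfolio_view([reporting, intranet, storefront]):
    contacts = [(c.name, c.owner) for c in app.drill_down()]
    print(app.name, app.health().name, contacts)
```

In this sketch the storefront sorts ahead of reporting even though both are critical, because it carries the higher business priority. That ordering is the application context that a flat list of component alerts would lose.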

The ability to start at the application portfolio level, and present dashboards and visualizations to any number of different roles – development, QA, operations, infrastructure, PaaS, change management – is absolutely critical. By leveraging a common data set and business rules, customized views of application health by domain enable the business to share monitoring responsibility with IT, reducing MTTR and much of the finger-pointing that currently goes on when a critical app goes down. A top-down system should enable one to quickly survey the application landscape and see trends as well as real-time danger signs. And once you know where to look, then it makes sense to bring out the microscopes.

Here at SL, we believe that viewing your application infrastructure from the top down provides better visibility. It lets you identify both real and potential infrastructure problems much more quickly, because you see them in the context of the applications those infrastructure components support. For more information on end-to-end application and infrastructure monitoring using RTView Enterprise Monitor, visit us at www.sl.com.