Apache Kafka Monitoring
RTView's Solution Package for Apache Kafka provides a complete Kafka monitoring solution with pre-built dashboards for monitoring Kafka brokers, producers, consumers, topics, and Kafka Zookeepers. With over 30 pre-defined alerts and over 15 pre-built monitoring dashboards, users can deploy quickly without the time, skill and expense necessary to build or configure their own monitoring applications or dashboards.
The Solution Package for Kafka can be used to provide a Kafka-only monitor or within a larger RTView Enterprise Edition system that also provides visibility into the complementary technologies that make up a Kafka-based service or application. This enables users to cross-correlate and perform root cause analysis across the entire application stack for different types of components, including middleware (Kafka, TIBCO, Solace), databases, VMware, and open source, from the host and network level through the database and middleware layers up to the application itself. And RTView can centralize all this across on-premise, cloud-based, and hybrid environments.
End-to-End Visibility across Your Entire Application Deployment Environment
RTView® Enterprise Edition is a platform inside which any number of solution packages can be hosted, aggregating component-level data and viewing/managing/correlating these data as a whole. Without RTView® Enterprise Edition, users would be able to view only each individual data source, albeit with advanced visualizations and real-time granularity, but without the ability to aggregate and compare critical event data across all the relevant components of an enterprise deployment environment.
RTView® Enterprise Edition enables direct or indirect collection of data at the Solution Package level. Solution Packages are available for multiple off-the-shelf components across all of the tiers in common deployment architectures, including infrastructure (hardware, OS, network, storage, and database), middleware (app servers, data grids, messaging systems, SOA and event processing servers, etc.) and UI/UX components and processes.
Thorough troubleshooting requires that you understand how problems occur and develop so that you can make sure that they don't happen repeatedly. RTView intelligently caches data in-memory for instant access and data can be stored persistently for long-term capacity analysis.
Our time-series trend charts easily differentiate between transient spikes and slow-growth trends so that users can respond appropriately. Users can access a rich set of historical performance data going back months or even years for troubleshooting and failure analysis.
With RTView Enterprise, users can organize application components, such as Kafka brokers and zookeepers, into logical groups. Groups can be defined as applications, services, business units, data centers or other geographical entities, and can include a variety of components from different application tiers, such as other middleware or virtual hosts. For application support teams, this capability is critical: it lets them understand the business criticality of issues so they can prioritize and respond to events more effectively.
- Kafka Broker Alerts (11)
- Kafka Cluster Alerts (2)
- Kafka Consumer Alerts (6)
- Kafka Producer Alerts (7)
- Kafka Zookeeper Alerts (6)
Alerts from multiple middleware applications and other monitoring systems can be centralized and managed in one place. Alerts can also be integrated with third-party monitoring and service desk applications, such as ServiceNow or Remedy, to automate workflows.
Monitor Kafka Brokers
Brokers are the heart of Apache Kafka and manage critical functions such as partitions, reads and writes, updating replicas with new data, and flushing expired data. RTView monitors dozens of broker metrics and provides early warning on key indicators such as under-replicated partitions and slow purge rates.
- Broker State
- Partitions (count, offline, and under-replicated)
- Purgatory (fetch, heartbeat, produce, rebalance, topic)
- Metered Metrics (mean rate, 1-min, 5-min and 15-min rate)
- Histogrammed Metrics (min, mean, max, std dev, and percentile)
- Timer Metrics (min, max, mean, std dev, 1-min, 5-min, 15-min rate, and by percentile)
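To illustrate what the 1-min, 5-min and 15-min rates in the list above represent, here is a minimal sketch of an exponentially weighted moving average (EWMA) meter of the kind used by Dropwizard/Yammer-style metrics libraries. It is not RTView or Kafka code: the `Ewma` class, the `simulate` helper, and the 5-second tick interval are assumptions made for the example.

```python
import math

TICK_SECONDS = 5  # assumed sampling interval between rate updates


class Ewma:
    """Exponentially weighted moving average of an event rate (events/sec)."""

    def __init__(self, window_minutes: float) -> None:
        # A longer window gives a smaller alpha, so the rate reacts more slowly.
        self.alpha = 1.0 - math.exp(-TICK_SECONDS / (60.0 * window_minutes))
        self.rate = None    # events per second; None until the first tick
        self.uncounted = 0  # events seen since the last tick

    def mark(self, n: int = 1) -> None:
        """Record n events."""
        self.uncounted += n

    def tick(self) -> None:
        """Fold the events of the last interval into the moving average."""
        instant = self.uncounted / TICK_SECONDS
        self.uncounted = 0
        if self.rate is None:
            self.rate = instant
        else:
            self.rate += self.alpha * (instant - self.rate)


def simulate(events_per_tick):
    """Feed a sequence of per-tick event counts into 1-, 5- and 15-min meters."""
    meters = {minutes: Ewma(minutes) for minutes in (1, 5, 15)}
    for count in events_per_tick:
        for meter in meters.values():
            meter.mark(count)
            meter.tick()
    return {minutes: meter.rate for minutes, meter in meters.items()}


if __name__ == "__main__":
    # One minute of steady traffic, then a single burst ten times larger.
    for minutes, rate in simulate([500] * 12 + [5000]).items():
        print(f"{minutes:>2}-min rate: {rate:.1f} events/sec")
```

With this input, the 1-min rate moves furthest from the steady value after the burst and the 15-min rate moves least, which is why comparing the three helps separate a transient spike from a sustained change.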
Monitor Kafka Topics
Monitor the performance of all Kafka topics and drill down for detailed information on topic performance on individual brokers.
- Messages (in per second)
- Bytes (in/out/rejected per second)
- Produce Requests (total/failed per second)
- Fetch Requests (total/failed per second)
- Trend data (count, mean rate, 1-min/5-min/15-min avg)
Monitor Kafka Producers
Monitoring the health of Kafka producers can provide an early warning indicator of overall application health. It is important to identify unusual trends as soon as possible so that you can react quickly and intercept growing problems. Metrics include:
- Producer Count and Status
- Batch Size (min, max)
- Buffer (available bytes, exhausted rate, total bytes, pool wait ratio)
- Compression Rate
- Connection (count, close rate, creation rate)
- Incoming Byte Rate & Outgoing Byte Rate
- IO (ratio, TimeNS avg, Wait Ratio, Wait Time Millisec Avg)
- Produce Throttle Time (avg, max)
- Record (error rate, queue time avg/max, retry rate, send rate, size avg/max, per request avg)
- Requests (latency avg/max, rate, size avg/max, # in flight)
- Response Rate & Select Rate
- Waiting Threads
Monitor Kafka Consumers
The performance of consumers can also provide an early warning indicator that something is wrong upstream from Kafka. Finding these problems early can be the key to getting back on track quickly. Metrics include:
- Bytes Consumed Rate
- Fetch (latency avg/max, rate, size avg/max, throttle time avg/max)
- Records (consumed rate, lag max, per request avg)
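As a rough illustration of the "lag max" metric above, consumer lag for a partition can be thought of as the distance between the newest offset in the log and the offset the consumer has reached. The sketch below is a simplification that works from committed offsets and hypothetical sample data; the function names and the topic name `orders` are assumptions for the example, not part of RTView or the Kafka client.

```python
from typing import Dict, Tuple

TopicPartition = Tuple[str, int]  # (topic name, partition number)


def partition_lag(
    end_offsets: Dict[TopicPartition, int],
    committed: Dict[TopicPartition, int],
) -> Dict[TopicPartition, int]:
    """Return the number of records each partition's consumer is behind."""
    lag = {}
    for tp, end in end_offsets.items():
        # Assumption: a partition with no committed offset is read from offset 0.
        consumed = committed.get(tp, 0)
        lag[tp] = max(end - consumed, 0)
    return lag


def max_lag(lag: Dict[TopicPartition, int]) -> int:
    """Return the worst lag across all partitions, or 0 if there are none."""
    return max(lag.values(), default=0)


if __name__ == "__main__":
    # Hypothetical snapshot of log-end offsets and committed offsets.
    end_offsets = {("orders", 0): 120, ("orders", 1): 300}
    committed = {("orders", 0): 100, ("orders", 1): 295}
    lag = partition_lag(end_offsets, committed)
    print(lag, "max lag:", max_lag(lag))
```

A maximum lag that keeps growing over time suggests the consumer is not keeping up with producers, while a brief jump that recovers is more likely a transient burst.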
Monitor Kafka Zookeepers
Kafka uses Zookeeper to manage all the Kafka brokers including electing a controller, cluster membership, topic configuration, quotas, ACLs and consumer membership. RTView monitors critical Zookeeper metrics including:
- Latency (min/max/avg)
- Connections & Outstanding Requests
- Nodes & Watches
- Packets received & sent (delta & rate)
- Client Connections Per Host
- Session Timeout (min, max)
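Several of the metrics above correspond to values that a Zookeeper server reports through its `mntr` four-letter command as tab-separated key/value lines. The sketch below parses such output and derives a packet delta and rate from two samples. It is an illustration only and does not describe how RTView collects its data; the sample text, its values, and the helper names are assumptions.

```python
from typing import Dict, Tuple, Union

Value = Union[int, float, str]

# Illustrative sample in the tab-separated format of `mntr`; values are made up.
SAMPLE_MNTR = (
    "zk_version\t3.6.3\n"
    "zk_avg_latency\t0.5\n"
    "zk_max_latency\t41\n"
    "zk_min_latency\t0\n"
    "zk_packets_received\t1200\n"
    "zk_packets_sent\t1180\n"
    "zk_num_alive_connections\t7\n"
    "zk_outstanding_requests\t0\n"
    "zk_znode_count\t154\n"
    "zk_watch_count\t32\n"
)


def parse_mntr(text: str) -> Dict[str, Value]:
    """Parse `mntr`-style output into a dict, converting numbers where possible."""
    stats: Dict[str, Value] = {}
    for line in text.splitlines():
        key, sep, raw = line.partition("\t")
        if not sep:
            continue  # skip lines that are not key<TAB>value
        raw = raw.strip()
        try:
            stats[key] = int(raw)
        except ValueError:
            try:
                stats[key] = float(raw)
            except ValueError:
                stats[key] = raw  # keep non-numeric values such as the version
    return stats


def delta_and_rate(previous: int, current: int, seconds: float) -> Tuple[int, float]:
    """Return the change in a counter and its per-second rate between samples."""
    delta = current - previous
    return delta, delta / seconds


if __name__ == "__main__":
    stats = parse_mntr(SAMPLE_MNTR)
    print("max latency:", stats["zk_max_latency"])
    print("connections:", stats["zk_num_alive_connections"])
    # Pretend a second sample taken 30 seconds later reported 1500 packets.
    print("packets received (delta, rate):",
          delta_and_rate(stats["zk_packets_received"], 1500, 30.0))
```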