Performance Co-Pilot™

Monitoring and Managing System-Level Performance

Performance Co-Pilot is an exciting family of products from SGI that delivers system-level performance monitoring and management services. Performance Co-Pilot is designed for both operational monitoring (tactical performance management) and the in-depth analysis that is needed to understand and manage the hardest performance problems in our most complex systems (strategic performance management).

Performance Co-Pilot can collect performance data from the hardware, the operating system, layered services, end-user applications, the network, and distributed solution architectures. Performance Co-Pilot tools process this information to provide a complete picture of the factors influencing resource utilization, bottlenecks, and end-user performance, delivering unprecedented power and flexibility in the ways you may view and manage the performance of your system.

Features

This document describes the product features of Performance Co-Pilot™ and includes the following sections.

The Performance Co-Pilot Product Overview provides a shorter description of the product capabilities.

Product Positioning and Background

Performance Co-Pilot is an exciting family of products from SGI that provides a suite of tools that cooperate to deliver system-level performance monitoring and management services. Performance Co-Pilot is designed for both operational monitoring (tactical performance management) and the in-depth analysis that is needed to understand and manage the hardest performance problems in our most complex systems (strategic performance management).

The focus of Performance Co-Pilot is on system-level performance, where the contributing factors may span multiple areas, namely:

  • The system hardware
  • The operating system software, including the kernel
  • Layered services (as provided by SGI, ISVs, and local infrastructure applications)
  • End-user applications (in particular those with mission-critical objectives)
  • The network
  • Distributed solution architectures that involve multiple hosts (client-server, federated servers, etc.)

Given the diverse ownership of the areas and the radical variations in application mixes and operational environments between systems, Performance Co-Pilot has been designed to deliver a rich and powerful collection of services, data collection tools, performance metric delivery infrastructures, and tool kits that can be deployed, configured, extended, and customized to meet the performance needs of individual customers and sites.

Performance Co-Pilot collects and makes available low-level performance data from the hardware and the operating system, and more abstract or application-specific performance data from layered services (such as domain name servers, Web and e-mail servers, and RPC servers), environmental monitors, Cisco® routers, response-time probes, and other "interesting" processes (e.g., those making excessive resource demands). Libraries, tools, debuggers, and source-code examples are provided to encourage new agents to be developed and deployed to export performance data from end-user applications and quality-of-service probes. Performance Co-Pilot is targeted at those with an interest in overall system performance: performance analysts, benchmarkers, engineering developers, database administrators, capacity planners, and system administrators.

Visualization of an SGI Origin System

Uniform Naming and Access to Performance Metrics

The Performance Co-Pilot protocols and interfaces provide an abstraction that hides all of the implementation details from multiple domains of performance metrics (e.g., where the performance data comes from, who owns it, and how it was collected). Metadata describing the format, interpretation, units, and scale of the data is also provided so that the data semantics can be discovered at run-time and the semantics may change over time without requiring changes to the applications that process the performance data.

At the lowest level, performance metrics are collected and managed in autonomous performance domains, e.g., IRIX®, a Web server, or an end-user application, and the Performance Co-Pilot infrastructure reflects this with independent Performance Metric Domain Agents (PMDAs or plugins) for each domain, as shown below.

PCP infrastructure

The Performance Metrics Collector Daemon (PMCD) is a message-routing server: it accepts requests from the client monitoring tools, forwards the relevant components of each request to the appropriate PMDAs, co-ordinates the responses from the PMDAs, and sends a single reply to the client tool.
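The routing role described above can be illustrated with a small model. This is a sketch only, not the Performance Co-Pilot protocol: the `Agent` and `Collector` classes, the metric names, and the idea of routing on the name prefix are all assumptions made for illustration.

```python
from collections import defaultdict


class Agent:
    """Hypothetical stand-in for a PMDA: owns one domain of metrics."""

    def __init__(self, domain, values):
        self.domain = domain
        self._values = values  # metric name -> current value

    def fetch(self, names):
        # Return only the metrics this agent was asked for.
        return {name: self._values[name] for name in names}


class Collector:
    """Toy message router in the spirit of the collector daemon."""

    def __init__(self):
        self._agents = {}

    def register(self, agent):
        self._agents[agent.domain] = agent

    def fetch(self, names):
        # Split the request by domain (assumed here to be the text
        # before the first dot in the metric name).
        by_domain = defaultdict(list)
        for name in names:
            by_domain[name.split(".", 1)[0]].append(name)
        # Forward each part to its agent and merge into a single reply.
        reply = {}
        for domain, subset in by_domain.items():
            reply.update(self._agents[domain].fetch(subset))
        return reply


# Hypothetical domains and metric names, for illustration only.
collector = Collector()
collector.register(Agent("kernel", {"kernel.load": 0.42, "kernel.nprocs": 117}))
collector.register(Agent("web", {"web.requests": 2048}))

reply = collector.fetch(["kernel.load", "web.requests"])
print(reply)
```

The client in this model sees one request and one reply, and never needs to know which agent supplied which value.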

Flexible Logging and Retrospective Analysis

Often, performance analysis is expedited when it is possible to compare today's end-user performance, activity levels, and resource utilization against the same information from yesterday, last week, or last month. This form of retrospective playback is most useful in problem analysis, hypothesis evaluation, remote diagnosis, and capacity planning.

The Performance Co-Pilot archive logger may be configured to collect the necessary information with user-defined coverage (in terms of the scope and level of detail of the desired performance metrics) and frequency. The profile of performance data being logged can be changed dynamically to accommodate changing levels of system activity and/or collect more-detailed information over a short period of time for later analysis. Archive logs may be accumulated either at the host being monitored, at a monitoring workstation, or both.

A universal replay mechanism (modeled on a VCR paradigm) is used by most Performance Co-Pilot tools to provide "stop, seek, rewind, and replay at variable speed" processing of historical performance data. The requirement for uniformity also leads Performance Co-Pilot to treat real-time and historical sources of performance data as interchangeable and semantically equivalent. A set of scripts and control files combine to provide integrated management of the process of collecting Performance Co-Pilot archives, including automatic starting and monitoring of the logger processes, daily log rotation, log culling, log merging and extraction, and flexible deployment of the logs and logging processes across multiple hosts.
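One way to picture the interchangeability of real-time and historical sources is a tool written against a single sampling interface. The sketch below assumes hypothetical `ArchiveSource` and `LiveSource` classes and a `next_sample()` method; none of these names come from Performance Co-Pilot, and variable-speed replay is not modeled.

```python
import bisect


class ArchiveSource:
    """Hypothetical archive: (timestamp, value) samples sorted by time."""

    def __init__(self, samples):
        self._times = [t for t, _ in samples]
        self._values = [v for _, v in samples]
        self._pos = 0

    def seek(self, timestamp):
        # Position at the first sample at or after the given time.
        self._pos = bisect.bisect_left(self._times, timestamp)

    def rewind(self):
        self._pos = 0

    def next_sample(self):
        if self._pos >= len(self._times):
            return None  # end of archive
        sample = (self._times[self._pos], self._values[self._pos])
        self._pos += 1
        return sample


class LiveSource:
    """Hypothetical live source: calls a probe function at each step."""

    def __init__(self, probe, clock):
        self._probe = probe
        self._clock = clock

    def next_sample(self):
        return (self._clock(), self._probe())


def mean_of(source, count):
    """A 'tool' that needs only next_sample(), so either source works."""
    values = []
    for _ in range(count):
        sample = source.next_sample()
        if sample is None:
            break
        values.append(sample[1])
    return sum(values) / len(values) if values else None


# Replay two samples from an archive, starting at time 60.
archive = ArchiveSource([(0, 10.0), (60, 20.0), (120, 30.0), (180, 40.0)])
archive.seek(60)
archive_mean = mean_of(archive, 2)

# Run the same tool against a simulated live source.
ticks = iter(range(100))
live = LiveSource(probe=lambda: 5.0, clock=lambda: next(ticks))
live_mean = mean_of(live, 3)

print(archive_mean, live_mean)
```

Because `mean_of` depends only on the shared interface, the same analysis runs unchanged against yesterday's archive or the running system.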

Distributed Operation

From a purely pragmatic viewpoint, a single workstation must be able to concurrently monitor the performance of multiple remote hosts. At the same time, a single host may be monitored from multiple remote workstations. Performance Co-Pilot uses a classical "client/server" architecture to provide seamless and concurrent access to performance metrics, independent of their host locations. In this way, Performance Co-Pilot enables centralized performance monitoring and management for highly distributed application deployments.

Automated Reasoning

Performance Co-Pilot provides an inference engine that evaluates a set of assertions against a time-series of performance data collected in real-time or from one or more Performance Co-Pilot archives. For those assertions that are found to be true, the inference engine is able to print messages, activate alarms, write syslog entries, and launch arbitrary programs.

Typical use of automated reasoning about system-level performance might include:

  • Monitoring for exceptional performance conditions
  • Raising alarms
  • Automated filtering of acceptable performance
  • Early warning of pending performance problems
  • Automating the initiation of corrective action
  • A "call home" to the support center
  • Retrospective performance audits
  • Evaluating assertions about "before and after" performance in the context of upgrades or system reconfiguration
  • Hypothesis evaluation for capacity planning
  • Use as part of the post mortem analysis following a system failure
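The idea of evaluating assertions against a time series and acting on those that hold can be sketched as follows. This is a simplified model, not the inference engine's rule language: the `Rule` class, the metric names, and the thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Sample = Dict[str, float]


@dataclass
class Rule:
    name: str
    predicate: Callable[[Sample], bool]     # the assertion to test
    action: Callable[[str, Sample], None]   # what to do when it holds


def evaluate(rules: List[Rule], series: List[Sample]) -> None:
    """Test every rule against every sample, firing actions for true ones."""
    for sample in series:
        for rule in rules:
            if rule.predicate(sample):
                rule.action(rule.name, sample)


alerts = []


def record_alert(rule_name, sample):
    # Stand-in for printing a message, raising an alarm, or writing to syslog.
    alerts.append((rule_name, sample["time"]))


# Hypothetical assertions and thresholds, for illustration only.
rules = [
    Rule("high_cpu", lambda s: s["cpu_util"] > 0.9, record_alert),
    Rule("low_free_mem", lambda s: s["free_mb"] < 64, record_alert),
]

# The series could equally be gathered in real time or read from an archive.
series = [
    {"time": 0, "cpu_util": 0.35, "free_mb": 512},
    {"time": 10, "cpu_util": 0.95, "free_mb": 300},
    {"time": 20, "cpu_util": 0.97, "free_mb": 48},
]

evaluate(rules, series)
print(alerts)
```

Separating the assertion from the action is what lets one rule set serve both live alarms and retrospective audits in this model.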

Visualization of Exported Performance Data

Visualization of CPU utilization

  • 2-D graphical monitors include: a strip-chart tool (pmchart) to display trends over time and a tool to display LED indicators and utilization meters (pmgadgets)
  • A generalized 3-D Inventor™ application (pmview) supports dynamic displays of clusters of related performance metrics as height- and/or color-modulated objects on a common base plane; visualizations generated by pmview are customizable to construct arbitrarily complex scenes with objects modulated by the performance metrics from one or more hosts and one or more domains of interest
  • Special-purpose visualization tools are provided for SGI® Origin® family systems, array platforms, and FailSafe™ clusters (where IRIS FailSafe™ is deployed)
  • Assorted text-based tools display arbitrary groups of performance metric values, suitable for ASCII logs or enquiry over a low-speed connection
  • A host-based security model provides optional control over the execution of Performance Co-Pilot service requests from designated remote hosts and/or workstations
  • Fully exposed and documented APIs (including sample source programs) are provided for a variety of application development tasks through the following libraries:
    • libpcp for building site-specific or application-specific performance monitoring tools; this library provides access to all of the services of the underlying Performance Co-Pilot infrastructure
    • libpcp_pmda for creating agents to collect new performance data
    • libpcp_trace for instrumenting applications through a simple interface and exporting the resultant performance measures using the Performance Co-Pilot trace domain agent; library bindings for C, C++, Java™, and Fortran are provided

Customization and Adaptation

For many end users of Performance Co-Pilot, the most important and useful performance metrics are not those supported by the shipped Performance Co-Pilot product, but rather new performance metrics that characterize the essence of "good" or "bad" performance at their sites or within their application environments. Examples include application instrumentation that counts and exports transaction service times, the rate of progress on solving a problem, operations completed, or the queue length of pending tasks.

Performance Co-Pilot provides libraries, source code examples, tools, and debuggers that encourage the development and integration of new sources of performance metrics as peers into the collection infrastructure. In the simplest case, agent development involves no more than writing a single function in C to instantiate metric values on demand, with all communication, protocol handling, and administrative services delegated to Performance Co-Pilot libraries. Production-quality agents have been routinely developed in a matter of a few hours, and source code examples are included in the Performance Co-Pilot distribution.
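The division of labor in the simplest case, where the agent author supplies one function and the library handles everything else, might look like the sketch below. It is written in Python rather than C for brevity, and the `AgentFramework` class, the text protocol, and the metric names are all hypothetical; this is not the libpcp_pmda interface.

```python
from typing import Callable, List


class AgentFramework:
    """Hypothetical stand-in for the library side: it owns request handling."""

    def __init__(self, metrics: List[str], fetch: Callable[[str], float]):
        self._metrics = set(metrics)
        self._fetch = fetch

    def handle(self, request: str) -> str:
        # Toy text protocol: "FETCH name1 name2 ..." -> "name1=value1 ..."
        words = request.split()
        if not words or words[0] != "FETCH":
            return "ERROR unknown request"
        parts = []
        for name in words[1:]:
            if name in self._metrics:
                parts.append(f"{name}={self._fetch(name)}")
            else:
                parts.append(f"{name}=?")  # metric not exported by this agent
        return " ".join(parts)


# --- The part an agent author writes in this sketch: one function. ---
pending_tasks = ["job-a", "job-b", "job-c"]  # hypothetical application state
completed = 41


def fetch_metric(name: str) -> float:
    """Instantiate a metric value on demand from application state."""
    if name == "app.queue.length":
        return len(pending_tasks)
    if name == "app.ops.completed":
        return completed
    raise KeyError(name)


agent = AgentFramework(["app.queue.length", "app.ops.completed"], fetch_metric)
print(agent.handle("FETCH app.queue.length app.ops.completed"))
```

Everything above the dashed comment would be supplied by the library in this model, which is why the author's share of the work stays small.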

For applications where source code is available, another Performance Co-Pilot library supports a simple API for collecting measurements of application activity and aggregate elapsed time for arbitrary operations. The library automatically arranges for the collected data to be exported to the Performance Co-Pilot framework using a purpose-built domain agent (the trace PMDA).
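Counting operations and accumulating their elapsed time can be sketched with a small in-process tracer. The `Tracer` class, its `transaction` method, and the tag name below are assumptions made for illustration and are not the libpcp_trace API; a clock is injected so the example is deterministic.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Tracer:
    """Hypothetical tracer: counts operations and sums their elapsed time."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.count = defaultdict(int)
        self.elapsed = defaultdict(float)

    @contextmanager
    def transaction(self, tag):
        start = self._clock()
        try:
            yield
        finally:
            # Record even if the operation raised, so failures still count.
            self.count[tag] += 1
            self.elapsed[tag] += self._clock() - start

    def export(self):
        # What a collection agent might read: per-tag count and total time.
        return {tag: (self.count[tag], self.elapsed[tag]) for tag in self.count}


# A fake clock makes the timings reproducible: 0.25 s, then 0.5 s.
fake_time = iter([0.0, 0.25, 1.0, 1.5])
tracer = Tracer(clock=lambda: next(fake_time))

for _ in range(2):
    with tracer.transaction("db.query"):
        pass  # the instrumented operation would run here

summary = tracer.export()
print(summary)
```

In real use the default clock would be kept, and the exported counts and totals would be handed to whatever agent publishes them.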