1. Service Level Agreement (SLA) Widget
Contents
- Service Level Agreement (SLA) Widget
- Important Note
- Objectives
- Understanding Alerts
- Understanding Reports
- Edit Widget, Step 1: Creating the IT Service
- Edit Widget, Step 2: Service Coverage, Service Level, and alert
- Visualization: Service Monitoring for the SLA Manager and Service Owner
- Visualization: Service Monitoring for 1st Level Support
- Understanding Why a Service Is Down
- Technical Annex
1.1. Important Note
This widget is available only under a separate, specific commercial agreement. Please contact your NEXThink sales contact if you need this widget.
1.2. Objectives
The Service Level Agreement widget for NEXThink Portal (called the SLA widget in this document) provides an IT Service Management module on top of traditional network and system monitoring solutions. In the first release of this widget, Nagios is the only monitoring solution supported.
The SLA widget covers two levels of the service oriented organization:
- Service Level - the level between an IT service provider and a customer.
- Operational Level – the level between an IT service provider and another part of the same organization, governing the delivery of an infrastructure service.
Although in some large organizations the Operational Level is contractually defined by an Operational Level Agreement (OLA), the SLA widget for NEXThink Portal simplifies this structure: it defines only the operational components that provide the services to the customer.
The target users of the SLA widget are:
- The Service Level Manager and the Service Owner
Definition of ITIL v3:
The Service Level Manager is responsible for negotiating Service Level Agreements and ensuring that these are met. He makes sure that all IT Service Management processes, Operational Level Agreements and Underpinning Contracts are appropriate for the agreed service level targets. The Service Level Manager also monitors and reports on service levels.
The Service Owner is responsible for delivering a particular service within the agreed service levels. Typically, he acts as the counterpart of the Service Level Manager when negotiating Operational Level Agreements (OLAs). Often, the Service Owner will lead a team of technical specialists or an internal support unit.
- With the SLA widget, the SLA Manager and the Service Owner must be able to regularly check the status of the service levels, understand the service status and trends, react quickly and appropriately to maintain the service at the right level, and plan improvements or adaptations to the service level agreements.
- The 1st Level Support
Definition of ITIL v3:
The responsibility of 1st Level Support is to register and classify received Incidents and to undertake an immediate effort in order to restore a failed IT Service as quickly as possible. If no ad-hoc solution can be achieved, 1st Level Support will transfer the Incident to expert Technical Support Groups (2nd Level Support). 1st Level Support also processes Service Requests and keeps users informed about their Incidents' status at agreed intervals.
- With the Nagios module, 1st Level Support must know the status of each service (up/down) in real time and understand the root cause of the problem if a failure occurs.
The next figure shows the mapping between the technical structure providing the IT services, the roles and responsibilities of the IT organization and the structure of the SLA widget.
Structure, Roles and SLA Widget correspondence schema
The IT services are based on systems and network devices, usually configured and maintained by specialized engineers. The Operational components are groups of devices designed and configured to work together in order to provide the needed IT services. These Operational components are maintained by specialized teams, for example the Network and Telecommunication team, the Microsoft team or the Business Applications team. Finally, the Operational components deliver specific services to the organization, such as Email and Messaging, SAP or a key business application.
1.3. Understanding Alerts
When a service goes down, or up again, an alert can be sent by e-mail.
In the widget configuration, e-mail addresses can be specified for this purpose (see below).
An alert is sent as a plain text e-mail to the specified recipients, and looks like this:
from NEXThink Portal - dev <pb@nexthink.com>
to somebody@gmail.com
date Fri, Aug 28, 2009 at 12:12 PM
subject [SLA Alert] SLA XYZ DOWN
mailed-by nexthink.com

Source: nxt-l23
Service: SLA XYZ
Status: DOWN
Time: 28.08.2009 @ 12:12:34 (Europe/Zurich)
Service time: WITHIN for 12h 13min and still for 11h 47min

Current SLA status
SLA: 17.21%
Target: 99.00%
Forecast: 0.00%
------------------------------------------------------------------------------
This is an SLA service alert from NEXThink Portal.
You received this email because you are configured to receive service alerts.
Contact your NEXThink Portal administrator if you want to change or remove the subscribtion.
Notes:
- If a service is down when the Portal starts and alerts are enabled for that service, an alert is sent.
- See Portal Reports for SMTP configuration.
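To illustrate the shape of such an alert, here is a minimal Python sketch that builds a plain-text message with the same subject pattern and a few of the body fields from the sample. It is not the Portal's implementation: the function name `build_alert` and the choice of fields are assumptions made for illustration only.

```python
from email.message import EmailMessage
from typing import List


def build_alert(service: str, status: str, source: str,
                recipients: List[str]) -> EmailMessage:
    """Build a plain-text alert shaped like the sample (hypothetical helper)."""
    msg = EmailMessage()
    # Subject pattern copied from the sample: "[SLA Alert] <service> <status>"
    msg["Subject"] = f"[SLA Alert] {service} {status}"
    msg["To"] = ", ".join(recipients)
    # set_content() with a str produces a text/plain body
    msg.set_content(
        f"Source: {source}\n"
        f"Service: {service}\n"
        f"Status: {status}\n"
    )
    return msg


alert = build_alert("SLA XYZ", "DOWN", "nxt-l23", ["somebody@gmail.com"])
print(alert["Subject"])
```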
1.4. Understanding Reports
SLA widgets can generate report parts:
Note:
See Portal Reports for reports creation, and SMTP configuration.
1.5. Edit Widget, Step 1: Creating the IT Service
This step describes the creation and management of the IT Service in SLA widget.
The Service is a combination of one or several interdependent Operational components, each based on one or more Probes (Nagios checks).
For example, suppose we want to define the internal Email and Collaboration Service of an organization. This service is built on a Microsoft Exchange cluster, connected to two redundant switches and dependent on the organization's two Domain Controller servers.
The Service is then called “Email and Collaboration” and is a logical combination of the Operational components.
The Operational components are
- The two Exchange servers EXCH01, EXCH02
- The two front switches FESW01, FESW02
- The two domain controllers DC1, DC2
The Probes are based on Nagios checks representing the working status of the devices and applications.
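The "logical combination" for the example service can be sketched in Python as follows. This is only an illustration: it assumes that each pair of devices is redundant (one device of the pair is enough) and that all three Operational components are required. The actual expression is whatever is configured in the widget, and the function name `service_is_up` is hypothetical.

```python
def service_is_up(status: dict) -> bool:
    """Evaluate the example service from per-device up/down states.

    Assumption: each pair is redundant (OR), all components are required (AND).
    """
    exchange = status["EXCH01"] or status["EXCH02"]
    switches = status["FESW01"] or status["FESW02"]
    domain_controllers = status["DC1"] or status["DC2"]
    return exchange and switches and domain_controllers


all_up = {name: True for name in
          ("EXCH01", "EXCH02", "FESW01", "FESW02", "DC1", "DC2")}
one_switch_down = dict(all_up, FESW01=False)
both_switches_down = dict(all_up, FESW01=False, FESW02=False)

print(service_is_up(all_up))              # True
print(service_is_up(one_switch_down))     # True: redundancy covers the failure
print(service_is_up(both_switches_down))  # False: the component is down
```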
1.5.1. Defining a Service
Create the Service:
1.5.2. Configuring the Operational Components
Configure the Operational components and the logical expression summarizing the Service dependencies:
1.5.3. Configuring the Probes and Checks
For each Operational Component, compose the checks defining it:
- Adding or editing a probe opens a dialog to locate and select a host, and to select several Nagios checks:
1.5.4. Configuring the End-user Impact
Configure the End-user impact for the Service configured. The End-user impact is computed on NEXThink Engine and reflects the user perception of the service being down.
1.6. Edit Widget, Step 2: Service Coverage, Service Level, and alert
The Service Level Agreement includes at least the service hours, the service availability and the service review period. Alerts can be sent when the status of an operational component changes.
1.6.1. Defining the SLA
Define the SLA percentage and service time (aka coverage):
1.6.2. Defining alert recipients
In the Addresses for alerts field, the recipient can be specified as an e-mail address. Multiple recipients are separated by commas.
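As an illustration of the comma-separated format, the following Python sketch splits such a field into individual addresses. The function name `parse_recipients` is hypothetical, and trimming surrounding whitespace is an assumption about how the field is interpreted.

```python
def parse_recipients(field: str) -> list:
    """Split a comma-separated address field, ignoring blanks (assumed behavior)."""
    return [address.strip() for address in field.split(",") if address.strip()]


recipients = parse_recipients("sla.manager@example.com, support@example.com")
print(recipients)
```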
1.7. Visualization: Service Monitoring for the SLA Manager and Service Owner
The Service Owner needs two views in order to evaluate adequately the delivery of the service level agreed.
- For the active contract period:
- The SLA current status to check what is the situation today.
- The forecast SLA for the end of the period in order to prevent under performance.
- For the past contract periods:
- A historical view of the previous performances
1.7.1. Monitoring the Active Contract Period
SLA monitoring, current period
- SLA current level.
- SLA forecast based on the past events of the period (= resulting SLA if the past outage ratio stays the same until the end of the period).
- Selected period (mtd for month to date).
- Elapsed time in the active period.
- SLA target and outage for the selected period.
- Operational components details, with historical view for the selected period, % and # of impacted users, # users using each component, outage duration per component.
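The forecast described in the list above can be read as a simple extrapolation: if the outage ratio observed so far stays constant, the SLA at the end of the period equals one minus that ratio. The Python sketch below follows this reading. The function name, the clamping at zero and the handling of an empty elapsed coverage are assumptions, not the widget's documented behavior.

```python
def sla_forecast(cov_elapsed: float, down_elapsed: float) -> float:
    """Forecast the end-of-period SLA as a fraction between 0 and 1.

    cov_elapsed:  coverage time elapsed so far in the period
    down_elapsed: downtime within that elapsed coverage (same unit)
    """
    if cov_elapsed <= 0:
        return 1.0  # assumption: nothing measured yet, no outage to extrapolate
    outage_ratio = down_elapsed / cov_elapsed
    return max(0.0, 1.0 - outage_ratio)


# 240 hours of coverage elapsed, 6 of them down: the ratio is 2.5 %
forecast = sla_forecast(240.0, 6.0)
print(f"{forecast:.2%}")
```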
1.7.2. Monitoring Past Contract Periods
SLA monitoring, past periods
- SLA current level.
- SLA historical view for the full available period (line for the selected period).
- SLA target and outage for the selected period.
- Selected period.
- Operational components details, with historical view for the selected period, % and # of impacted users, # users using each component, outage duration per component.
1.8. Visualization: Service Monitoring for 1st Level Support
The first level support needs at least the current period view for checking the forecast SLA and the current service and operational component status.
Service monitoring
- Current service status. Two colors
- red: service down
- green: service up
- Within or outside service time (aka coverage).
- Current period.
- SLA forecast based on the past events of the period (= resulting SLA if the past outage ratio stays the same until the end of the period).
- Operational components details, with current status duration, # of impacted users, # of status changes.
- This component has been down for 5 minutes, impacting 42 users and bringing the whole service down.
- This component has been up overall for more than one week. The exclamation point indicates that one of the underlying probes is down, but thanks to redundancy the whole component is considered to be up.
- Details on operational components can be displayed by clicking on the information icon.
The automatic calculation of the information is executed every 5 minutes and refreshed in the widget with the same frequency. A manual calculation and visualization is optional.
1.9. Understanding Why a Service Is Down
In the Service Monitoring for 1st Level Support view, icons provide access to a window showing the status of the Operational components. The list of probes (Nagios services in our case) and their status is displayed.
Operational component status, showing the individual states of the checks
2. Technical Annex
2.1. Installation
Please see these pages (log in or contact your account manager to access them):
Portal Plugin downloads and installation steps: SLA Installation
NDO Utils installation steps: nexthink/SlaWidget/ndo2db installation for Nagios
2.2. Calculations
2.2.1. Uptime and Outage Definition
- Example where the sub-service is the AND combination of two Nagios factors:
                      :  _____________________________________________                :
Coverage              :_/                                             \_______________:
                      :                                               d               :
                      :____      ______________________           ____________________:
Nagios probe P1       :    \____/                      \_________/                    :
                      :_____________    ___________________                  _________:
Nagios probe P2       :             \__/                   \________________/         :
                      :                                                  e            :
Component = P1 and P2 :____      ___    _______________                _______________:
                      :XX  \____/   \__/               \______________/XXXXXXXXXXXXXXX:
                      :      a       b                 c              d               :
Component uptime      : ___     ____   ________________                               :
Component outage      :    _____    ___                _______________                :
- a: down because P1 is down
- b: down because P2 is down
- c: going down because P1 goes down
- d: end of coverage
- e: disregarded (since d) because it is outside the coverage
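The rules illustrated by cases a to e can be sketched in Python by sampling the timeline in equal slots. The slot representation and the function name `component_times` are assumptions made for illustration; only the AND combination and the "outside coverage is disregarded" rule come from the example above.

```python
def component_times(coverage, p1_up, p2_up):
    """Count uptime and outage slots for a component defined as P1 AND P2.

    Each argument is a list of booleans, one per time slot of equal length.
    """
    uptime = outage = 0
    for covered, p1, p2 in zip(coverage, p1_up, p2_up):
        if not covered:
            continue  # outside coverage: disregarded (case e)
        if p1 and p2:
            uptime += 1
        else:
            outage += 1  # at least one probe is down (cases a, b, c)
    return uptime, outage


# Ten slots; coverage spans slots 1 to 7 (it ends at "d")
coverage = [False] + [True] * 7 + [False] * 2
p1_up = [True, True, False, True, True, True, False, False, False, True]
p2_up = [True, True, True, True, False, True, True, True, True, False]

print(component_times(coverage, p1_up, p2_up))  # (3, 4)
```

In this run the P2 failure in the last slot does not count as outage because it falls outside the coverage.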
2.2.2. Period's Percentage
Percentage:
Tup / Tcov
Tup:
uptime duration, within coverage, during the chosen period
Tcov:
coverage duration, during the chosen period
2.2.3. Outage Duration
Outage duration:
Tdown
Tdown:
downtime duration, within coverage, during the chosen period
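As a worked example of these two definitions, the Python sketch below computes the period's percentage from Tup and Tcov, and derives the outage duration under the assumption that every covered instant is either up or down (so that Tdown = Tcov - Tup). The function names are hypothetical.

```python
def period_percentage(t_up: float, t_cov: float) -> float:
    """Period's percentage: Tup / Tcov, expressed in percent."""
    return 100.0 * t_up / t_cov


def outage_duration(t_up: float, t_cov: float) -> float:
    """Tdown, assuming covered time is either up or down: Tcov - Tup."""
    return t_cov - t_up


# A 720-hour coverage with 712.8 hours of uptime
t_cov, t_up = 720.0, 712.8
print(round(period_percentage(t_up, t_cov), 2))  # 99.0
print(round(outage_duration(t_up, t_cov), 2))    # 7.2
```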
2.2.4. Main Trend
Value for a given time t within the period:
start of          t         end of
period            |         period
:________________________________:
Trend value for t:
( Tcov - Tdown(t) ) / Tcov
Tcov:
coverage duration, during the chosen period
Tdown(t):
downtime duration, within coverage, from start of period, to time t
Consequently, the trend line:
- starts at 100%
- ends at the period's percentage
- can only sink
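The trend formula can be checked with a short Python sketch that accumulates downtime step by step and evaluates ( Tcov - Tdown(t) ) / Tcov at each step. The step-based input and the function name `main_trend` are assumptions made for illustration.

```python
def main_trend(t_cov: float, downtime_per_step: list) -> list:
    """Return the trend value at the period start and after each step.

    downtime_per_step[i] is the downtime within coverage during step i.
    """
    values = [1.0]  # no downtime accumulated yet: the line starts at 100 %
    t_down = 0.0
    for step_downtime in downtime_per_step:
        t_down += step_downtime  # Tdown(t) only grows over the period
        values.append((t_cov - t_down) / t_cov)
    return values


trend = main_trend(100.0, [0.0, 5.0, 0.0, 10.0])
print([round(v, 2) for v in trend])  # [1.0, 1.0, 0.95, 0.95, 0.85]
```

The last value, 0.85, equals the period's percentage for a coverage of 100 with 15 units of downtime, and the sequence never increases.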
2.2.5. Sub-services Trend
- Daily percentage during the period.
2.2.6. User Impact
- The various wanted failures are counted within the outage times. See § "Uptime and outage definition" above.
2.3. Debug / demo modes
It is possible to configure SLA widgets so that they do not require a Nagios / NDOUtils installation, or not even an Engine, in order to perform their computations. This is mostly used for debug or demo purposes.
This setting is not available using the Wizard, but only by manually editing a YAML file, for example:
- widget:
    ngetID: com_nexthink_SlaTimeSeriesNGet
    (...)
    config:
      (...)
      fake: 2 # 0, 1 or 2
      (...)
The option is named fake, and can have the following values:
| fake value | mode | source for SLA information | source for User Impact information |
|---|---|---|---|
| 0 (equivalent to no option) | normal | NDOUtils | NEXThink Engine(s) |
| 1 | debug | CSV file ${NEXTHINK_PORTAL_HOME}/ngets/com_nexthink_SlaTimeSeriesNGet/bin/test_data.csv | Random |
| 2 | demo | Random | Random |
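The mapping between the fake option and the data sources can be summarized in a small Python lookup. This is only a restatement of the values listed above: the names `FAKE_MODES` and `describe_fake` are hypothetical, and the widget's real option handling is not shown in this document.

```python
# fake value -> (mode, source for SLA information, source for User Impact)
FAKE_MODES = {
    0: ("normal", "NDOUtils", "NEXThink Engine(s)"),
    1: ("debug", "CSV file (test_data.csv)", "Random"),
    2: ("demo", "Random", "Random"),
}


def describe_fake(config: dict) -> tuple:
    """Look up the mode for a widget config; a missing option behaves like 0."""
    return FAKE_MODES[config.get("fake", 0)]


print(describe_fake({"fake": 2}))  # ('demo', 'Random', 'Random')
print(describe_fake({}))           # ('normal', 'NDOUtils', 'NEXThink Engine(s)')
```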
2.4. Limitations
- Widget: date range drop-down menu: no i18n of period choice
- Widget: date range: date is not synchronized with the other widgets
- Front-end: allows choosing dates outside the Engine range for computation
- MS Internet Explorer: issue when switching tabs too quickly (the page must be refreshed)
