Fault & Performance Management (FMS & PM)
Learning Objective: Understand Fault Management (FMS) and Performance Management (PM) – two core OSS functions from the FCAPS framework. Learn how they work together to keep the network operational and deliver on SLAs.
Not every event is a fault. OSS platforms receive millions of operational events daily (state changes, threshold crossings, user actions), but only a subset become actionable alarms requiring operator intervention.
Fault Management (FMS) – Detecting and Fixing Failures
Fault Management deals with network failures – from detection to resolution. It answers: "What broke, when did it break, and how do we fix it?"
Key Fault Management Functions
- Alarm monitoring – Receiving SNMP traps, syslog events, telemetry alerts, and vendor fault notifications
- Alarm correlation – Reducing thousands of alarms to their root cause
- Suppression & deduplication – Filtering duplicate and non-actionable alarms
- Ticket creation – Generating trouble tickets in NOC systems
- Escalation – Routing to appropriate teams based on severity/SLA
- Root cause analysis (RCA) – Identifying the underlying failure
- Clear & close – Verifying resolution and closing alarms
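The deduplication step in the list above can be sketched in a few lines. This is a minimal illustration, not a reference to any particular OSS product: the `Alarm` fields, the `deduplicate` function, and the 300-second window are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    source: str       # network element that raised the alarm (hypothetical field)
    alarm_id: str     # alarm identifier, e.g. "LINK-3-UPDOWN"
    timestamp: float  # seconds since some reference point


def deduplicate(alarms, window_s=300.0):
    """Collapse repeats of the same (source, alarm_id) seen within window_s.

    Returns one entry per distinct occurrence, with a repeat counter.
    The window length is an assumed tuning parameter.
    """
    results = []  # one dict per distinct alarm occurrence
    latest = {}   # (source, alarm_id) -> index of the most recent entry in results
    for alarm in sorted(alarms, key=lambda a: a.timestamp):
        key = (alarm.source, alarm.alarm_id)
        idx = latest.get(key)
        if idx is not None and alarm.timestamp - results[idx]["last_seen"] <= window_s:
            # Same alarm seen again soon after: count it instead of listing it.
            results[idx]["count"] += 1
            results[idx]["last_seen"] = alarm.timestamp
        else:
            latest[key] = len(results)
            results.append({"alarm": alarm, "count": 1, "last_seen": alarm.timestamp})
    return results
```

An operator view built on this would show each alarm once with its repeat count, rather than one row per notification.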
Real-World Fault Example
Router interface down → Alarm "LINK-3-UPDOWN" → NMS correlates with fibre cut alarm → Root cause identified → Ticket created → Field engineer dispatched
Without correlation, the interface-down alarm on its own would not reveal that a fibre cut is the root cause.
Performance Management (PM) – Measuring How Well the Network Performs
Performance Management collects and analyses network metrics over time. It answers: "How fast, reliable, and efficient is the network?"
Key Performance KPIs
- Throughput – Bits per second (bps) transmitted
- Latency – Time for a packet to travel from source to destination (milliseconds)
- Packet loss – Percentage of packets lost
- Jitter – Variation in packet delay
- CPU/Memory utilisation – Device resource usage
- PRB utilisation – Share of Physical Resource Blocks (PRBs) in use; indicates 5G radio capacity usage
- Availability – Service uptime percentage (e.g., 99.999% ≈ 5.26 minutes of downtime per year)
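The availability figure can be checked with simple arithmetic. The sketch below converts an availability percentage into the downtime it allows; the function name and the choice of a 365.25-day year are assumptions made for illustration.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes, assuming an average year


def allowed_downtime_minutes(availability_pct, period_minutes=MINUTES_PER_YEAR):
    """Downtime budget (in minutes) implied by an availability target."""
    return period_minutes * (1 - availability_pct / 100)


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} min/year")
```

For "five nines" this works out to roughly 5.26 minutes per year, while 99.9% allows close to nine hours.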
Real-World PM Example
gNB reports: PRB utilisation = 92% → Threshold crossing alert (TCA) → NMS analyses trend → Capacity planning triggered → Additional spectrum allocated
Performance degradation can occur without any fault – PM detects these "soft failures".
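A threshold crossing alert like the one in this example can be sketched as a small state machine. The version below assumes separate raise and clear thresholds (hysteresis) so that a KPI hovering near the limit does not generate a stream of alerts; the 90% and 80% values and the function name are illustrative assumptions, not values from any standard.

```python
def threshold_crossings(samples, raise_at=90.0, clear_at=80.0):
    """Return (index, event, value) tuples for TCA raise and clear events.

    raise_at and clear_at are assumed thresholds; clear_at is set lower than
    raise_at so small fluctuations around the limit do not cause flapping.
    """
    events = []
    active = False  # True while a TCA is currently raised
    for i, value in enumerate(samples):
        if not active and value >= raise_at:
            active = True
            events.append((i, "raise", value))
        elif active and value <= clear_at:
            active = False
            events.append((i, "clear", value))
    return events


# PRB utilisation samples (percent), one per collection interval
print(threshold_crossings([70, 85, 92, 91, 88, 79, 95]))
```

Note that the samples at 91% and 88% produce no new event: the alert stays raised until utilisation falls to the clear threshold.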
Active vs Passive Performance Monitoring
- Passive monitoring: Collecting metrics generated naturally by network traffic and devices. Low overhead, reflects real user experience.
- Active monitoring: Generating synthetic traffic or probes to measure latency, packet loss, and service quality. Useful for baseline measurements and SLA validation.
- Modern operators use both approaches for complete operational visibility.
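As a rough sketch of what an active monitoring system does with its probe results, the function below turns a list of round-trip times into latency, packet loss, and jitter figures. It assumes a lost probe is recorded as `None`, and it uses the mean absolute difference between consecutive replies as a simplified jitter measure (RTP, for instance, uses a smoothed estimator instead). All names are hypothetical.

```python
from statistics import mean


def summarise_probes(rtts_ms):
    """Summarise synthetic probe results.

    rtts_ms holds one entry per probe sent: the round-trip time in
    milliseconds, or None if no reply was received (assumed convention).
    """
    sent = len(rtts_ms)
    received = [r for r in rtts_ms if r is not None]
    loss_pct = 100.0 * (sent - len(received)) / sent if sent else 0.0
    latency_ms = mean(received) if received else None
    # Simplified jitter: average change between consecutive received RTTs.
    deltas = [abs(b - a) for a, b in zip(received, received[1:])]
    jitter_ms = mean(deltas) if deltas else 0.0
    return {"latency_ms": latency_ms, "loss_pct": loss_pct, "jitter_ms": jitter_ms}


print(summarise_probes([10.0, 12.0, None, 11.0, 15.0]))
```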
Critical Distinction: Fault vs Performance
- Fault = binary (working or broken). Example: Router interface down, power failure, hardware malfunction.
- Performance = degradation without failure. Example: Congestion increases latency, packet loss rises, CPU hits 100% – network still "works" but poorly.
- A network can have performance issues with zero faults. This is why PM is separate from FMS.
Alarm type (category) and severity are independent attributes. The type describes the nature of the issue (communications, quality of service, processing error, environmental, equipment). The severity describes the operational impact (critical, major, minor, warning). For example, a communications alarm can be critical (complete outage) or minor (intermittent errors).
Types of Alarms in Telecom Networks
| Alarm Type (Category) | Description | Example |
|---|---|---|
| Communications Alarm | Loss of communication between network elements | LINK-DOWN, BGP session down |
| Quality of Service Alarm | Service quality degraded beyond a configured threshold | Latency > 50 ms, packet loss > 1% |
| Processing Error Alarm | Software or processing fault | CPU overload, memory corruption |
| Environmental Alarm | Physical conditions | High temperature, power failure, door open |
| Equipment Alarm | Hardware-specific failure | Fan failure, line card error |
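Because type and severity are separate attributes, an alarm record typically carries both. The sketch below models that with two enumerations; the class names, field names, and numeric severity ranks are assumptions used only to make sorting possible in the example.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum


class AlarmType(Enum):
    COMMUNICATIONS = "communications"
    QUALITY_OF_SERVICE = "quality of service"
    PROCESSING_ERROR = "processing error"
    ENVIRONMENTAL = "environmental"
    EQUIPMENT = "equipment"


class Severity(IntEnum):
    # Numeric ranks are an assumption for this example; they only drive sorting.
    WARNING = 1
    MINOR = 2
    MAJOR = 3
    CRITICAL = 4


@dataclass(frozen=True)
class Alarm:
    source: str
    alarm_type: AlarmType
    severity: Severity
    text: str


def by_priority(alarms):
    """Most severe first; alarms of equal severity keep their original order."""
    return sorted(alarms, key=lambda a: a.severity, reverse=True)


alarms = [
    Alarm("router-1", AlarmType.COMMUNICATIONS, Severity.MINOR, "intermittent errors"),
    Alarm("router-2", AlarmType.COMMUNICATIONS, Severity.CRITICAL, "link down"),
    Alarm("site-7", AlarmType.ENVIRONMENTAL, Severity.MAJOR, "high temperature"),
]
for alarm in by_priority(alarms):
    print(alarm.severity.name, alarm.alarm_type.value, alarm.source, alarm.text)
```

The two communications alarms land at opposite ends of the list, which is the point: the category says what kind of problem it is, the severity says how urgently it needs attention.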
Real-World Example: Alarm Correlation
A fibre cut in Mumbai affects 50 routers and 3 gNBs:
- Without correlation: 500+ alarms flood the NOC → operators overwhelmed → root cause unclear
- After correlation: NMS identifies "fibre cut" as root cause → suppresses 500 downstream alarms → displays single alarm with impacted service count
- Result: NOC dispatches field team to fibre location, not 50 individual routers
In real networks, alarm correlation can reduce the volume of alarms presented to operators during an alarm storm by 90-95%.
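One common way to implement the correlation in the fibre cut example is to walk the network topology: an alarming element whose upstream dependency is also alarming is treated as a symptom, not a cause. The sketch below assumes the topology is available as a simple mapping from each element to the element it depends on; real platforms use richer topology models and rule engines, and all names here are hypothetical.

```python
def correlate(alarming, upstream):
    """Group alarming elements under their topmost alarming ancestor.

    alarming: set of element names that currently have active alarms.
    upstream: dict mapping an element to the element it depends on
              (assumed single-parent topology; top-level elements are absent).
    Returns {root_cause_element: [suppressed downstream elements]}.
    """
    groups = {}
    for element in sorted(alarming):
        root = element
        node = element
        seen = {element}
        while node in upstream:
            node = upstream[node]
            if node in seen:  # guard against a cyclic topology definition
                break
            seen.add(node)
            if node in alarming:
                root = node   # a higher alarming element explains this one
        groups.setdefault(root, [])
        if element != root:
            groups[root].append(element)
    return groups


topology = {
    "router-1": "fibre-A",
    "router-2": "fibre-A",
    "gnb-1": "router-1",
    "router-9": "fibre-B",
}
active = {"fibre-A", "router-1", "router-2", "gnb-1", "router-9"}
print(correlate(active, topology))
```

Here `fibre-A` is reported as a root cause with three suppressed downstream alarms, while `router-9` stays a root cause of its own because nothing upstream of it is alarming.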
Modern OSS platforms increasingly automate remediation workflows, an approach often called closed-loop assurance. Example: high utilisation detected → orchestration system automatically scales resources or reroutes traffic without manual intervention.
PM data is collected and aggregated at different granularities depending on how it is used:
- Near real-time (1-5 min): Live dashboards, threshold alerts, active assurance
- Periodic (15-60 min): Trend analysis, daily reporting
- Daily: Capacity planning, long-term analytics, regulatory reporting
- Data retention: Raw data is kept for days to weeks, aggregated data for months to years
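The step from raw samples to aggregated data can be illustrated with a simple rollup. The sketch below groups timestamped samples into fixed buckets (15 minutes by default) and keeps the average, maximum, and sample count per bucket; the function name and the choice of summary statistics are assumptions.

```python
from collections import defaultdict
from statistics import mean


def rollup(samples, bucket_s=900):
    """Aggregate (epoch_seconds, value) samples into fixed-size time buckets.

    Returns {bucket_start: {"avg": ..., "max": ..., "count": ...}} ordered by time.
    bucket_s=900 gives 15-minute buckets.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        bucket_start = int(ts // bucket_s) * bucket_s
        buckets[bucket_start].append(value)
    return {
        start: {"avg": mean(values), "max": max(values), "count": len(values)}
        for start, values in sorted(buckets.items())
    }


raw = [(0, 10.0), (300, 20.0), (600, 30.0), (900, 40.0)]
print(rollup(raw))
```

Keeping the maximum alongside the average matters for capacity work, since short peaks disappear when only averages are retained.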
How FMS and PM Work Together
A complete operational view requires both:
- FMS tells you – "Router interface is down"
- PM tells you – "Before failure, packet loss increased to 5% and latency doubled"
- Combined: You can predict failures before they happen (predictive maintenance) and understand impact
AI/ML platforms analyse historical FMS and PM data to predict failures before they occur, correlate seemingly unrelated events, and recommend remediation actions automatically.
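A very simple form of combining the two data sources is to look back at PM samples from the period just before a fault and check whether they had already degraded. The sketch below does this with a fixed lookback window and a "doubled relative to baseline" test, mirroring the latency example above; the function names, the one-hour window, and the factor of 2 are assumptions, and a production system would use far more sophisticated models.

```python
def pm_before_fault(fault_ts, pm_samples, lookback_s=3600):
    """Return the (timestamp, value) PM samples from the window before a fault."""
    return [(ts, v) for ts, v in pm_samples if fault_ts - lookback_s <= ts < fault_ts]


def degraded_before(fault_ts, pm_samples, baseline, factor=2.0, lookback_s=3600):
    """True if any sample in the lookback window reached factor x baseline."""
    window = pm_before_fault(fault_ts, pm_samples, lookback_s)
    return any(value >= baseline * factor for _, value in window)


# Latency samples in ms as (epoch_seconds, value); fault alarm raised at t=7200
latency = [(1800, 10.0), (3600, 11.0), (5400, 14.0), (6900, 22.0), (7500, 0.0)]
print(pm_before_fault(7200, latency))
print(degraded_before(7200, latency, baseline=10.0))
```

Faults that repeatedly show this kind of precursor are natural candidates for an early-warning threshold.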
Connection to BSS
- SLA credits: Performance degradation detected by PM is flagged as an SLA breach → BSS auto-credits affected customers
- Customer notifications: Fault correlation identifies impacted customers → BSS sends proactive outage alerts
- Reporting: PM data feeds customer-facing SLA reports and dashboards
- Revenue assurance: FMS tracks service downtime to verify billing accuracy
Common Interview Questions
Q1. What is the difference between Fault Management and Performance Management?
Fault Management detects failures (binary: working/broken). Performance Management measures degradation over time (latency, throughput, packet loss). A network can have performance issues without any fault.
Q2. What is alarm correlation and why is it important?
Alarm correlation reduces thousands of alarms to a single root cause. Without it, NOC operators face alarm storms and cannot identify the actual failure.
Q3. What are common severity levels for alarms?
Critical, Major, Minor, Warning, Indeterminate, Cleared. TMF v4+ uses lowercase values (critical, major, minor); legacy systems often use capitalized ones (Critical, Major, Minor).
Q4. How does PM support capacity planning?
PM collects utilisation trends over time. Analysis shows when resources will reach capacity, enabling proactive upgrades before congestion occurs.
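As a rough illustration of this kind of trend analysis, the sketch below fits a straight line to equally spaced utilisation samples and estimates how many more periods remain before the line reaches capacity. Linear growth is an assumption made for simplicity (real traffic is often seasonal), and the function name is hypothetical.

```python
def periods_until_capacity(utilisation, capacity=100.0):
    """Estimate periods remaining until a linear trend reaches capacity.

    utilisation: equally spaced samples (e.g. weekly peak utilisation in %).
    Returns the number of periods after the last sample, or None if there
    are too few samples or the trend is flat or falling.
    """
    n = len(utilisation)
    if n < 2:
        return None
    x_mean = (n - 1) / 2
    y_mean = sum(utilisation) / n
    # Ordinary least-squares fit of y = intercept + slope * x, with x = 0..n-1
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(utilisation))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = y_mean - slope * x_mean
    periods = (capacity - intercept) / slope - (n - 1)
    return max(0.0, periods)


print(periods_until_capacity([60, 62, 64, 66, 68]))
```

With utilisation growing by two points per period from 68%, the estimate is 16 periods, which gives planners a concrete lead time for an upgrade.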
Q5. What is a threshold crossing alert (TCA)?
A TCA is a common telecom mechanism in which a warning or alarm is raised when a KPI crosses a configured threshold. TCAs can be warnings (soft) or actionable alarms (hard).
Q6. What is the difference between active and passive performance monitoring?
Passive monitoring collects metrics from real traffic. Active monitoring generates synthetic probes to measure baseline performance. Both are used together.
Takeaways for You
- Fault Management (FMS) detects failures – alarms, correlation, suppression, RCA, tickets.
- Performance Management (PM) measures quality – throughput, latency, packet loss, utilisation, SLAs.
- Fault is binary (broken/working). Performance is degradation without failure.
- Alarm correlation transforms alarm storms into actionable root causes (90-95% reduction).
- Threshold crossing alerts (TCAs) warn about performance degradation before failure.
- Active vs passive monitoring – synthetic probes vs real traffic metrics.
- Closed-loop assurance automates remediation based on FMS/PM data.
- FMS + PM together enable predictive maintenance and complete operational visibility.
Recommended Next Learning Path