Fault & Performance Management (FMS & PM)

🎯 Learning Objective: Understand Fault Management (FMS) and Performance Management (PM) - two core OSS functions from the FCAPS framework. Learn how they work together to keep the network operational and deliver on SLAs.

Event Management vs Fault Management

Not every event is a fault. Large telecom OSS platforms may receive millions of operational events daily (state changes, threshold crossings, user actions), but only a subset become actionable alarms requiring operator intervention.

📋

Events

State changes, threshold crossings, user actions, informational logs - millions per day

⚠️

Alarms / Faults

Actionable subset requiring operator intervention - correlation, ticket creation, dispatch

🎯

Goal

Filter noise, identify real problems, enable rapid resolution

📊 REAL EXAMPLE

Event: "Port 3 state changed from UP to DOWN" (device-level notification)
Fault: After correlation with 50 other ports, identified as "Fibre cut affecting 50 devices" (actionable alarm)

Fault Management (FMS) - Detecting and Fixing Failures

Fault Management deals with network failures - from detection to resolution. It answers: "What broke, when did it break, and how do we fix it?"

Key Functions

Alarm monitoring - SNMP traps, syslog messages, streaming telemetry via gNMI, and vendor notifications
Alarm correlation - Reducing thousands of alarms to root cause
Suppression & deduplication - Filtering duplicate/non-actionable alarms
Ticket creation - Generating trouble tickets in NOC systems
Escalation - Routing based on severity/SLA
Root cause analysis (RCA) - Identifying underlying failure
Clear & close - Verifying resolution and closing alarms

Key Metrics

MTTR (Mean Time To Repair/Restore)
T2R (Trouble to Resolve)
Alarm volume - Raw vs correlated
False positive rate
First-time fix rate
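As a worked example of the first metric, MTTR is the average time between detecting a fault and restoring service. The sketch below assumes each closed incident is recorded as a (detected, restored) pair; the function name and the incident data are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_restore(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (restored - detected) across closed incidents."""
    if not incidents:
        raise ValueError("need at least one closed incident")
    total = sum((restored - detected for detected, restored in incidents), timedelta())
    return total / len(incidents)


# Hypothetical closed incidents: (detected, restored)
incidents = [
    (datetime(2024, 1, 10, 10, 32), datetime(2024, 1, 10, 11, 2)),   # 30 min
    (datetime(2024, 1, 11, 9, 0), datetime(2024, 1, 11, 10, 30)),    # 90 min
    (datetime(2024, 1, 12, 14, 15), datetime(2024, 1, 12, 15, 15)),  # 60 min
]
mttr = mean_time_to_restore(incidents)
print(mttr.total_seconds() / 60)  # MTTR in minutes
```

Real platforms usually compute this per severity or per service, since one long outage can dominate a single global average.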

📡 REAL-WORLD FAULT EXAMPLE

Router interface down → Alarm "LINK-3-UPDOWN" generated
↓
NMS correlates with 50 other "interface down" alarms
↓
Root cause identified - Fibre cut
↓
Trouble ticket created → Field engineer dispatched to fibre location
Without correlation, NOC would see 50 router alarms, not the fibre cut root cause.

How Duplicate Alarms Are Limited (Alarm Storm Control)

Without controls, a single failure can generate thousands of alarms (called an "alarm storm"), overwhelming NOC engineers. OSS platforms use five key techniques to reduce noise from 10,000+ raw events to just a handful of actionable alarms.

1. Deduplication

What it does: Same alarm from same device within time window → one active alarm with counter.

🔧 OSS Configuration (IBM Netcool/OMNIbus-style, SQL-like pseudocode; values hard-coded for illustration)

                        begin atomic

                          -- Try to find an identical alarm from the last 60 seconds

                          update alerts.status 

                          set tally = tally + 1, LastOccurrence = getdate() 

                          where Summary = 'LINK-DOWN: Gig0/1' 

                            and Node = 'RTR-MAA-01' 

                            and getdate() - LastOccurrence < 60;

                          -- If no row was updated, insert a new one

                          if (sqlrowcount = 0) then

                            insert into alerts.status (Summary, Node, Severity, FirstOccurrence, LastOccurrence, tally)

                            values ('LINK-DOWN: Gig0/1', 'RTR-MAA-01', 3, getdate(), getdate(), 1);

                          end if;

                        end

💻 Console Message (NOC Engineer View)

                        [10:32:15] ALARM: RTR-MAA-01 | Interface Gig0/1 DOWN | Severity: Major

                        [10:32:17] ALARM: RTR-MAA-01 | Interface Gig0/1 DOWN | Severity: Major (DEDUP: occurrence #2)

                        [10:32:19] ALARM: RTR-MAA-01 | Interface Gig0/1 DOWN | Severity: Major (DEDUP: occurrence #3)

                        ...

                        [10:33:00] ACTIVE ALARMS: 

                          > RTR-MAA-01 | Gig0/1 DOWN | Occurrences: 47 | First seen: 10:32:15 | Last seen: 10:33:00

Result: 47 events → 1 alarm line with counter. Engineer sees one issue, not a flood.
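The same idea can be sketched outside any particular OSS product. The Python below is a minimal illustration, assuming an alarm is identified by its (node, summary) pair and that a repeat counts as a duplicate only if it arrives within the window; the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ActiveAlarm:
    node: str
    summary: str
    first_seen: float
    last_seen: float
    tally: int = 1


class Deduplicator:
    """Collapse repeats of the same (node, summary) inside a time window."""

    def __init__(self, window_seconds: float = 60.0):
        self.window_seconds = window_seconds
        self.active = {}  # (node, summary) -> ActiveAlarm

    def ingest(self, node: str, summary: str, timestamp: float) -> ActiveAlarm:
        key = (node, summary)
        alarm = self.active.get(key)
        if alarm is not None and timestamp - alarm.last_seen < self.window_seconds:
            # Repeat inside the window: bump the counter, keep one alarm line.
            alarm.tally += 1
            alarm.last_seen = timestamp
        else:
            # First occurrence, or the previous one is too old: start fresh.
            alarm = ActiveAlarm(node, summary, timestamp, timestamp)
            self.active[key] = alarm
        return alarm


dedup = Deduplicator(window_seconds=60)
for offset in range(0, 94, 2):  # 47 events, one every 2 seconds
    dedup.ingest("RTR-MAA-01", "LINK-DOWN: Gig0/1", 1000.0 + offset)

alarm = dedup.active[("RTR-MAA-01", "LINK-DOWN: Gig0/1")]
print(len(dedup.active), alarm.tally)
```

Keying on (node, summary) is the design choice that matters here: too narrow a key lets duplicates through, too broad a key merges unrelated problems.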

2. Suppression (Root Cause Based)

What it does: If root cause identified, downstream (child) alarms are automatically hidden.

🔧 OSS Configuration (SMARTS-style topology-based suppression, illustrative YAML)

                        - name: "FIBER-CUT-SUPPRESSES-INTERFACE-DOWN"

                          condition: 

                            root_cause_type: "FiberCut"

                            child_alarm_type: "InterfaceDown"

                            topology_relation: "is_connected_to"

                          action:

                            suppress_child: true

                            set_parent_ticket: "Reference ticket #TKT-12345"

                            log_message: "Child alarm suppressed. Refer to root cause alarm."

💻 Console Message (NOC Engineer View)

                        [10:35:00] ROOT CAUSE ALARM: FIBER CUT | Location: Mumbai-Delhi Link ID: 8172 | Severity: CRITICAL

                        [10:35:01] (Suppressed) RTR-MAA-01 | Gig0/1 DOWN - Suppressed (Root: Fiber Cut)

                        [10:35:01] (Suppressed) RTR-DEL-02 | Gig0/4 DOWN - Suppressed (Root: Fiber Cut)

                        ... (46 more suppressed alarms)

                        [10:35:05] ACTIVE ALARMS: [1 Critical, 48 Suppressed]

                          > ** FIBER CUT (Root Cause) ** | Impact: 48 services, 3 sites

Result: 49 alarms → 1 root cause alarm. NOC dispatches to fibre location, not 48 routers.
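A minimal sketch of topology-based suppression, assuming the OSS already knows which interfaces ride on which fibre link. The topology table, link ID format, and alarm dictionaries below are hypothetical stand-ins for an inventory lookup.

```python
# Hypothetical topology: fibre link ID -> (node, interface) endpoints riding on it.
TOPOLOGY = {
    "LINK-8172": {("RTR-MAA-01", "Gig0/1"), ("RTR-DEL-02", "Gig0/4")},
}


def apply_suppression(alarms, topology):
    """Split alarms into (visible, suppressed) using active FiberCut root causes."""
    cut_links = {a["link"] for a in alarms if a["type"] == "FiberCut"}
    affected = set()
    for link in cut_links:
        affected |= topology.get(link, set())

    visible, suppressed = [], []
    for a in alarms:
        # Only InterfaceDown alarms on an affected endpoint count as children.
        is_child = (
            a["type"] == "InterfaceDown"
            and (a["node"], a["interface"]) in affected
        )
        (suppressed if is_child else visible).append(a)
    return visible, suppressed


alarms = [
    {"type": "FiberCut", "link": "LINK-8172"},
    {"type": "InterfaceDown", "node": "RTR-MAA-01", "interface": "Gig0/1"},
    {"type": "InterfaceDown", "node": "RTR-DEL-02", "interface": "Gig0/4"},
    {"type": "InterfaceDown", "node": "RTR-HYD-01", "interface": "Gig0/7"},
]
visible, suppressed = apply_suppression(alarms, TOPOLOGY)
print(len(visible), len(suppressed))
```

The unrelated RTR-HYD-01 alarm stays visible, which is the point: suppression should hide only alarms that the topology ties to the root cause.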

3. Throttling / Rate Limiting

What it does: Limit the maximum number of alarms from a single device per minute.

🔧 Device-Level Configuration (Cisco IOS-style, illustrative; trap throttling commands vary by platform and release)

                        ! Configure SNMP trap throttling

                        snmp-server trap throttling 10 60000

                        ! 10 traps max per 60,000 ms (60 seconds) window

                        ! Excess traps are dropped, not sent to OSS

                        snmp-server trap-source Loopback0

                        snmp-server host 10.1.100.50 version 2c public throttling

💻 OSS Gateway Log Message

                        [10:40:00] INFO: RTR-HYD-01 | 2000 events received in the last 60 seconds.

                        [10:40:00] INFO: RTR-HYD-01 | THROTTLING ACTIVE: Allowed rate = 10 per 60 seconds.

                        [10:40:00] WARN: RTR-HYD-01 | 1990 events dropped in this window (burst suppressed).

                        [10:41:00] INFO: RTR-HYD-01 | 8 events processed in this window (within limit).

Result: The router may flap 1000 times, but the OSS processes at most 10 events per minute from it, so the NOC is not flooded.
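The rate limit itself is simple to express. The sketch below is one possible approach, a fixed-window counter allowing 10 events per 60 seconds per device; it is not how any specific router or gateway implements throttling, and the class name is hypothetical.

```python
class FixedWindowThrottle:
    """Allow at most max_events per window; count everything else as dropped."""

    def __init__(self, max_events: int = 10, window_seconds: float = 60.0):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self.window_start = None
        self.count = 0
        self.dropped = 0

    def allow(self, timestamp: float) -> bool:
        # Open a new window on the first event or once the old one has expired.
        if self.window_start is None or timestamp - self.window_start >= self.window_seconds:
            self.window_start = timestamp
            self.count = 0
        if self.count < self.max_events:
            self.count += 1
            return True
        self.dropped += 1
        return False


throttle = FixedWindowThrottle(max_events=10, window_seconds=60)
# 2000 events arriving inside a single 60-second window.
forwarded = sum(throttle.allow(i * 0.03) for i in range(2000))
print(forwarded, throttle.dropped)
```

A fixed window is the simplest choice; a token bucket or sliding window smooths bursts at window boundaries, at the cost of slightly more state.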

4. Correlation (Root Cause Analysis)

What it does: 100 alarms from fibre cut → 1 root cause alarm.

🔧 Correlation Rule (simplified MATCH_RECOGNIZE-style pseudo-SQL, in the spirit of Esper EPL / Apache Flink SQL)

                        SELECT 

                          'FIBER-CUT' as RootCause,

                          count(*) as ChildAlarmCount,

                          max(Severity) as MaxSeverity

                        FROM AlarmsStream

                        MATCH_RECOGNIZE (

                          PARTITION BY LocationID

                          ORDER BY timestamp

                          PATTERN (A B+)

                          DEFINE 

                            A AS type = 'OpticalPowerLoss' AND value < -30,

                            B AS type = 'InterfaceDown' 

                              AND abs(timestamp - A.timestamp) < 5000

                        )

                        GROUP BY LocationID;

💻 Correlation Engine Log

                        [10:45:00] CORRELATOR: Processing 212 raw events...

                        [10:45:01] CORRELATOR: Identified root cause - "OpticalPowerLoss" at AGG-BLR-01

                        [10:45:01] CORRELATOR: Matched 198 "InterfaceDown" child alarms

                        [10:45:02] CORRELATOR: Raising 1 alarm: "FIBER-CUT (Correlated)"

                        [10:45:02] CORRELATOR: Suppressing all 212 correlated raw events.

                        [10:45:03] SUCCESS: Alarm correlation ratio 212:1

Result: 212 raw events → 1 correlated root cause alarm.
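The rule above can be approximated in plain Python. This sketch assumes events are dictionaries with `ts` (milliseconds), `location`, `type`, `value` (optical power in dBm), and `severity` (higher number means more severe); all of these field names are hypothetical. Like the rule's `abs(...)` condition, it accepts InterfaceDown events within 5 seconds on either side of the optical power loss.

```python
from collections import defaultdict


def correlate_fibre_cuts(events, window_ms=5000, power_threshold_dbm=-30.0):
    """Group events by location and raise one FIBER-CUT alarm per matching group."""
    by_location = defaultdict(list)
    for e in sorted(events, key=lambda e: e["ts"]):
        by_location[e["location"]].append(e)

    correlated = []
    for location, evs in by_location.items():
        # A: first optical power reading below the threshold at this location.
        trigger = next(
            (e for e in evs
             if e["type"] == "OpticalPowerLoss" and e["value"] < power_threshold_dbm),
            None,
        )
        if trigger is None:
            continue
        # B+: one or more InterfaceDown events close in time to the trigger.
        children = [
            e for e in evs
            if e["type"] == "InterfaceDown" and abs(e["ts"] - trigger["ts"]) < window_ms
        ]
        if children:
            correlated.append({
                "root_cause": "FIBER-CUT",
                "location": location,
                "child_alarm_count": len(children),
                "max_severity": max(e["severity"] for e in [trigger, *children]),
            })
    return correlated


events = [
    {"ts": 0, "location": "AGG-BLR-01", "type": "OpticalPowerLoss", "value": -31.2, "severity": 5},
    {"ts": 800, "location": "AGG-BLR-01", "type": "InterfaceDown", "severity": 4},
    {"ts": 1200, "location": "AGG-BLR-01", "type": "InterfaceDown", "severity": 4},
    {"ts": 9000, "location": "AGG-BLR-01", "type": "InterfaceDown", "severity": 4},  # outside window
    {"ts": 100, "location": "AGG-HYD-02", "type": "InterfaceDown", "severity": 4},   # no optical loss
]
result = correlate_fibre_cuts(events)
print(result)
```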

5. Flapping Detection

What it does: Interface up/down every 5 seconds → treat as single problem, not 1000 events.

🔧 Device-Level Configuration (Cisco IOS - Interface Dampening)

                        interface GigabitEthernet0/2

                         description Link to Customer A

                         dampening 30 1000 3000 180

                         ! Parameters: 

                         !   half-life = 30 sec (how fast penalty decays)

                         !   re-use = 1000 (penalty to stop suppressing)

                         !   suppress = 3000 (penalty to start suppressing)

                         !   max-suppress = 180 sec (max time suppressed)

💻 Console Message (NOC Engineer View)

                        [10:50:00] ALARM: RTR-CHN-01 | Interface Gi0/2 | State: UP

                        [10:50:05] ALARM: RTR-CHN-01 | Interface Gi0/2 | State: DOWN

                        [10:50:10] ALARM: RTR-CHN-01 | Interface Gi0/2 | State: UP

                        [10:50:15] ALARM: RTR-CHN-01 | Interface Gi0/2 | State: DOWN

                        [10:50:20] WARN: RTR-CHN-01 | Interface Gi0/2 | FLAPPING DETECTED (4 state changes in 20 sec)

                        [10:50:21] INFO: RTR-CHN-01 | Interface Gi0/2 | STATE SUPPRESSED for 120 seconds

                        [10:50:22] INFO: Single ticket created - "Interface Gi0/2 flapping, investigating"

                        [10:50:23 to 10:52:20] ... (No further alarms from this interface) ...

Result: 1000 state changes → 1 "flapping detected" alarm + quiet period.
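The dampening parameters above describe a penalty model: each flap adds a penalty, the penalty decays exponentially with the half-life, and the interface is suppressed between the suppress and re-use thresholds. The sketch below models that behaviour under two stated assumptions: a penalty of 1000 per flap, and no max-suppress cap. The class name is hypothetical and this is not vendor code.

```python
class FlapDampener:
    """Penalty-based flap dampening with exponential decay."""

    def __init__(self, half_life=30.0, reuse=1000.0, suppress=3000.0, flap_penalty=1000.0):
        self.half_life = half_life
        self.reuse = reuse
        self.suppress = suppress
        self.flap_penalty = flap_penalty  # assumed penalty added per flap
        self.penalty = 0.0
        self.last_update = None
        self.suppressed = False

    def _decay(self, now):
        # Halve the penalty for every half_life seconds elapsed.
        if self.last_update is not None:
            elapsed = now - self.last_update
            self.penalty *= 0.5 ** (elapsed / self.half_life)
        self.last_update = now

    def record_flap(self, now):
        self._decay(now)
        self.penalty += self.flap_penalty
        if self.penalty >= self.suppress:
            self.suppressed = True
        return self.suppressed

    def is_suppressed(self, now):
        self._decay(now)
        if self.suppressed and self.penalty < self.reuse:
            self.suppressed = False  # penalty decayed below the re-use threshold
        return self.suppressed


dampener = FlapDampener(half_life=30, reuse=1000, suppress=3000)
states = [dampener.record_flap(t) for t in (0, 5, 10, 15)]  # one flap every 5 s
print(states)
print(dampener.is_suppressed(45), dampener.is_suppressed(75))
```

With these numbers the fourth flap pushes the penalty past 3000, and the interface is released roughly 53 seconds later once the penalty has decayed below 1000.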

Technique	NOC Benefit	Configuration Owner	Example Trigger
Deduplication	Stops 500 identical alarms	OSS Admin (platform settings)	"Port down" repeating every 2 seconds
Suppression	Hides 500 child alarms when root cause found	OSS Engineer (correlation rules)	Fibre cut taking 50 routers down
Throttling	Prevents 5000 alarms/sec from crashing OSS	Network Engineer (on the router)	Router CPU spike causing duplicate traps
Correlation	Converts 500 child alarms into 1 actionable alarm	OSS Engineer (correlation engine)	BGP flaps + optical loss = Fibre cut
Flapping Detection	Stops 1000 interface state changes from paging NOC	Network Engineer (on the device)	Faulty SFP causing link oscillation

The Result (What the NOC actually sees):

                    ============================================================

                     NOC ACTIVE ALARMS (Last 5 minutes)

                    ============================================================

                     Raw Events Received: 10,847

                     After Dedup: 3,221

                     After Throttling: 892

                     After Suppression/Correlation: 4

                    ============================================================

                    [CRITICAL] 09:45:12 | Fiber Cut Detected (Bangalore - Chennai) 

                               | Root Cause: Optical Power Loss @ AGG-BLR-01 (-31.2 dBm)

                               | Impact: 48 Services, 3 Sites, 212 Interfaces (Suppressed)

                    [MAJOR]   09:46:00 | RTR-HYD-01 | CPU = 98% for 5 min (Throttled to 1 alert/min)

                    [MINOR]   09:47:30 | CRS-DEL-02 | Fan Tray 2 degraded

                    [WARNING] 09:48:15 | SW-BLR-03 | High packet drop on VLAN 100 (Correlated)

                    ============================================================

                     10,847 raw events -> 4 actionable alarms -> 2 engineers paged

                    ============================================================

10,000+ raw events → 4 actionable alarms → 2-3 engineers engaged

Rules Configuration - Technical Level

Pattern matching for alarm correlation
Topology-based suppression rules
Root cause trees
Time windows for deduplication
Rate limits and burst tolerances
Usually configured by OSS engineers

Rules Configuration - Business Level

Severity mapping (what is Critical vs Major)
Who gets notified and how (SMS, email, page)
Maintenance windows (planned work suppression)
VIP customer tagging and SLA rules
Escalation policies and schedules
Usually configured by NOC managers
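Two of the business-level rules, maintenance windows and notification policy, can be sketched together. The calendar, the severity-to-channel mapping, and the function name below are all hypothetical; a real NOC would load them from its own policy store.

```python
from datetime import datetime

# Hypothetical planned-work calendar: node -> list of (start, end) windows.
MAINTENANCE_WINDOWS = {
    "RTR-MAA-01": [(datetime(2024, 1, 10, 1, 0), datetime(2024, 1, 10, 3, 0))],
}
# Hypothetical business mapping from severity to notification channels.
NOTIFY_BY_SEVERITY = {
    "Critical": ["page", "sms"],
    "Major": ["sms"],
    "Minor": ["email"],
}


def notification_channels(node, severity, raised_at):
    """Return the channels to notify, or [] if planned work covers the alarm."""
    for start, end in MAINTENANCE_WINDOWS.get(node, []):
        if start <= raised_at < end:
            return []  # planned work: suppress notification
    return NOTIFY_BY_SEVERITY.get(severity, [])


print(notification_channels("RTR-MAA-01", "Critical", datetime(2024, 1, 10, 2, 0)))
print(notification_channels("RTR-MAA-01", "Critical", datetime(2024, 1, 10, 4, 0)))
```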

Root Cause Analysis (RCA) - Automation Levels

RCA identifies the underlying fault that caused alarms. Automation levels vary based on complexity.

Scenario Type	Automation Level	Example
Simple, common faults	Fully automated	"Single access interface down; alarm identified automatically and routed by standard rules"
Correlation-based	Fully automated	100 BGP flaps + optical loss on the same path may be correlated to a likely fibre cut root cause, based on topology and correlation rules.
Medium complexity	Semi-automated (system suggests, human confirms)	Multi-card failure in one node with complex dependencies
Complex, cross-domain	Human-led with AI assistance	"Slow VPN across 3 countries" - multiple domains involved
Novel faults	Manual	First occurrence of new error code with no historical data

Industry Trend:

AIOps (Artificial Intelligence for IT Operations) is making RCA more automated over time. ML models trained on historical incidents can predict root cause with increasing accuracy.

Performance Management (PM) - Measuring Network Quality

Performance Management collects and analyzes network metrics over time. It answers: "How fast, reliable, and efficient is the network?"

Key KPIs

Throughput - Bits per second (bps)
Latency - Packet travel time (ms)
Packet loss - % of packets lost
Jitter - Variation in packet delay
CPU/Memory utilisation - Device resources
PRB utilisation - Radio capacity usage (Physical Resource Blocks in LTE/5G NR)
Availability - uptime percentage, often tracked against targets such as 99.9%, 99.99%, or 99.999% depending on service criticality.
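To make the availability targets concrete, the short calculation below converts each percentage into the downtime it allows per year. It assumes a 365-day year; the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Yearly downtime budget implied by an availability target."""
    return (100.0 - availability_pct) / 100.0 * MINUTES_PER_YEAR


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} min/year")
```

Each extra "nine" cuts the budget by a factor of ten: roughly 8.8 hours, 53 minutes, and just over 5 minutes per year respectively.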

Data Granularities

Near real-time (1-5 min): Live dashboards, threshold alerts
Periodic (15-60 min): Trend analysis, daily reporting
Daily: Capacity planning, long-term analytics
Retention: Raw data (days-weeks), aggregated (months-years)

📡 REAL-WORLD PM EXAMPLE

gNB reports: PRB utilisation = 92% (exceeds 85% threshold)
↓
Threshold Crossing Alert (TCA) generated
↓
NMS analyses trend - Utilisation increasing 5% per week
↓
Capacity planning triggered → Additional spectrum allocated
Performance degradation detected before congestion impacted customers.
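The threshold check and the trend projection in this example can be sketched as follows. The weekly samples are hypothetical, chosen to end at 92%, and "5% per week" is read here as 5 percentage points per week, which is an assumption.

```python
def weekly_slope(samples):
    """Least-squares slope of evenly spaced samples (points per week)."""
    n = len(samples)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


THRESHOLD_PCT = 85.0
prb_utilisation = [77.0, 82.0, 87.0, 92.0]  # hypothetical weekly PRB utilisation (%)

current = prb_utilisation[-1]
tca_raised = current > THRESHOLD_PCT          # Threshold Crossing Alert
slope = weekly_slope(prb_utilisation)         # growth in points per week
weeks_to_full = (100.0 - current) / slope if slope > 0 else None

print(tca_raised, slope, weeks_to_full)
```

The projection (under two weeks to saturation here) is what turns a single TCA into a capacity planning action rather than just another alert.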

Is Performance Management End-to-End Automated?

Mostly yes on the "plumbing" (data movement), but humans are still needed for interpretation and decision-making.

Step	Automation Level	Notes
Data collection (PM counters, polling, streaming)	✅ Fully automated	Devices send or OSS polls on schedule
Ingestion into DB/Data Lake	✅ Fully automated	Scheduled via Airflow or similar
KPI calculation and aggregation	✅ Fully automated	Once formulas are defined in the system
Threshold Crossing Alerts (TCAs)	✅ Fully automated	Triggers when KPI exceeds threshold
Standard dashboards and reports	✅ Automated	Generated on schedule once configured
Choosing what thresholds to set	❌ Human decision	Requires domain knowledge and business context
Investigating anomalies	⚠️ Semi-automated	System flags outliers; humans investigate cause
Root cause of performance issue	⚠️ Partially automated	Correlation helps, but often needs human expertise

Bottom Line:

"Plumbing is automated; interpretation and design are still human."

Critical Distinction: Fault vs Performance

Fault = Binary

Working or broken

Router interface down
Power failure
Hardware malfunction
Fibre cut

Performance = Degradation

Without complete failure

Congestion increases latency
Packet loss rises to 2%
CPU hits 100%
Network still "works" but poorly

Key Insight: A network can degrade in performance even when no hard failure is detected; in some systems that degradation may later raise a QoS or threshold alarm. This is why PM is separate from FMS. Performance degradation often precedes faults, which is what makes predictive maintenance possible.

Alarm Category vs Alarm Severity

Category (Type)

Describes the nature of the issue

Communications Alarm - Loss of communication
Quality of Service Alarm - Performance degradation
Processing Error Alarm - Software or processing fault
Environmental Alarm - Temperature, power, door
Equipment Alarm - Fan, line card failure

Severity

Describes the operational impact

Critical - Service-affecting, immediate action
Major - Service degradation, urgent action
Minor - Non-service affecting, routine action
Warning - Informational, monitor
Indeterminate - Severity unknown
Cleared - Condition resolved

⚠️ EXAMPLE: SAME CATEGORY, DIFFERENT SEVERITY

Communications Alarm:
• Critical: Complete link failure affecting 1000 customers
• Minor: Intermittent BGP flapping with no customer impact
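Because category and severity are independent attributes, they are naturally modelled as two separate fields. The sketch below uses the categories and severities listed above; the numeric ordering of severities (higher means more urgent) and the class names are assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum


class Category(Enum):
    COMMUNICATIONS = "Communications Alarm"
    QUALITY_OF_SERVICE = "Quality of Service Alarm"
    PROCESSING_ERROR = "Processing Error Alarm"
    ENVIRONMENTAL = "Environmental Alarm"
    EQUIPMENT = "Equipment Alarm"


class Severity(IntEnum):
    # Assumed ordering: a higher number means a more urgent alarm.
    CLEARED = 0
    INDETERMINATE = 1
    WARNING = 2
    MINOR = 3
    MAJOR = 4
    CRITICAL = 5


@dataclass(frozen=True)
class Alarm:
    summary: str
    category: Category
    severity: Severity


alarms = [
    Alarm("Intermittent BGP flapping, no customer impact", Category.COMMUNICATIONS, Severity.MINOR),
    Alarm("Complete link failure affecting 1000 customers", Category.COMMUNICATIONS, Severity.CRITICAL),
]
# Same category, different severity: sort the worklist by impact, not by type.
worklist = sorted(alarms, key=lambda a: a.severity, reverse=True)
print([a.severity.name for a in worklist])
```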

Types of Alarms in Telecom Networks

Alarm Type (Category)	Description	Example
Communications Alarm	Loss of communication between NEs	LINK-DOWN, BGP session down
Quality of Service Alarm	Performance degradation below threshold	Latency > 50ms, packet loss > 1%
Processing Error Alarm	Software or processing fault	CPU overload, memory corruption
Environmental Alarm	Physical conditions	High temperature, power failure, door open
Equipment Alarm	Hardware-specific failure	Fan failure, line card error

Real-World Example: Alarm Correlation

Scenario: A fibre cut in Mumbai affects 50 routers and 3 gNBs

📡

Without Correlation

500+ alarms flood the NOC → operators overwhelmed → root cause unclear → delayed response

✅

After Correlation

NMS identifies "fibre cut" as root cause → suppresses 500 downstream alarms → displays single alarm with impacted service count

📊 BUSINESS IMPACT

Result: NOC dispatches field team to fibre location, not 50 individual routers.
Reduction: In this scenario, correlation collapses 500+ downstream alarms into a single root cause alarm.

Active vs Passive Performance Monitoring

Passive Monitoring

Collects metrics from real traffic
Low overhead
Reflects actual user experience
Example: NetFlow, sFlow, gNMI telemetry

Active Monitoring

Generates synthetic probes/traffic
Additional network overhead
Useful for baseline and SLA validation
Example: TWAMP, ICMP ping, HTTP probes

Modern Best Practice: Operators combine both approaches for complete operational visibility: passive for scale, active for specific SLAs and baseline measurements.
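For the active side, a probe run boils down to a list of round-trip times with gaps where probes were lost. The sketch below derives loss, latency, and jitter from such a list. It assumes lost probes are recorded as `None` and uses a simplified jitter definition (mean absolute difference between consecutive replies), not the smoothed estimator that some protocols specify.

```python
def probe_stats(rtts_ms):
    """Summarise one active probe run; None marks a lost probe."""
    received = [r for r in rtts_ms if r is not None]
    loss_pct = 100.0 * (len(rtts_ms) - len(received)) / len(rtts_ms)
    mean_latency = sum(received) / len(received) if received else None
    # Simplified jitter: average change between consecutive received replies.
    deltas = [abs(b - a) for a, b in zip(received, received[1:])]
    jitter = sum(deltas) / len(deltas) if deltas else 0.0
    return {"loss_pct": loss_pct, "mean_latency_ms": mean_latency, "jitter_ms": jitter}


# Hypothetical run of five probes, the third one lost.
stats = probe_stats([20.0, 22.0, None, 21.0, 25.0])
print(stats)
```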

How FMS and PM Work Together

A complete operational view requires both FMS and PM working in concert.

⚠️

FMS Tells You

"Router interface is down"
An alarm occurred at 10:32:15

📊

PM Tells You

"Before failure, packet loss increased to 5% and latency doubled over 10 minutes"

🔮 PREDICTIVE MAINTENANCE

Combined view enables:
• Predict failures before they happen (correlating PM degradation with future faults)
• Understand customer impact (which services were affected)
• Identify root cause faster (PM data shows what changed before the fault)
• Enable automated or semi-automated remediation workflows

Modern OSS Direction: Closed-Loop Assurance

Modern OSS platforms increasingly automate remediation workflows based on FMS and PM data.

📊

PM Detects

High utilisation / SLA degradation

⚙️

Policy & Orchestration

Policy-driven auto-scaling or traffic rerouting

✅

FMS Verifies

Correction confirmed, no new alarms
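The three stages above form one detect, act, verify cycle. The sketch below is a deliberately small model of that cycle: `remediate` and `count_new_alarms` are hypothetical callbacks standing in for the orchestration layer and the FMS, and the 20-point load shift is invented for the example.

```python
def closed_loop(utilisation_pct, threshold_pct, remediate, count_new_alarms):
    """Run one detect -> act -> verify cycle and report the outcome."""
    if utilisation_pct <= threshold_pct:
        return "no-action"                      # PM: nothing to do
    after = remediate(utilisation_pct)          # orchestration: scale or reroute
    if after <= threshold_pct and count_new_alarms() == 0:
        return "resolved"                       # FMS: correction confirmed
    return "escalate-to-noc"                    # loop could not close on its own


def reroute(pct):
    # Hypothetical remediation: rerouting moves 20 points of load elsewhere.
    return pct - 20


print(closed_loop(92, 85, reroute, count_new_alarms=lambda: 0))
print(closed_loop(92, 85, reroute, count_new_alarms=lambda: 3))
```

The explicit escalation branch reflects the "semi-automated" reality: when verification fails, the workflow hands back to a human instead of retrying blindly.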

AIOps Integration: AI/ML platforms analyze historical FMS and PM data to predict failures before they occur, correlate seemingly unrelated events, and recommend remediation actions automatically.

Connection to BSS

SLA Credits

PM detects performance degradation that breaches an SLA → BSS auto-credits affected customers

Customer Notifications

Fault correlation identifies impacted customers → BSS sends proactive outage alerts via CRM

Reporting

PM and assurance data feed customer-facing SLA dashboards and reporting platforms via northbound APIs

Revenue Assurance

FMS tracks service downtime to verify billing accuracy and prevent revenue leakage

Key Terms You Must Know

Fault Management (FMS)
Detects and resolves network failures

Performance Management (PM)
Measures network quality over time

Alarm Correlation
Reducing thousands of alarms to root cause

Alarm Deduplication
Collapsing identical alarms into one with a counter

Rate Limiting / Storm Control
Capping alarms per device per time window to avoid flooding

Root Cause Analysis (RCA)
Identifying underlying failure (partially automated)

Alarm Storm
Hundreds/thousands of alarms from a single root cause

Flapping Detection
Treating rapid interface state changes as a single problem

Threshold Crossing Alert (TCA)
Alert when KPI exceeds defined threshold

PRB Utilisation
Radio capacity usage in LTE/5G NR (Physical Resource Block)

MTTR
Mean Time To Repair/Restore - FMS metric

T2R
Trouble to Resolve - time from detection to resolution

AIOps
AI/ML for fault prediction and correlation

Active Monitoring
Synthetic probes for baseline measurement

Passive Monitoring
Real traffic metrics collection

Closed-Loop Assurance
Automated or semi-automated remediation workflows

Common Questions

Q1. What is the difference between Fault Management and Performance Management?

Fault Management detects failures (binary: working/broken). Performance Management measures degradation over time (latency, throughput, packet loss). A network can have performance issues without any fault.

Q2. How are duplicate alarms limited so only a few engineers are engaged?

Five techniques: Deduplication (same alarm becomes one with counter), Suppression (root cause hides child alarms), Throttling (cap alarms per device per time window), Correlation (100 alarms → 1 root cause), Flapping detection (rapid state changes treated as a single problem). Result: 10,000 events → 4 alarms → 2-3 engineers.

Q3. What is the difference between deduplication and suppression?

Deduplication collapses identical alarms from the same device into one with a counter. Suppression hides downstream (child) alarms when a root cause alarm is identified. Dedup handles duplicates; suppression handles cause-and-effect relationships.

Q4. What is flapping detection and why is it important?

Flapping detection identifies when an interface or device rapidly alternates between UP and DOWN states. Instead of generating thousands of alarms, it treats the condition as a single problem and suppresses further notifications for a configured period.

Q5. Are alarm rules configurable at business or technical level?

Both. Technical rules (correlation patterns, suppression, time windows) are set by OSS engineers. Business rules (severity mapping, notification policies, VIP customer handling) are set by NOC managers.

Q6. What is alarm correlation and why is it important?

Alarm correlation reduces thousands of alarms to a single root cause. Without it, NOC operators face alarm storms and cannot identify the actual failure.

Q7. Is Root Cause Analysis (RCA) automated or manual?

Both. Simple faults are fully automated. Medium complexity is semi-automated (system suggests, human confirms). Complex, cross-domain issues are human-led with AI assistance.

Q8. Is Performance Management end-to-end automated?

Mostly yes on data collection and reporting. Humans are still needed for setting thresholds, interpreting anomalies, and deciding actions. "Plumbing is automated; interpretation is human."

Q9. What is a threshold crossing alert (TCA)?

A TCA is an alert raised when a KPI crosses a configured threshold (for example, PRB utilisation rising above 85%). Depending on configuration, a TCA is raised as a warning (soft) or as an actionable alarm (hard).

Q10. What is the difference between active and passive performance monitoring?

Passive monitoring collects metrics from real traffic. Active monitoring generates synthetic probes to measure baseline performance. Both are used together for complete visibility.

Q11. How do FMS and PM enable closed-loop assurance?

PM detects degradation → triggers policy and orchestration for auto-scaling or rerouting → FMS verifies correction and confirms no new alarms. This enables automated or semi-automated remediation workflows.

📌 Key Takeaways:

Fault Management (FMS) detects failures - alarms, correlation, suppression, RCA, tickets. Goal: minimize MTTR/T2R.
Performance Management (PM) measures quality - throughput, latency, packet loss, utilisation, SLAs.
Five alarm limiting techniques: Deduplication, Suppression, Throttling, Correlation, Flapping detection → 10,000 events → 4 alarms → 2-3 engineers
Deduplication collapses identical alarms with a counter; Suppression hides child alarms when root cause found
Flapping detection prevents interface state change storms by treating oscillation as a single problem
RCA automation: Simple = auto, Medium = semi-auto, Complex = human+AI
PM automation: Data collection, ingestion, KPI calc = fully automated. Threshold setting = human. Anomaly investigation = semi-automated.
Fault is binary (broken/working). Performance is degradation without failure.
Category describes issue nature; Severity describes operational impact.
Active vs passive monitoring - synthetic probes vs real traffic metrics.
Closed-loop assurance enables automated or semi-automated remediation workflows based on FMS/PM data.
FMS + PM together enable predictive maintenance and complete operational visibility.
