Metrics - Secure MCP Gateway

Overview

The Secure MCP Gateway exports comprehensive metrics via OpenTelemetry to Prometheus. These metrics provide insights into:

Operation Performance: Request rates, latencies, success/failure rates
Cache Efficiency: Hit/miss ratios, cache sizes
Security Events: Guardrail violations, blocked requests, PII redactions
Resource Usage: Active sessions, users, timeout operations
System Health: Authentication success/failure, error rates

Metrics Architecture

┌─────────────────────────────────────┐
│ Secure MCP Gateway                  │
│  ├── Tool Execution Metrics         │
│  ├── Cache Metrics                  │
│  ├── Guardrail Metrics              │
│  ├── Auth Metrics                   │
│  └── Timeout Metrics                │
└─────────────┬───────────────────────┘
              │ OTLP gRPC/HTTP
              ▼
┌─────────────────────────────────────┐
│ OpenTelemetry Collector             │
│  ├── Batch Processor                │
│  └── Prometheus Exporter            │
└─────────────┬───────────────────────┘
              │ Port 8889
              ▼
┌─────────────────────────────────────┐
│ Prometheus                          │
│  ├── Scrape Config                  │
│  ├── TSDB Storage                   │
│  └── PromQL Query Engine            │
└─────────────┬───────────────────────┘
              │ HTTP API
              ▼
┌─────────────────────────────────────┐
│ Grafana Dashboards                  │
│  ├── Gateway Metrics Dashboard      │
│  ├── OpenTelemetry Dashboard        │
│  └── Custom Panels                  │
└─────────────────────────────────────┘

Available Metrics

Operation Metrics

Tool Call Counters

enkrypt_tool_calls_total

Counter

Total number of tool invocations across all servers.
Labels: server_name, tool_name, project_id
Usage: Track overall tool usage and identify popular tools

enkrypt_tool_call_success_total

Counter

Total number of successful tool calls.
Labels: server_name, tool_name
Usage: Calculate success rates, identify reliable tools

enkrypt_tool_call_failure_total

Counter

Total number of failed tool calls (e.g., server errors, timeouts).
Labels: server_name, tool_name, error_type
Usage: Monitor error rates, set up alerts for high failure rates

enkrypt_tool_call_error_counter

Counter

Total number of tool call errors (exceptions, crashes).
Labels: server_name, error_type
Usage: Track critical errors requiring immediate attention

Tool Call Latency

enkrypt_tool_call_duration_seconds

Histogram

Duration of tool calls in seconds. Percentiles (p50, p95, p99) are not stored directly; they are estimated from the buckets with histogram_quantile.
Labels: server_name, tool_name
Buckets: 0.1, 0.5, 1.0, 2.0, 5.0, 10.0, +Inf
Usage: Identify slow tools, set SLOs, detect performance degradation

Example PromQL Queries:

# Average tool execution time
rate(enkrypt_tool_call_duration_seconds_sum[5m]) 
  / rate(enkrypt_tool_call_duration_seconds_count[5m])

# 95th percentile latency
histogram_quantile(0.95, 
  rate(enkrypt_tool_call_duration_seconds_bucket[5m])
)

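# Tool call success rate (aggregated so both sides carry the same labels)
sum by (server_name, tool_name) (rate(enkrypt_tool_call_success_total[5m]))
  / sum by (server_name, tool_name) (rate(enkrypt_tool_calls_total[5m]))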
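To make the percentile queries less opaque, the following is a minimal Python sketch of how a quantile can be estimated from cumulative histogram buckets by linear interpolation, which is the approach histogram_quantile takes. The bucket bounds mirror the ones documented for enkrypt_tool_call_duration_seconds; the counts and the function name estimate_quantile are illustrative assumptions, not gateway code.

```python
import math

# Cumulative counts keyed by upper bound ("le"). The bounds come from the
# documented buckets; the counts are invented for illustration.
BUCKETS = [
    (0.1, 40.0),
    (0.5, 70.0),
    (1.0, 85.0),
    (2.0, 93.0),
    (5.0, 98.0),
    (10.0, 100.0),
    (math.inf, 100.0),
]


def estimate_quantile(q, buckets):
    """Estimate quantile q (0 < q <= 1) from cumulative (upper_bound, count) pairs."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for index, (upper, cumulative) in enumerate(buckets):
        if cumulative >= rank:
            break
    if math.isinf(upper):
        # The quantile lies beyond the last finite bound; report that bound.
        return buckets[-2][0]
    lower = buckets[index - 1][0] if index > 0 else 0.0
    below = buckets[index - 1][1] if index > 0 else 0.0
    in_bucket = cumulative - below
    if in_bucket == 0:
        return lower
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / in_bucket
```

With these sample counts, the p95 estimate falls inside the 2.0 to 5.0 second bucket (about 3.2 s). The estimate can never be more precise than the bucket boundaries allow, which is worth keeping in mind when choosing SLO thresholds.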

Server Discovery

enkrypt_list_all_servers_calls

Counter

Number of times enkrypt_list_all_servers was called.
Labels: user_id, project_id
Usage: Track server listing frequency

enkrypt_servers_discovered

Counter

Total number of servers discovered with tools.
Labels: mcp_config_id
Usage: Monitor server discovery operations

Cache Metrics

enkrypt_cache_hits_total

Counter

Total number of cache hits (data found in cache).
Labels: cache_type (tools, gateway_config, server_config)
Usage: Monitor cache effectiveness

enkrypt_cache_misses_total

Counter

Total number of cache misses (data not in cache, fetch required).
Labels: cache_type
Usage: Identify cache tuning opportunities

Example PromQL Queries:

# Cache hit ratio
rate(enkrypt_cache_hits_total[5m]) 
  / (rate(enkrypt_cache_hits_total[5m]) + rate(enkrypt_cache_misses_total[5m]))

# Cache miss rate by type
rate(enkrypt_cache_misses_total[5m])

Security Metrics

Guardrail Violations

enkrypt_guardrail_violations_total

Counter

Total number of guardrail violations detected.
Labels: violation_type, server_name, detector
Usage: Monitor security posture, detect attack patterns

enkrypt_input_guardrail_violations_total

Counter

Guardrail violations on input (before sending to server).
Labels: violation_type, server_name
Usage: Track input validation issues

enkrypt_output_guardrail_violations_total

Counter

Guardrail violations on output (after receiving from server).
Labels: violation_type, server_name
Usage: Monitor response quality and safety

enkrypt_relevancy_violations_total

Counter

Relevancy check violations (response not relevant to input).
Labels: server_name, tool_name

enkrypt_adherence_violations_total

Counter

Adherence check violations (response doesn’t follow instructions).
Labels: server_name, tool_name

enkrypt_hallucination_violations_total

Counter

Hallucination detection violations.
Labels: server_name, tool_name

Blocked Requests

enkrypt_tool_call_blocked_total

Counter

Total number of tool calls blocked by guardrails.
Labels: server_name, tool_name, reason
Usage: Monitor security enforcement, identify attack attempts

enkrypt_pii_redactions_total

Counter

Total number of PII redaction operations performed.
Labels: server_name, pii_type
Usage: Track PII protection, ensure compliance

Guardrail API Performance

enkrypt_api_requests_total

Counter

Total number of API requests to the guardrail service.
Labels: endpoint, status_code
Usage: Monitor guardrail service usage

enkrypt_api_request_duration_seconds

Histogram

Duration of guardrail API requests in seconds.
Labels: endpoint
Buckets: 0.1, 0.5, 1.0, 2.0, 5.0, 10.0, +Inf
Usage: Monitor guardrail latency, set alerts for slow checks

Example PromQL Queries:

# Guardrail violation rate
rate(enkrypt_guardrail_violations_total[5m])

# Blocked request rate
rate(enkrypt_tool_call_blocked_total[5m])

# PII redaction frequency
rate(enkrypt_pii_redactions_total[5m])

# Guardrail API latency
histogram_quantile(0.95, 
  rate(enkrypt_api_request_duration_seconds_bucket[5m])
)

Authentication Metrics

enkrypt_auth_success_total

Counter

Total number of successful authentications.
Labels: auth_provider, project_id
Usage: Monitor authentication activity

enkrypt_auth_failure_total

Counter

Total number of failed authentication attempts.
Labels: auth_provider, failure_reason
Usage: Detect unauthorized access attempts, brute force attacks

enkrypt_active_sessions

UpDownCounter

Current number of active sessions.
Usage: Monitor concurrent connections

enkrypt_active_users

UpDownCounter

Current number of active users.
Usage: Track user concurrency

Example PromQL Queries:

# Authentication failure rate
rate(enkrypt_auth_failure_total[5m])

# Current active sessions
enkrypt_active_sessions

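# Auth success rate (aggregated by auth_provider so both counters match)
sum by (auth_provider) (rate(enkrypt_auth_success_total[5m]))
  / (sum by (auth_provider) (rate(enkrypt_auth_success_total[5m]))
     + sum by (auth_provider) (rate(enkrypt_auth_failure_total[5m])))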
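A note on combining these counters: PromQL binary operators pair series whose label sets are identical unless an aggregation or an on()/ignoring() modifier is used. Since enkrypt_auth_success_total carries project_id and enkrypt_auth_failure_total carries failure_reason, a plain ratio between them matches nothing. The Python sketch below simulates that matching rule; the helper names and label values are hypothetical.

```python
def labelset(**labels):
    """Hashable label set so a series can be used as a dictionary key."""
    return frozenset(labels.items())


def divide_one_to_one(numerator, denominator):
    """Divide only series whose label sets are identical on both sides.

    PromQL would yield +Inf or NaN for a zero denominator; this sketch
    simply skips those pairs.
    """
    return {
        labels: value / denominator[labels]
        for labels, value in numerator.items()
        if labels in denominator and denominator[labels] != 0
    }


def sum_by(series, keep):
    """Sum series values, keeping only the label names listed in keep."""
    totals = {}
    for labels, value in series.items():
        key = frozenset((name, val) for name, val in labels if name in keep)
        totals[key] = totals.get(key, 0.0) + value
    return totals


# Hypothetical per-second rates; label values are illustrative only.
success = {
    labelset(auth_provider="oauth", project_id="p1"): 6.0,
    labelset(auth_provider="oauth", project_id="p2"): 3.0,
}
failure = {
    labelset(auth_provider="oauth", failure_reason="expired_token"): 1.0,
}

# No label set appears on both sides, so nothing is returned.
unmatched = divide_one_to_one(failure, success)

# Aggregating both sides to the shared label makes the series line up.
by_provider = divide_one_to_one(
    sum_by(failure, {"auth_provider"}),
    sum_by(success, {"auth_provider"}),
)
```

The same consideration applies to any ratio between two metrics in this document whose label lists differ, such as failures (error_type) versus total calls (project_id).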

Timeout Management Metrics

enkrypt_timeout_operations_total

Counter

Total number of timeout operations tracked.
Labels: operation_type
Usage: Monitor timeout tracking coverage

enkrypt_timeout_operations_successful

Counter

Number of operations completed successfully before timeout.
Labels: operation_type

enkrypt_timeout_operations_timed_out

Counter

Number of operations that exceeded timeout threshold.
Labels: operation_type
Usage: Set alerts for high timeout rates

enkrypt_timeout_operations_cancelled

Counter

Number of operations that were cancelled.
Labels: operation_type

Timeout Escalations

enkrypt_timeout_escalation_warn

Counter

Number of timeout escalation warnings (>80% of timeout).
Labels: operation_type
Usage: Early warning for slow operations

enkrypt_timeout_escalation_timeout

Counter

Number of operations that reached timeout threshold.
Labels: operation_type

enkrypt_timeout_escalation_fail

Counter

Number of operations that exceeded timeout and failed.
Labels: operation_type

Timeout Performance

enkrypt_timeout_operation_duration_seconds

Histogram

Duration of timeout-managed operations in seconds.
Labels: operation_type
Buckets: 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0, +Inf

enkrypt_timeout_active_operations

UpDownCounter

Number of currently active timeout operations.
Usage: Monitor concurrent operations

Example PromQL Queries:

# Timeout rate by operation type
rate(enkrypt_timeout_operations_timed_out[5m]) 
  / rate(enkrypt_timeout_operations_total[5m])

# Operations approaching timeout
rate(enkrypt_timeout_escalation_warn[5m])

# Average operation duration
rate(enkrypt_timeout_operation_duration_seconds_sum[5m]) 
  / rate(enkrypt_timeout_operation_duration_seconds_count[5m])
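The warn and timeout escalation counters can be read as thresholds on the fraction of the timeout budget already used. The sketch below illustrates that reading with the documented 80% warning mark. It is an assumption about the semantics, not the gateway's implementation, and it leaves out the fail stage because its threshold is not stated here.

```python
def escalation_level(elapsed_seconds, timeout_seconds, warn_ratio=0.8):
    """Classify elapsed time against a timeout budget.

    warn_ratio defaults to the documented 80% warning mark. The function
    name and return values are hypothetical.
    """
    if timeout_seconds <= 0:
        raise ValueError("timeout_seconds must be positive")
    used = elapsed_seconds / timeout_seconds
    if used >= 1.0:
        return "timeout"
    if used > warn_ratio:
        return "warn"
    return "ok"
```

Under this reading, a rising enkrypt_timeout_escalation_warn rate with a flat enkrypt_timeout_escalation_timeout rate suggests operations are slowing down but still finishing within budget.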

Metrics Implementation

Creating Metrics

Metrics are created during OpenTelemetry provider initialization.
Location: src/secure_mcp_gateway/plugins/telemetry/opentelemetry_provider.py:373

def _create_metrics(self):
    """Create all metrics."""
    # Counters
    self.tool_call_counter = self._meter.create_counter(
        name="enkrypt_tool_calls_total",
        description="Total number of tool calls",
        unit="1",
    )
    
    # Histograms
    self.tool_call_duration = self._meter.create_histogram(
        name="enkrypt_tool_call_duration_seconds",
        description="Duration of tool calls in seconds",
        unit="s",
    )
    
    # Gauges (UpDownCounter)
    self.active_sessions_gauge = self._meter.create_up_down_counter(
        "enkrypt_active_sessions",
        description="Current active sessions",
        unit="1"
    )

Recording Metrics

Metrics are recorded throughout the gateway:

from secure_mcp_gateway.plugins.telemetry import get_telemetry_config_manager

telemetry_manager = get_telemetry_config_manager()

# Increment counter
telemetry_manager.tool_call_counter.add(
    1,
    attributes={
        "server_name": "github_server",
        "tool_name": "create_issue",
        "project_id": project_id
    }
)

# Record histogram
telemetry_manager.tool_call_duration.record(
    duration_seconds,
    attributes={
        "server_name": server_name,
        "tool_name": tool_name
    }
)

# Update gauge
telemetry_manager.active_sessions_gauge.add(1)  # Increment
telemetry_manager.active_sessions_gauge.add(-1)  # Decrement
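The recording calls above can be wrapped in a small helper so that duration and outcome are always recorded together. The following is a minimal sketch under stated assumptions: timed_tool_call is a hypothetical helper, and InMemoryInstrument is a stand-in that only mimics the add()/record() methods of OpenTelemetry instruments so the example runs without the SDK. In the gateway, the real instruments from the telemetry manager would be passed instead.

```python
import time
from contextlib import contextmanager


class InMemoryInstrument:
    """Stand-in exposing add()/record() like an OpenTelemetry instrument."""

    def __init__(self):
        self.points = []

    def add(self, amount, attributes=None):
        self.points.append((amount, dict(attributes or {})))

    # Histograms use record() with the same (amount, attributes) shape.
    record = add


@contextmanager
def timed_tool_call(duration, success, failure, server_name, tool_name):
    """Record duration plus a success or failure count for one tool call."""
    attributes = {"server_name": server_name, "tool_name": tool_name}
    start = time.perf_counter()
    try:
        yield
    except Exception as exc:
        # error_type is taken from the exception class name (an assumption).
        failure.add(1, attributes={**attributes, "error_type": type(exc).__name__})
        raise
    else:
        success.add(1, attributes=attributes)
    finally:
        # Runs on both paths, so failed calls still contribute to latency.
        duration.record(time.perf_counter() - start, attributes=attributes)
```

Usage would look like `with timed_tool_call(duration, success, failure, "github_server", "create_issue"): ...` around the actual tool invocation. Recording the duration in the finally clause keeps slow failures visible in the latency histogram.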

Prometheus Configuration

Scrape Configuration

Location: infra/prometheus/prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8889']

Accessing Prometheus

UI: http://localhost:9090
API: http://localhost:9090/api/v1/query
Targets: http://localhost:9090/targets
Config: http://localhost:9090/config
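The query API endpoint listed above can also be called from scripts. Below is a minimal sketch using only the Python standard library; it assumes Prometheus is reachable at the local URL shown above and that the expression returns an instant vector. The helper names are illustrative.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # Local endpoint listed above


def build_query_url(promql, base_url=PROMETHEUS_URL):
    """Build an instant-query URL with the expression URL-encoded."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_instant_vector(payload):
    """Return (labels, value) pairs from an instant-query JSON response."""
    if payload.get("status") != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample value is [timestamp, "value-as-string"].
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


def instant_query(promql, base_url=PROMETHEUS_URL, timeout=5.0):
    """Run an instant query against Prometheus and parse the result."""
    url = build_query_url(promql, base_url)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_instant_vector(json.load(response))
```

For example, `instant_query("sum(rate(enkrypt_tool_calls_total[5m]))")` would return a single pair with an empty label set when the stack is running.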

Example Queries

In Prometheus UI (Expression Browser):

# Total tool calls in last 5 minutes
sum(increase(enkrypt_tool_calls_total[5m]))

# Top 5 most used tools
topk(5, sum by (tool_name) (enkrypt_tool_calls_total))

# Error rate percentage (aggregated so both sides carry the same labels)
(sum(rate(enkrypt_tool_call_failure_total[5m]))
  / sum(rate(enkrypt_tool_calls_total[5m]))) * 100

# Average latency by server
avg by (server_name) (
  rate(enkrypt_tool_call_duration_seconds_sum[5m]) 
  / rate(enkrypt_tool_call_duration_seconds_count[5m])
)

Grafana Dashboards

Pre-built Dashboards

The gateway includes two pre-configured Grafana dashboards:

1. Gateway Metrics Dashboard

Location: infra/grafana/provisioning/dashboards/gateway-metrics.json
Panels:

Tool Call Rate (graph)
Tool Call Success Rate (gauge)
Tool Call Latency p95 (graph)
Cache Hit Ratio (gauge)
Guardrail Violations (graph)
Active Sessions (gauge)
Error Rate (graph)

2. OpenTelemetry Gateway Metrics Dashboard

Location: infra/grafana/provisioning/dashboards/OpenTelemetry Gateway Metrics.json
Panels:

Request Volume
Response Times (percentiles)
Error Rates by Type
Throughput
System Health

Accessing Grafana

Open http://localhost:3000
No login required (anonymous admin mode)
Navigate to Dashboards → Browse
Select “Gateway Metrics” or “OpenTelemetry Gateway Metrics”

Creating Custom Dashboards

Example Panel (Tool Call Rate):

{
  "type": "graph",
  "title": "Tool Call Rate",
  "targets": [
    {
      "expr": "rate(enkrypt_tool_calls_total[5m])",
      "legendFormat": "{{server_name}} - {{tool_name}}"
    }
  ]
}

Example Panel (Cache Hit Ratio):

{
  "type": "gauge",
  "title": "Cache Hit Ratio",
  "targets": [
    {
      "expr": "rate(enkrypt_cache_hits_total[5m]) / (rate(enkrypt_cache_hits_total[5m]) + rate(enkrypt_cache_misses_total[5m]))",
      "legendFormat": "Hit Ratio"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "max": 1,
      "min": 0,
      "unit": "percentunit"
    }
  }
}

Alerting

Prometheus Alerts

Create alert rules in a Prometheus rules file (for example alerting_rules.yml) and reference that file under rule_files in prometheus.yml:

groups:
  - name: gateway_alerts
    interval: 30s
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(enkrypt_tool_call_failure_total[5m]))
          / sum(rate(enkrypt_tool_calls_total[5m])) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"
      
      # Slow tool execution
      - alert: SlowToolExecution
        expr: |
          histogram_quantile(0.95, 
            rate(enkrypt_tool_call_duration_seconds_bucket[5m])
          ) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Slow tool execution (p95 > 5s)"
      
      # Cache miss rate too high
      - alert: HighCacheMissRate
        expr: |
          rate(enkrypt_cache_misses_total[5m]) 
          / (rate(enkrypt_cache_hits_total[5m]) + rate(enkrypt_cache_misses_total[5m])) > 0.5
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "High cache miss rate (>50%)"
      
      # Security: High guardrail violation rate
      - alert: HighGuardrailViolations
        expr: rate(enkrypt_guardrail_violations_total[5m]) > 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High guardrail violation rate"
          description: "{{ $value }} violations per second"
      
      # Authentication failures
      - alert: HighAuthFailureRate
        expr: |
          sum(rate(enkrypt_auth_failure_total[5m]))
          / (sum(rate(enkrypt_auth_success_total[5m])) + sum(rate(enkrypt_auth_failure_total[5m]))) > 0.2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High authentication failure rate (>20%)"
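The `for` clause in these rules delays firing until the condition has held continuously. The sketch below is a simplified model of that behaviour, counting evaluations rather than tracking timestamps per label set as Prometheus does. With the group interval of 30s used above, `for: 5m` corresponds to 10 evaluation intervals. The function name is illustrative.

```python
def alert_states(condition_by_eval, for_evals):
    """Return the alert state after each rule evaluation.

    condition_by_eval: booleans, one per evaluation of the alert expression.
    for_evals: the `for` duration divided by the evaluation interval.
    """
    states = []
    consecutive = 0
    for condition_true in condition_by_eval:
        # Any evaluation where the condition is false resets the timer.
        consecutive = consecutive + 1 if condition_true else 0
        if consecutive == 0:
            states.append("inactive")
        elif consecutive > for_evals:
            states.append("firing")
        else:
            states.append("pending")
    return states
```

A single evaluation where the expression is false resets the alert to inactive, so brief spikes shorter than the `for` duration never page anyone.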

Grafana Alerts

Create alerts in Grafana dashboards:

Edit panel → Alert tab
Create alert rule
Configure notification channels (Slack, PagerDuty, email)

Best Practices

Monitor Key SLIs

Focus on Service Level Indicators:

Availability: Success rate > 99.9%
Latency: p95 < 500ms, p99 < 1s
Error Rate: < 0.1%
Cache Hit Ratio: > 80%
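One way to make the availability target actionable is to express it as an error budget. The sketch below shows the arithmetic; the function names and request counts are illustrative assumptions.

```python
def error_budget(total_requests, slo_target):
    """Number of requests that may fail while still meeting the SLO."""
    return total_requests * (1.0 - slo_target)


def budget_remaining(total_requests, failed_requests, slo_target):
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = error_budget(total_requests, slo_target)
    if budget == 0:
        return 0.0 if failed_requests else 1.0
    return 1.0 - failed_requests / budget
```

At a 99.9% target, one million tool calls leave room for roughly 1,000 failures; 250 failures would consume a quarter of that budget.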

Set Up Alerts

Configure alerts for:

High error rates
Slow operations (p95 > threshold)
Security events (guardrail violations)
Resource exhaustion (high active sessions)

Use Labels Wisely

Avoid high-cardinality labels:

✅ Good: server_name, tool_name, project_id
❌ Bad: user_id, request_id, timestamp

High cardinality increases memory usage and query time.
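The cost of a label can be estimated before adding it: the worst-case number of time series for a metric is the product of the distinct values of each label. A minimal sketch, with hypothetical value counts:

```python
from math import prod


def max_series(label_value_counts):
    """Upper bound on series for one metric: product of label value counts."""
    return prod(label_value_counts.values())


def max_histogram_series(label_value_counts, bucket_count):
    """Histograms export one series per bucket plus _sum and _count."""
    return max_series(label_value_counts) * (bucket_count + 2)


# Hypothetical deployment sizes, for illustration only.
low = {"server_name": 10, "tool_name": 20, "project_id": 5}
high = {**low, "user_id": 10_000}
```

With these numbers the counter stays at no more than 1,000 series, while adding user_id raises the bound to 10 million. For a histogram with the seven documented buckets (including +Inf), each label combination costs nine series.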

Analyze Trends

Use histograms for distribution analysis:

# Tool call latency distribution
histogram_quantile(0.5, rate(enkrypt_tool_call_duration_seconds_bucket[5m]))  # p50
histogram_quantile(0.95, rate(enkrypt_tool_call_duration_seconds_bucket[5m])) # p95
histogram_quantile(0.99, rate(enkrypt_tool_call_duration_seconds_bucket[5m])) # p99

Monitor Security Metrics

Track security events:

Guardrail violations by type
Blocked requests over time
PII redaction frequency
Authentication failures

Set up alerts for anomalies.

Troubleshooting

Metrics Not Appearing in Prometheus

Check collector metrics endpoint:
```
curl http://localhost:8889/metrics
```
Verify Prometheus scrape targets: http://localhost:9090/targets

Check collector logs:

docker logs otel-collector | grep prometheus
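The output of the collector's metrics endpoint can be checked programmatically. The sketch below parses Prometheus text exposition output and lists the gateway metric names it contains. The function name is an assumption, and the parser only handles what is needed to extract names.

```python
def enkrypt_metric_names(exposition_text, prefix="enkrypt_"):
    """Return metric names starting with prefix found in exposition text."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and # HELP / # TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        # Sample lines look like: name{label="value"} 42
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith(prefix):
            names.add(name)
    return names
```

An empty result while other collector metrics are present suggests the gateway is not exporting to the collector, rather than a Prometheus scrape problem.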

High Cardinality Issues

Symptom: Prometheus using excessive memory
Solution: Reduce label cardinality

# Before (high cardinality)
metric.add(1, attributes={"user_id": user_id})  # ❌

# After (low cardinality)
metric.add(1, attributes={"project_id": project_id})  # ✅

Dashboards Not Loading

Check Grafana logs:
```
docker logs grafana
```
Verify datasource connection: Grafana → Connections → Data sources → Prometheus → Test

Check dashboard JSON:

cat infra/grafana/provisioning/dashboards/gateway-metrics.json | jq

Next Steps

Logging

Configure structured logging and log aggregation

OpenTelemetry Setup

Set up OTLP export and distributed tracing

Overview

Return to observability overview

API Reference

Explore the monitoring API