Overview

The Secure MCP Gateway exports metrics over OTLP to an OpenTelemetry Collector, which exposes them for Prometheus to scrape. These metrics provide insight into:
  • Operation Performance: Request rates, latencies, success/failure rates
  • Cache Efficiency: Hit/miss ratios, cache sizes
  • Security Events: Guardrail violations, blocked requests, PII redactions
  • Resource Usage: Active sessions, users, timeout operations
  • System Health: Authentication success/failure, error rates

Metrics Architecture

┌─────────────────────────────────────┐
│ Secure MCP Gateway                  │
│  ├── Tool Execution Metrics         │
│  ├── Cache Metrics                  │
│  ├── Guardrail Metrics              │
│  ├── Auth Metrics                   │
│  └── Timeout Metrics                │
└─────────────┬───────────────────────┘
              │ OTLP gRPC/HTTP

┌─────────────────────────────────────┐
│ OpenTelemetry Collector             │
│  ├── Batch Processor                │
│  └── Prometheus Exporter            │
└─────────────┬───────────────────────┘
              │ Port 8889

┌─────────────────────────────────────┐
│ Prometheus                          │
│  ├── Scrape Config                  │
│  ├── TSDB Storage                   │
│  └── PromQL Query Engine            │
└─────────────┬───────────────────────┘
              │ HTTP API

┌─────────────────────────────────────┐
│ Grafana Dashboards                  │
│  ├── Gateway Metrics Dashboard      │
│  ├── OpenTelemetry Dashboard        │
│  └── Custom Panels                  │
└─────────────────────────────────────┘

Available Metrics

Operation Metrics

Tool Call Counters

enkrypt_tool_calls_total
Counter
Total number of tool invocations across all servers.
Labels: server_name, tool_name, project_id
Usage: Track overall tool usage and identify popular tools
enkrypt_tool_call_success_total
Counter
Total number of successful tool calls.
Labels: server_name, tool_name
Usage: Calculate success rates, identify reliable tools
enkrypt_tool_call_failure_total
Counter
Total number of failed tool calls (e.g., server errors, timeouts).
Labels: server_name, tool_name, error_type
Usage: Monitor error rates, set up alerts for high failure rates
enkrypt_tool_call_error_counter
Counter
Total number of tool call errors (exceptions, crashes).
Labels: server_name, error_type
Usage: Track critical errors requiring immediate attention

Tool Call Latency

enkrypt_tool_call_duration_seconds
Histogram
Duration of tool calls in seconds. Percentiles (p50, p95, p99) are not stored directly; they are estimated from the bucket counts with histogram_quantile.
Labels: server_name, tool_name
Buckets: 0.1, 0.5, 1.0, 2.0, 5.0, 10.0, +Inf
Usage: Identify slow tools, set SLOs, detect performance degradation
Example PromQL Queries:
# Average tool execution time
rate(enkrypt_tool_call_duration_seconds_sum[5m]) 
  / rate(enkrypt_tool_call_duration_seconds_count[5m])

# 95th percentile latency
histogram_quantile(0.95, 
  rate(enkrypt_tool_call_duration_seconds_bucket[5m])
)

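# Tool call success rate per tool
# (aggregate both sides: the two counters carry different label sets,
#  so a bare division would match no series)
sum by (server_name, tool_name) (rate(enkrypt_tool_call_success_total[5m]))
  / sum by (server_name, tool_name) (rate(enkrypt_tool_calls_total[5m]))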
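
To make the p95 query more concrete, the sketch below approximates how a quantile can be estimated from cumulative bucket counts by linear interpolation inside the matching bucket. It is a simplified illustration rather than Prometheus's actual implementation; the function name mirrors the PromQL function only for readability, and the sample counts in SAMPLE_BUCKETS are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # the first bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Invented cumulative counts for the documented bucket boundaries.
SAMPLE_BUCKETS = [
    (0.1, 50), (0.5, 80), (1.0, 90), (2.0, 96),
    (5.0, 99), (10.0, 100), (math.inf, 100),
]

print(f"p95 = {histogram_quantile(0.95, SAMPLE_BUCKETS):.2f}s")
```

Because the estimate is interpolated, its precision depends on how closely the bucket boundaries bracket the latencies that matter for the SLO.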

Server Discovery

enkrypt_list_all_servers_calls
Counter
Number of times enkrypt_list_all_servers was called.
Labels: user_id, project_id
Usage: Track server listing frequency
enkrypt_servers_discovered
Counter
Total number of servers discovered with tools.
Labels: mcp_config_id
Usage: Monitor server discovery operations

Cache Metrics

enkrypt_cache_hits_total
Counter
Total number of cache hits (data found in cache).
Labels: cache_type (tools, gateway_config, server_config)
Usage: Monitor cache effectiveness
enkrypt_cache_misses_total
Counter
Total number of cache misses (data not in cache, fetch required).
Labels: cache_type
Usage: Identify cache tuning opportunities
Example PromQL Queries:
# Cache hit ratio
rate(enkrypt_cache_hits_total[5m]) 
  / (rate(enkrypt_cache_hits_total[5m]) + rate(enkrypt_cache_misses_total[5m]))

# Cache miss rate by type
sum by (cache_type) (rate(enkrypt_cache_misses_total[5m]))
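
The rate() calls above turn ever-increasing counters into per-second values. As a rough illustration, the sketch below computes a per-second increase from raw (timestamp, value) samples, treats any drop in value as a counter reset, and then derives a hit ratio. Prometheus also extrapolates towards the window boundaries, which is omitted here; the function names and sample values are made up for this example.

```python
def simple_rate(samples):
    """Per-second increase over a list of (timestamp_seconds, counter_value)."""
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A counter that goes down was reset to zero, so the whole new value counts.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


def hit_ratio(hit_rate, miss_rate):
    """Hits divided by all lookups, or None when there was no traffic."""
    total = hit_rate + miss_rate
    return hit_rate / total if total > 0 else None


# Invented samples: the hit counter resets between t=30 and t=45.
hits = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
misses = [(0, 10), (60, 35)]

print(hit_ratio(simple_rate(hits), simple_rate(misses)))
```

Returning None for an idle cache avoids reporting a misleading 0% hit ratio when there were no lookups at all.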

Security Metrics

Guardrail Violations

enkrypt_guardrail_violations_total
Counter
Total number of guardrail violations detected.
Labels: violation_type, server_name, detector
Usage: Monitor security posture, detect attack patterns
enkrypt_input_guardrail_violations_total
Counter
Guardrail violations on input (before sending to server).
Labels: violation_type, server_name
Usage: Track input validation issues
enkrypt_output_guardrail_violations_total
Counter
Guardrail violations on output (after receiving from server).
Labels: violation_type, server_name
Usage: Monitor response quality and safety
enkrypt_relevancy_violations_total
Counter
Relevancy check violations (response not relevant to input).
Labels: server_name, tool_name
enkrypt_adherence_violations_total
Counter
Adherence check violations (response doesn’t follow instructions).
Labels: server_name, tool_name
enkrypt_hallucination_violations_total
Counter
Hallucination detection violations.
Labels: server_name, tool_name

Blocked Requests

enkrypt_tool_call_blocked_total
Counter
Total number of tool calls blocked by guardrails.
Labels: server_name, tool_name, reason
Usage: Monitor security enforcement, identify attack attempts
enkrypt_pii_redactions_total
Counter
Total number of PII redaction operations performed.
Labels: server_name, pii_type
Usage: Track PII protection, ensure compliance

Guardrail API Performance

enkrypt_api_requests_total
Counter
Total number of API requests to the guardrail service.
Labels: endpoint, status_code
Usage: Monitor guardrail service usage
enkrypt_api_request_duration_seconds
Histogram
Duration of guardrail API requests in seconds.
Labels: endpoint
Buckets: 0.1, 0.5, 1.0, 2.0, 5.0, 10.0, +Inf
Usage: Monitor guardrail latency, set alerts for slow checks
Example PromQL Queries:
# Guardrail violation rate
rate(enkrypt_guardrail_violations_total[5m])

# Blocked request rate
rate(enkrypt_tool_call_blocked_total[5m])

# PII redaction frequency
rate(enkrypt_pii_redactions_total[5m])

# Guardrail API latency
histogram_quantile(0.95, 
  rate(enkrypt_api_request_duration_seconds_bucket[5m])
)

Authentication Metrics

enkrypt_auth_success_total
Counter
Total number of successful authentications.
Labels: auth_provider, project_id
Usage: Monitor authentication activity
enkrypt_auth_failure_total
Counter
Total number of failed authentication attempts.
Labels: auth_provider, failure_reason
Usage: Detect unauthorized access attempts, brute force attacks
enkrypt_active_sessions
UpDownCounter
Current number of active sessions.
Usage: Monitor concurrent connections
enkrypt_active_users
UpDownCounter
Current number of active users.
Usage: Track user concurrency
Example PromQL Queries:
# Authentication failure rate
rate(enkrypt_auth_failure_total[5m])

# Current active sessions
enkrypt_active_sessions

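# Auth success rate per provider
# (aggregated because the success and failure counters carry different labels)
sum by (auth_provider) (rate(enkrypt_auth_success_total[5m]))
  / (sum by (auth_provider) (rate(enkrypt_auth_success_total[5m]))
     + sum by (auth_provider) (rate(enkrypt_auth_failure_total[5m])))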

Timeout Management Metrics

enkrypt_timeout_operations_total
Counter
Total number of timeout operations tracked.
Labels: operation_type
Usage: Monitor timeout tracking coverage
enkrypt_timeout_operations_successful
Counter
Number of operations completed successfully before timeout.
Labels: operation_type
enkrypt_timeout_operations_timed_out
Counter
Number of operations that exceeded the timeout threshold.
Labels: operation_type
Usage: Set alerts for high timeout rates
enkrypt_timeout_operations_cancelled
Counter
Number of operations that were cancelled.
Labels: operation_type

Timeout Escalations

enkrypt_timeout_escalation_warn
Counter
Number of timeout escalation warnings (>80% of timeout).
Labels: operation_type
Usage: Early warning for slow operations
enkrypt_timeout_escalation_timeout
Counter
Number of operations that reached the timeout threshold.
Labels: operation_type
enkrypt_timeout_escalation_fail
Counter
Number of operations that exceeded the timeout and failed.
Labels: operation_type
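
One possible reading of the three escalation levels is sketched below: a warning once more than 80% of the timeout has elapsed, a timeout once the threshold is reached, and a failure when an operation that passed the threshold also ended in error. The function name, the failed flag, and the exact boundary conditions are assumptions made for illustration; the gateway's own logic may differ.

```python
def classify_escalation(elapsed, timeout, failed=False, warn_ratio=0.8):
    """Return "warn", "timeout", "fail", or None for an operation.

    `warn_ratio` reflects the documented ">80% of timeout" warning level;
    `failed` is a hypothetical flag meaning the operation ended in error.
    """
    if timeout <= 0:
        raise ValueError("timeout must be positive")
    if elapsed >= timeout:
        return "fail" if failed else "timeout"
    if elapsed > warn_ratio * timeout:
        return "warn"
    return None


for elapsed in (5.0, 8.5, 10.0):
    print(elapsed, classify_escalation(elapsed, timeout=10.0))
```

Under this reading, each returned level would correspond to incrementing the matching enkrypt_timeout_escalation_* counter for the operation's operation_type.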

Timeout Performance

enkrypt_timeout_operation_duration_seconds
Histogram
Duration of timeout-managed operations in seconds.
Labels: operation_type
Buckets: 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0, +Inf
enkrypt_timeout_active_operations
UpDownCounter
Number of currently active timeout operations.
Usage: Monitor concurrent operations
Example PromQL Queries:
# Timeout rate by operation type
rate(enkrypt_timeout_operations_timed_out[5m]) 
  / rate(enkrypt_timeout_operations_total[5m])

# Operations approaching timeout
rate(enkrypt_timeout_escalation_warn[5m])

# Average operation duration
rate(enkrypt_timeout_operation_duration_seconds_sum[5m]) 
  / rate(enkrypt_timeout_operation_duration_seconds_count[5m])

Metrics Implementation

Creating Metrics

Metrics are created during OpenTelemetry provider initialization.
Location: src/secure_mcp_gateway/plugins/telemetry/opentelemetry_provider.py:373
def _create_metrics(self):
    """Create all metrics."""
    # Counters
    self.tool_call_counter = self._meter.create_counter(
        name="enkrypt_tool_calls_total",
        description="Total number of tool calls",
        unit="1",
    )
    
    # Histograms
    self.tool_call_duration = self._meter.create_histogram(
        name="enkrypt_tool_call_duration_seconds",
        description="Duration of tool calls in seconds",
        unit="s",
    )
    
    # Gauges (UpDownCounter)
    self.active_sessions_gauge = self._meter.create_up_down_counter(
        "enkrypt_active_sessions",
        description="Current active sessions",
        unit="1"
    )

Recording Metrics

Metrics are recorded throughout the gateway:
from secure_mcp_gateway.plugins.telemetry import get_telemetry_config_manager

telemetry_manager = get_telemetry_config_manager()

# Increment counter
telemetry_manager.tool_call_counter.add(
    1,
    attributes={
        "server_name": "github_server",
        "tool_name": "create_issue",
        "project_id": project_id
    }
)

# Record histogram
telemetry_manager.tool_call_duration.record(
    duration_seconds,
    attributes={
        "server_name": server_name,
        "tool_name": tool_name
    }
)

# Update gauge
telemetry_manager.active_sessions_gauge.add(1)  # Increment
telemetry_manager.active_sessions_gauge.add(-1)  # Decrement
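
The recording calls above are often wrapped so that the call counter, the success or failure counter, and the duration histogram are always updated together. The following is a minimal sketch of such a wrapper, not the gateway's own code: track_tool_call and InMemoryInstrument are hypothetical names, and the stand-in class only mimics the add()/record() calls so the example runs without OpenTelemetry installed.

```python
import time
from contextlib import contextmanager


class InMemoryInstrument:
    """Hypothetical stand-in exposing the add()/record() calls used above."""

    def __init__(self):
        self.points = []

    def add(self, value, attributes=None):
        self.points.append((value, dict(attributes or {})))

    record = add  # histograms use record(); the stand-in stores both alike


@contextmanager
def track_tool_call(calls, successes, failures, duration,
                    server_name, tool_name, project_id):
    """Count one tool call, time it, and record its outcome."""
    base = {"server_name": server_name, "tool_name": tool_name}
    calls.add(1, attributes={**base, "project_id": project_id})
    start = time.perf_counter()
    try:
        yield
    except Exception as exc:
        # error_type is the exception class name, which keeps cardinality low.
        failures.add(1, attributes={**base, "error_type": type(exc).__name__})
        raise
    else:
        successes.add(1, attributes=base)
    finally:
        duration.record(time.perf_counter() - start, attributes=base)


calls, ok, bad, dur = (InMemoryInstrument() for _ in range(4))
with track_tool_call(calls, ok, bad, dur, "github_server", "create_issue", "proj-1"):
    pass  # the real tool invocation would go here
print(len(calls.points), len(ok.points), len(bad.points), len(dur.points))
```

In a real deployment the four instruments would be the counters and histogram created on the telemetry manager; the exact attribute names on that manager beyond those shown above are not confirmed here.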

Prometheus Configuration

Scrape Configuration

Location: infra/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8889']

Accessing Prometheus

Open http://localhost:9090 (the same host and port used for the scrape targets page in Troubleshooting).

Example Queries

In Prometheus UI (Expression Browser):
# Total tool calls in last 5 minutes
sum(increase(enkrypt_tool_calls_total[5m]))

# Top 5 most used tools
topk(5, sum by (tool_name) (enkrypt_tool_calls_total))

# Error rate percentage
# (sum both sides: the failure and total counters carry different label sets)
(sum(rate(enkrypt_tool_call_failure_total[5m]))
  / sum(rate(enkrypt_tool_calls_total[5m]))) * 100

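# Average latency by server
# (divide summed durations by summed counts, so busy tools weigh more than idle ones)
sum by (server_name) (rate(enkrypt_tool_call_duration_seconds_sum[5m]))
  / sum by (server_name) (rate(enkrypt_tool_call_duration_seconds_count[5m]))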
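
The same expressions can be evaluated from a script through the Prometheus HTTP API instant-query endpoint (/api/v1/query). The sketch below uses only the standard library; the helper names are illustrative, and the base URL http://localhost:9090 is taken from the troubleshooting section and may differ in other deployments.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, expr):
    """Build the instant-query URL for a PromQL expression."""
    query = urllib.parse.urlencode({"query": expr})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def parse_instant_vector(payload):
    """Turn an instant-query response into (labels, float_value) pairs."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector result, got {data['resultType']}")
    # Each sample value is [unix_timestamp, "value as string"].
    return [(series["metric"], float(series["value"][1])) for series in data["result"]]


def query_prometheus(base_url, expr, timeout=5.0):
    """Run an instant query against a reachable Prometheus server."""
    with urllib.request.urlopen(build_query_url(base_url, expr), timeout=timeout) as resp:
        return parse_instant_vector(json.load(resp))


# Example call (requires a running Prometheus):
# query_prometheus("http://localhost:9090",
#                  "topk(5, sum by (tool_name) (enkrypt_tool_calls_total))")
```

Keeping URL construction and response parsing separate from the network call makes both easy to check without a live server.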

Grafana Dashboards

Pre-built Dashboards

The gateway includes two pre-configured Grafana dashboards:

1. Gateway Metrics Dashboard

Location: infra/grafana/provisioning/dashboards/gateway-metrics.json
Panels:
  • Tool Call Rate (graph)
  • Tool Call Success Rate (gauge)
  • Tool Call Latency p95 (graph)
  • Cache Hit Ratio (gauge)
  • Guardrail Violations (graph)
  • Active Sessions (gauge)
  • Error Rate (graph)

2. OpenTelemetry Gateway Metrics Dashboard

Location: infra/grafana/provisioning/dashboards/OpenTelemetry Gateway Metrics.json
Panels:
  • Request Volume
  • Response Times (percentiles)
  • Error Rates by Type
  • Throughput
  • System Health

Accessing Grafana

  1. Open http://localhost:3000
  2. No login required (anonymous admin mode)
  3. Navigate to Dashboards → Browse
  4. Select “Gateway Metrics” or “OpenTelemetry Gateway Metrics”

Creating Custom Dashboards

Example Panel (Tool Call Rate):
{
  "type": "graph",
  "title": "Tool Call Rate",
  "targets": [
    {
      "expr": "rate(enkrypt_tool_calls_total[5m])",
      "legendFormat": "{{server_name}} - {{tool_name}}"
    }
  ]
}
Example Panel (Cache Hit Ratio):
{
  "type": "gauge",
  "title": "Cache Hit Ratio",
  "targets": [
    {
      "expr": "rate(enkrypt_cache_hits_total[5m]) / (rate(enkrypt_cache_hits_total[5m]) + rate(enkrypt_cache_misses_total[5m]))",
      "legendFormat": "Hit Ratio"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "max": 1,
      "min": 0,
      "unit": "percentunit"
    }
  }
}
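
When several panels share this shape, generating the JSON from a small helper can reduce copy-and-paste errors. The sketch below builds only the fields shown in the two examples above; make_panel is a hypothetical helper, and a complete dashboard needs additional fields (such as layout and datasource) that are not covered here.

```python
import json


def make_panel(title, expr, legend, panel_type="graph",
               unit=None, min_value=None, max_value=None):
    """Build a panel dict with the fields used in the examples above."""
    panel = {
        "type": panel_type,
        "title": title,
        "targets": [{"expr": expr, "legendFormat": legend}],
    }
    # Only emit fieldConfig when at least one default is set.
    defaults = {
        key: value
        for key, value in (("unit", unit), ("min", min_value), ("max", max_value))
        if value is not None
    }
    if defaults:
        panel["fieldConfig"] = {"defaults": defaults}
    return panel


hit_ratio_panel = make_panel(
    title="Cache Hit Ratio",
    expr=(
        "rate(enkrypt_cache_hits_total[5m]) / "
        "(rate(enkrypt_cache_hits_total[5m]) + rate(enkrypt_cache_misses_total[5m]))"
    ),
    legend="Hit Ratio",
    panel_type="gauge",
    unit="percentunit",
    min_value=0,
    max_value=1,
)
print(json.dumps(hit_ratio_panel, indent=2))
```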

Alerting

Prometheus Alerts

Create alert rules in Prometheus.
alerting_rules.yml:
groups:
  - name: gateway_alerts
    interval: 30s
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(enkrypt_tool_call_failure_total[5m]))
          / sum(rate(enkrypt_tool_calls_total[5m])) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"
      
      # Slow tool execution
      - alert: SlowToolExecution
        expr: |
          histogram_quantile(0.95, 
            rate(enkrypt_tool_call_duration_seconds_bucket[5m])
          ) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Slow tool execution (p95 > 5s)"
      
      # Cache miss rate too high
      - alert: HighCacheMissRate
        expr: |
          rate(enkrypt_cache_misses_total[5m]) 
          / (rate(enkrypt_cache_hits_total[5m]) + rate(enkrypt_cache_misses_total[5m])) > 0.5
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "High cache miss rate (>50%)"
      
      # Security: High guardrail violation rate
      - alert: HighGuardrailViolations
        expr: rate(enkrypt_guardrail_violations_total[5m]) > 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High guardrail violation rate"
          description: "{{ $value }} violations per second"
      
      # Authentication failures
      - alert: HighAuthFailureRate
        expr: |
          sum(rate(enkrypt_auth_failure_total[5m]))
          / (sum(rate(enkrypt_auth_success_total[5m])) + sum(rate(enkrypt_auth_failure_total[5m]))) > 0.2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High authentication failure rate (>20%)"
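
The for: field in each rule means the expression must stay true across consecutive evaluations for that long before the alert fires; until then it is pending. The sketch below models that behaviour for a single series under simplifying assumptions: alert_states is a hypothetical helper and the evaluation values are invented.

```python
def alert_states(evaluations, threshold, for_seconds):
    """Return the alert state after each (timestamp_seconds, value) evaluation."""
    states = []
    active_since = None  # time of the first evaluation in the current breach
    for ts, value in evaluations:
        if value > threshold:
            if active_since is None:
                active_since = ts
            held = ts - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            # Any evaluation below the threshold resets the pending timer.
            active_since = None
            states.append("inactive")
    return states


# Evaluations every 30s, error ratio threshold 0.1, "for: 1m".
evals = [(0, 0.05), (30, 0.2), (60, 0.2), (90, 0.2), (120, 0.05), (150, 0.2)]
print(alert_states(evals, threshold=0.1, for_seconds=60))
```

A short spike therefore never pages anyone, at the cost of a notification delay of at least the for: duration.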

Grafana Alerts

Create alerts in Grafana dashboards:
  1. Edit panel → Alert tab
  2. Create alert rule
  3. Configure notification channels (Slack, PagerDuty, email)

Best Practices

Focus on Service Level Indicators:
  • Availability: Success rate > 99.9%
  • Latency: p95 < 500ms, p99 < 1s
  • Error Rate: < 0.1%
  • Cache Hit Ratio: > 80%
Configure alerts for:
  • High error rates
  • Slow operations (p95 > threshold)
  • Security events (guardrail violations)
  • Resource exhaustion (high active sessions)
Avoid high-cardinality labels:
  • ✅ Good: server_name, tool_name, project_id
  • ❌ Bad: user_id, request_id, timestamp
High cardinality increases memory usage and query time.
Track security events:
  • Guardrail violations by type
  • Blocked requests over time
  • PII redaction frequency
  • Authentication failures
Set up alerts for anomalies.

Troubleshooting

Metrics Not Appearing in Prometheus

  1. Check collector metrics endpoint:
    curl http://localhost:8889/metrics
    
  2. Verify Prometheus scrape targets: http://localhost:9090/targets
  3. Check collector logs:
    docker logs otel-collector 2>&1 | grep -i prometheus
    

High Cardinality Issues

Symptom: Prometheus using excessive memory
Solution: Reduce label cardinality
# Before (high cardinality)
metric.add(1, attributes={"user_id": user_id})  # ❌

# After (low cardinality)
metric.add(1, attributes={"project_id": project_id})  # ✅
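
A quick way to see why a label such as user_id is costly is to estimate the worst-case number of time series, which is the product of the distinct values of every label. For a histogram, each label combination additionally produces one series per bucket plus the _sum and _count series. The sketch below works through that arithmetic; estimate_series is a hypothetical helper, and the cardinalities are invented.

```python
from math import prod


def estimate_series(label_cardinalities, histogram_buckets=0):
    """Worst-case series count for one metric.

    `label_cardinalities` maps label name to its number of distinct values.
    `histogram_buckets` is the bucket count including +Inf (0 for a counter).
    """
    combinations = prod(label_cardinalities.values())
    # A histogram exposes every bucket plus _sum and _count per combination.
    per_combination = histogram_buckets + 2 if histogram_buckets else 1
    return combinations * per_combination


low = {"server_name": 10, "tool_name": 20}
high = {**low, "user_id": 5000}

print(estimate_series(low))                       # counter, low cardinality
print(estimate_series(low, histogram_buckets=7))  # duration histogram
print(estimate_series(high))                      # counter with user_id added
```

This is an upper bound, since not every label combination occurs in practice, but it shows how one unbounded label multiplies the total.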

Dashboards Not Loading

  1. Check Grafana logs:
    docker logs grafana
    
  2. Verify datasource connection: Grafana → Connections → Data sources → Prometheus → Test
  3. Check dashboard JSON:
    jq . infra/grafana/provisioning/dashboards/gateway-metrics.json
    

Next Steps

Logging

Configure structured logging and log aggregation

OpenTelemetry Setup

Set up OTLP export and distributed tracing

Overview

Return to observability overview

API Reference

Explore the monitoring API