@nullren
Created October 1, 2025 18:03
MCP Authentication Investigation: Why prometheus MCP tools aren't working

Problem Summary

The prometheus MCP server (monitoring-mcp) is failing to connect from the claude-worker, while the grafana and alertmanager MCP servers (both served by monitoring-ai-mcp) are working fine. The worker logs report the following connection status:

'mcp_servers': [
  {'name': 'alertmanager', 'status': 'connected'},
  {'name': 'grafana', 'status': 'connected'},
  {'name': 'prometheus', 'status': 'failed'}
]

Root Cause

The two MCP servers handle JWT authentication differently: one accepts any bearer token, the other validates it.

monitoring-ai-mcp (Working ✅)

Location: monitoring-ai/mcp/unified_mcp_server.py#L76-L94

# Log token details for debugging (no validation)
# todo - pending change to toolbelt
unverified_payload = jwt.decode(token, options={"verify_signature": False})
# ...
# Still allow the request even if decoding fails

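Key behavior: Accepts ANY token without validation. Even the placeholder string orchestration-agent, which is not a JWT and cannot be decoded, is let through, because the request is still allowed when decoding fails. This is a workaround for the RFC8707 audience problem described in section 4.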

monitoring-mcp (Failing ❌)

Location: monitoring-mcp/src/shared/auth.py#L46-L87

try:
    # Get signing key using PyJWKClient with built-in caching
    signing_key = self.jwks_client.get_signing_key_from_jwt(token)

    # Decode and validate JWT token
    payload = jwt.decode(
        token,
        signing_key.key,
        algorithms=["RS256"],
        issuer=self.expected_issuer,
        options={
            "verify_signature": True,
            "verify_aud": False,  # Disabled per toolbelt reference
            "verify_iss": True,
            "verify_exp": True
        }
    )

Key behavior: Actually validates JWT tokens (signature, issuer, expiration). Rejects the fake orchestration-agent token.

Investigation Details

1. Initial Discovery: Nebula Firewall Issue

The first issue was network connectivity: the pod couldn't reach monitoring-mcp because the nebula firewall had no outbound rule matching that service.

Pod's nebula groups:

["monitoring-ai-slackbot-temporal-claude-worker", "vault:...", "dev", "us-east-1", "az-b"]

Pod's outbound firewall rule:

- port: "8000"
  proto: tcp
  groups: [dev, monitoring-ai-mcp]  # WRONG - monitoring-mcp service has different groups

monitoring-mcp's actual nebula groups:

["monitoring-mcp", "vault:monitoring-mcp-dev", "dev", "us-east-1"]

Fix Applied: PR #26 - Added outbound rule for monitoring-mcp group
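
The mismatch can be checked mechanically. The sketch below assumes Nebula's documented behavior that every group listed under `groups` in a rule must be present on the remote host's certificate (the values are AND'd together). `rule_matches` is a hypothetical helper, not part of nebula or gondola, and the second rule is only an assumed shape for what PR #26 added.

```python
def rule_matches(rule_groups, host_groups):
    """Return True if the host carries every group the rule requires."""
    return set(rule_groups) <= set(host_groups)


# Groups copied from the investigation above.
monitoring_mcp_groups = ["monitoring-mcp", "vault:monitoring-mcp-dev", "dev", "us-east-1"]

old_rule = ["dev", "monitoring-ai-mcp"]  # original outbound rule
new_rule = ["monitoring-mcp"]            # assumed shape of the rule added in PR #26

print(rule_matches(old_rule, monitoring_mcp_groups))  # False: "monitoring-ai-mcp" is missing
print(rule_matches(new_rule, monitoring_mcp_groups))  # True
```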

2. After Network Fix: Authentication Failure

Even after network connectivity was established, the MCP server connection still failed. Testing showed:

# Network connectivity works
$ curl http://monitoring-mcp.service.dev-us-east-1.consul:8000
{"service":"Monitoring AI MCP Server", "tools":["hello_world","echo","server_info","prometheus_query","get_rules","list_metrics"]}

# But authentication fails
$ curl -X POST http://monitoring-mcp.service.dev-us-east-1.consul:8000/v1/mcp \
  -H "Authorization: Bearer orchestration-agent"
{"error":"invalid_token","error_description":"The access token is malformed or invalid"}

3. Configuration Analysis

claude-worker agent_config.yaml:

mcp_servers:
  grafana:
    type: "http"
    url: "${MONITORING_AI_MCP_URL}/monitoring/v1/mcp?toolset=grafana"
    headers:
      Authorization: "Bearer ${GRAFANA_TOKEN:-orchestration-agent}"

  alertmanager:
    type: "http"
    url: "${MONITORING_AI_MCP_URL}/monitoring/v1/mcp?toolset=alertmanager"
    headers:
      Authorization: "Bearer ${ALERTMANAGER_TOKEN:-orchestration-agent}"

  prometheus:
    type: "http"
    url: "http://monitoring-mcp.service.dev-us-east-1.consul:8000/v1/mcp"
    headers:
      Authorization: "Bearer ${PROMETHEUS_TOKEN:-orchestration-agent}"

All three use the same fake token orchestration-agent, but only prometheus fails.
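
To make the fallback concrete, here is a small sketch of shell-style `${NAME:-default}` expansion. It assumes the config loader follows the usual shell rule (use the default when the variable is unset or empty); the actual loader used by claude-worker was not inspected, and `expand` is a hypothetical helper.

```python
import re

# Matches ${NAME} and ${NAME:-default}.
_PATTERN = re.compile(r"\$\{(\w+)(?::-([^}]*))?\}")


def expand(value: str, env: dict) -> str:
    """Expand ${NAME} and ${NAME:-default} using shell-like rules."""
    def repl(match):
        name, default = match.group(1), match.group(2)
        current = env.get(name)
        # ":-" falls back when the variable is unset or empty.
        if current:
            return current
        return default if default is not None else ""
    return _PATTERN.sub(repl, value)


headers = {
    "grafana": "Bearer ${GRAFANA_TOKEN:-orchestration-agent}",
    "alertmanager": "Bearer ${ALERTMANAGER_TOKEN:-orchestration-agent}",
    "prometheus": "Bearer ${PROMETHEUS_TOKEN:-orchestration-agent}",
}

# With none of the *_TOKEN variables set, every header gets the placeholder.
resolved = {name: expand(value, env={}) for name, value in headers.items()}
for name, header in resolved.items():
    print(f"{name}: {header}")
```

Under that assumption, all three servers receive `Bearer orchestration-agent` unless the corresponding `*_TOKEN` variable is set in the worker's environment.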

4. The Toolbelt TODO Connection

Found related TODO in toolbelt/pkg/toolbelt/middleware/toolbeltauth/toolbeltauth.go#L58-L63:

token, err := jwt.ParseWithClaims(
    bearer,
    &toolbeltjwt.AccessTokenClaims{},
    k.Keyfunc,
    jwt.WithIssuer(self.AuthorizationServers[0]),
    // TODO: I forgot that mcp-go doesn't yet properly implement
    // RFC8707 so the correct audience is not yet being sent and not
    // being incorporated into the token, causing the integration
    // test suite to fail.
    // jwt.WithAudience(selfURL),
    jwt.WithValidMethods([]string{"RS256"}),
)

This explains the issue:

  • The MCP Go client doesn't properly implement RFC8707 (OAuth 2.0 Resource Indicators)
  • The correct audience isn't being sent in JWT tokens
  • As a workaround, audience validation is disabled in toolbelt

Both MCP servers have disabled audience validation to work around this:

  • monitoring-ai-mcp: Completely skips JWT validation
  • monitoring-mcp: Validates JWT but disables audience check (auth.py#L57)
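
For context, RFC 8707 has the client name the target service in a `resource` parameter on its authorization and token requests, which lets the authorization server scope the token's audience to that service. The sketch below builds such a token request body, assuming an authorization-code flow. The code, client ID, and redirect URI are placeholders, `token_request_body` is a hypothetical helper, and the resource value is the `MCP_JWT_AUDIENCE` configured for monitoring-mcp.

```python
from urllib.parse import parse_qs, urlencode


def token_request_body(code, client_id, redirect_uri, resource):
    """Build a form-encoded token request that carries an RFC 8707 resource indicator."""
    return urlencode({
        "grant_type": "authorization_code",
        "code": code,
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        # RFC 8707: identifies the protected resource the token is intended for.
        "resource": resource,
    })


body = token_request_body(
    code="PLACEHOLDER_CODE",                     # placeholder value
    client_id="example-client",                  # placeholder value
    redirect_uri="http://localhost:8080/callback",  # placeholder value
    resource="https://monitoring-mcp.tinyspeck.com/v1/mcp",
)

print(parse_qs(body)["resource"])
```

A client that omits `resource` gives the authorization server nothing to derive the audience from, which matches the failure described in the toolbelt TODO.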

5. Why monitoring-ai-mcp Works

The monitoring-ai-mcp server contains JWT handling code, but validation is completely disabled as a workaround:

Config: monitoring-ai/mcp/gondola.yaml#L82

MCP_JWT_AUDIENCE: "https://monitoring-ai-mcp-dev.tinyspeck.com/monitoring/v1/mcp"

Implementation: Despite having MCP_JWT_AUDIENCE set, the server accepts any token without validation - it's a temporary workaround documented in the code comments.

6. Why monitoring-mcp Doesn't Work

The monitoring-mcp server has proper JWT validation enabled:

Config: monitoring-mcp/gondola.yaml#L63

MCP_JWT_AUDIENCE: "https://monitoring-mcp.tinyspeck.com/v1/mcp"

Implementation: Actually validates JWT tokens (signature, issuer, expiration) and rejects invalid tokens.

Development mode option: The server CAN skip validation if MCP_JWT_AUDIENCE is unset (auth.py#L34-L36):

# Skip JWT validation entirely if no audience is configured (development)
if not self.expected_audience:
    logging.info("Skipping JWT validation - development mode (no MCP_JWT_AUDIENCE configured)")
    return await call_next(request)

Proposed Solutions

Option 1: Disable JWT Validation in monitoring-mcp (Recommended for Dev)

Match the monitoring-ai-mcp workaround approach for development environments.

Change: Update monitoring-mcp/gondola.yaml#L60-L63

environment:
  ENVIRONMENT: "development"
  PROMETHEUS_BASE_URL: "http://trickster-dev.internal.ec2.tinyspeck.com:9090"
  # MCP_JWT_AUDIENCE: "https://monitoring-mcp.tinyspeck.com/v1/mcp"  # Disabled for dev - no auth required

Pros:

  • Simple one-line change
  • Matches monitoring-ai-mcp behavior
  • Works until proper OAuth is implemented

Cons:

  • No authentication in dev environment
  • Need to remember to re-enable for production
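
One way to soften the second drawback is a startup check that refuses to run without an audience outside development. This is a hypothetical guard, not code from monitoring-mcp. It assumes the `ENVIRONMENT` and `MCP_JWT_AUDIENCE` variables shown above, and that `development` is the only environment where skipping validation is acceptable.

```python
import os


def check_auth_config(env=None):
    """Raise if JWT validation would be skipped outside development.

    Returns True when validation will be enforced, False when it is
    intentionally skipped in development.
    """
    env = os.environ if env is None else env
    environment = env.get("ENVIRONMENT", "")
    audience = env.get("MCP_JWT_AUDIENCE", "")
    if not audience and environment != "development":
        raise RuntimeError(
            "MCP_JWT_AUDIENCE is unset outside development; "
            "refusing to start without JWT validation"
        )
    return bool(audience)


# Dev with the audience commented out: allowed, validation skipped.
print(check_auth_config({"ENVIRONMENT": "development"}))

# Production without an audience: startup fails loudly.
try:
    check_auth_config({"ENVIRONMENT": "production"})
except RuntimeError as exc:
    print(f"startup blocked: {exc}")
```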

Option 2: Update monitoring-mcp to Skip Validation Like monitoring-ai-mcp

Keep MCP_JWT_AUDIENCE set but skip validation in code.

Change: Update monitoring-mcp/src/shared/auth.py#L46-L87 to match the monitoring-ai-mcp implementation:

try:
    # Log token details for debugging (no validation)
    # TODO: Pending MCP spec fix for RFC8707 audience handling
    # https://slack-github.com/slack/toolbelt/blob/main/pkg/toolbelt/middleware/toolbeltauth/toolbeltauth.go#L58-L63
    unverified_payload = jwt.decode(token, options={"verify_signature": False})
    logging.info(f"JWT Debug - Allowing request for user: {unverified_payload.get('sub')} (validation skipped)")
    request.state.jwt_payload = unverified_payload
    request.state.user = unverified_payload.get('sub')
except Exception as e:
    logging.warning(f"JWT decode error: {e}")
    # Still allow the request even if decoding fails

Pros:

  • Consistent behavior across both MCP servers
  • Documents the workaround clearly
  • Can still extract user info from token

Cons:

  • Security-wise, accepting any token isn't ideal
  • Need to update code in two places when the MCP spec is fixed

Option 3: Remove Authorization Header from agent_config.yaml

Remove the Authorization header for prometheus and let it fail auth, forcing monitoring-mcp to be updated.

Change: Update monitoring-ai-agent/monitoring-ai-slackbot/temporal/claude-worker/agent_config.yaml#L42-L46:

prometheus:
  type: "http"
  url: "http://monitoring-mcp.service.dev-us-east-1.consul:8000/v1/mcp"
  # headers:
  #   Authorization: "Bearer ${PROMETHEUS_TOKEN:-orchestration-agent}"

Pros:

  • Forces the monitoring-mcp team to decide on auth approach
  • Makes it clear auth isn't working
  • Removes fake token that looks real but isn't

Cons:

  • Doesn't solve the problem immediately
  • Still requires monitoring-mcp changes

Recommendation

For immediate fix: Use Option 1 (disable JWT validation in monitoring-mcp dev environment)

For long-term consistency: Use Option 2 (match monitoring-ai-mcp's workaround approach)

For production: Wait for proper MCP OAuth spec implementation and Claude Code SDK support, then re-enable full JWT validation with real tokens in both services.

Related Documentation

Additional Context

The PR that added nebula firewall rules also added documentation to agent_config.yaml about the requirements: monitoring-ai-agent PR #26

# IMPORTANT: When adding new MCP servers, you must also:
# 1. Update settings.json to allow the MCP tools (e.g., mcp__servername__toolname)
# 2. Update gondola.yaml nebula outbound firewall rules to allow connections to the service's nebula groups
#    - Find the service's nebula groups: kubectl get pods -l app=<service> -o yaml | grep "slack.com/nebula.groups"
#    - Add outbound rule with those groups to gondola.yaml under nebula.firewall.outbound
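
For the prometheus server, the names needed for step 1 can be derived from the tool list the server returned in the curl test earlier. This is a small sketch that assumes only the `mcp__servername__toolname` pattern from the comment above. `mcp_tool_names` is a hypothetical helper, and where the names go inside settings.json should follow that file's existing structure.

```python
def mcp_tool_names(server, tools):
    """Build permission names following the mcp__servername__toolname pattern."""
    return [f"mcp__{server}__{tool}" for tool in tools]


# Tool list as reported by monitoring-mcp's root endpoint.
tools = ["hello_world", "echo", "server_info", "prometheus_query", "get_rules", "list_metrics"]

for name in mcp_tool_names("prometheus", tools):
    print(name)
```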