# MCP Authentication Investigation: Why prometheus MCP tools aren't working

## Problem Summary

The prometheus MCP server (`monitoring-mcp`) fails to connect from the claude-worker, while the grafana and alertmanager MCP servers (both served by `monitoring-ai-mcp`) connect fine. The worker logs show:

```
'mcp_servers': [
  {'name': 'alertmanager', 'status': 'connected'},
  {'name': 'grafana', 'status': 'connected'},
  {'name': 'prometheus', 'status': 'failed'}
]
```

## Root Cause

The issue is a **JWT authentication mismatch**: the two MCP servers handle the bearer token differently.

### monitoring-ai-mcp (Working ✅)
**Location:** [monitoring-ai/mcp/unified_mcp_server.py#L76-L94](https://slack-github.com/slack/monitoring-ai/blob/main/mcp/unified_mcp_server.py#L76-L94)

```python
# Log token details for debugging (no validation)
# todo - pending change to toolbelt
unverified_payload = jwt.decode(token, options={"verify_signature": False})
# ...
# Still allow the request even if decoding fails
```

**Key behavior:** Accepts any token, even the placeholder string `orchestration-agent`, without validation. This is a temporary workaround for the audience issue described under "The Toolbelt TODO Connection" below.

### monitoring-mcp (Failing ❌)
**Location:** [monitoring-mcp/src/shared/auth.py#L46-L87](https://slack-github.com/slack/monitoring-mcp/blob/main/src/shared/auth.py#L46-L87)

```python
try:
    # Get signing key using PyJWKClient with built-in caching
    signing_key = self.jwks_client.get_signing_key_from_jwt(token)

    # Decode and validate JWT token
    payload = jwt.decode(
        token,
        signing_key.key,
        algorithms=["RS256"],
        issuer=self.expected_issuer,
        options={
            "verify_signature": True,
            "verify_aud": False,  # Disabled per toolbelt reference
            "verify_iss": True,
            "verify_exp": True
        }
    )
```

**Key behavior:** Validates JWT tokens (signature, issuer, expiration) and rejects the placeholder `orchestration-agent` string, which is not a well-formed JWT.
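A compact JWT consists of three base64url segments separated by dots, and a key lookup has to read the header before any signature check can run. The stdlib-only sketch below shows that `orchestration-agent` already fails at that first structural step. The helper names `peek_jwt_header` and `is_structurally_jwt` are illustrative and do not come from either repository.

```python
import base64
import json


def peek_jwt_header(token: str) -> dict:
    """Decode the header of a compact JWT without verifying anything.

    Raises ValueError if the token is not three dot-separated segments
    or if the first segment is not base64url-encoded JSON.
    """
    segments = token.split(".")
    if len(segments) != 3:
        raise ValueError(f"expected 3 segments, got {len(segments)}")
    header_b64 = segments[0]
    # base64url omits padding; restore it before decoding.
    padded = header_b64 + "=" * (-len(header_b64) % 4)
    try:
        header = json.loads(base64.urlsafe_b64decode(padded))
    except ValueError as exc:  # covers base64, UTF-8 and JSON errors
        raise ValueError("header is not base64url-encoded JSON") from exc
    if not isinstance(header, dict):
        raise ValueError("header is not a JSON object")
    return header


def is_structurally_jwt(token: str) -> bool:
    """True if the token at least has the shape of a JWT."""
    try:
        peek_jwt_header(token)
    except ValueError:
        return False
    return True


print(is_structurally_jwt("orchestration-agent"))  # False
```

Any server that reads the token header to select a signing key will reject this string regardless of how its signature, issuer, or audience checks are configured.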

## Investigation Details

### 1. Initial Discovery: Nebula Firewall Issue

The first issue was network connectivity: the pod couldn't reach monitoring-mcp because the required nebula firewall rules were missing.

**Pod's nebula groups:**
```json
["monitoring-ai-slackbot-temporal-claude-worker", "vault:...", "dev", "us-east-1", "az-b"]
```

**Pod's outbound firewall rule:**
```yaml
- port: "8000"
  proto: tcp
  groups: [dev, monitoring-ai-mcp]  # WRONG - monitoring-mcp service has different groups
```

**monitoring-mcp's actual nebula groups:**
```json
["monitoring-mcp", "vault:monitoring-mcp-dev", "dev", "us-east-1"]
```

**Fix Applied:** [PR #26](https://slack-github.com/slack/monitoring-ai-agent/pull/26) - Added outbound rule for `monitoring-mcp` group
- [gondola.yaml changes](https://slack-github.com/slack/monitoring-ai-agent/pull/26/files#diff-monitoring-ai-slackbot/gondola.yaml)
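The mismatch can be checked mechanically. The sketch below assumes Nebula's documented behavior that multiple values in a rule's `groups` list are AND'd together, so the remote host must carry every listed group. The function name `rule_allows` and the exact group list of the fixed rule are assumptions for illustration; the actual rule is in the PR above.

```python
def rule_allows(rule_groups, remote_groups):
    """True if the remote host carries every group the rule lists (AND semantics)."""
    return set(rule_groups) <= set(remote_groups)


# Group lists copied from the investigation above.
original_rule = ["dev", "monitoring-ai-mcp"]
monitoring_mcp_groups = ["monitoring-mcp", "vault:monitoring-mcp-dev", "dev", "us-east-1"]

# Assumed shape of the rule added by the fix (hypothetical group list).
fixed_rule = ["monitoring-mcp"]

print(rule_allows(original_rule, monitoring_mcp_groups))  # False
print(rule_allows(fixed_rule, monitoring_mcp_groups))     # True
```

The original rule fails because monitoring-mcp carries `dev` but not `monitoring-ai-mcp`.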

### 2. After Network Fix: Authentication Failure

Even after network connectivity was established, the MCP server connection still failed. Testing showed:

```bash
# Network connectivity works
$ curl http://monitoring-mcp.service.dev-us-east-1.consul:8000
{"service":"Monitoring AI MCP Server", "tools":["hello_world","echo","server_info","prometheus_query","get_rules","list_metrics"]}

# But authentication fails
$ curl -X POST http://monitoring-mcp.service.dev-us-east-1.consul:8000/v1/mcp \
  -H "Authorization: Bearer orchestration-agent"
{"error":"invalid_token","error_description":"The access token is malformed or invalid"}
```

### 3. Configuration Analysis

**claude-worker agent_config.yaml:**
```yaml
mcp_servers:
  grafana:
    type: "http"
    url: "${MONITORING_AI_MCP_URL}/monitoring/v1/mcp?toolset=grafana"
    headers:
      Authorization: "Bearer ${GRAFANA_TOKEN:-orchestration-agent}"

  alertmanager:
    type: "http"
    url: "${MONITORING_AI_MCP_URL}/monitoring/v1/mcp?toolset=alertmanager"
    headers:
      Authorization: "Bearer ${ALERTMANAGER_TOKEN:-orchestration-agent}"

  prometheus:
    type: "http"
    url: "http://monitoring-mcp.service.dev-us-east-1.consul:8000/v1/mcp"
    headers:
      Authorization: "Bearer ${PROMETHEUS_TOKEN:-orchestration-agent}"
```

All three headers resolve to the same placeholder token, `orchestration-agent`, yet only prometheus fails.
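To make the fallback explicit, here is a minimal sketch of shell-style `${VAR:-default}` expansion. How agent_config.yaml is actually expanded is not shown in this investigation, so the `expand` helper and its semantics (unset and empty both fall back to the default) are assumptions.

```python
import re

# Matches ${NAME} and ${NAME:-default}.
_VAR = re.compile(r"\$\{(?P<name>[A-Za-z_][A-Za-z0-9_]*)(?::-(?P<default>[^}]*))?\}")


def expand(value: str, env: dict) -> str:
    """Expand ${NAME:-default} placeholders using shell-like rules (assumed)."""

    def replace(match):
        current = env.get(match.group("name"))
        if current:  # ":-" treats unset and empty the same way
            return current
        return match.group("default") or ""

    return _VAR.sub(replace, value)


header = "Bearer ${PROMETHEUS_TOKEN:-orchestration-agent}"
print(expand(header, {}))                                # Bearer orchestration-agent
print(expand(header, {"PROMETHEUS_TOKEN": "real-jwt"}))  # Bearer real-jwt
```

Under these assumptions, setting `PROMETHEUS_TOKEN` to a token the server accepts would change the header without touching the config file.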

### 4. The Toolbelt TODO Connection

Found related TODO in [toolbelt/pkg/toolbelt/middleware/toolbeltauth/toolbeltauth.go#L58-L63](https://slack-github.com/slack/toolbelt/blob/main/pkg/toolbelt/middleware/toolbeltauth/toolbeltauth.go#L58-L63):

```go
token, err := jwt.ParseWithClaims(
    bearer,
    &toolbeltjwt.AccessTokenClaims{},
    k.Keyfunc,
    jwt.WithIssuer(self.AuthorizationServers[0]),
    // TODO: I forgot that mcp-go doesn't yet properly implement
    // RFC8707 so the correct audience is not yet being sent and not
    // being incorporated into the token, causing the integration
    // test suite to fail.
    // jwt.WithAudience(selfURL),
    jwt.WithValidMethods([]string{"RS256"}),
)
```

This explains the issue:
- mcp-go does not yet properly implement RFC 8707 (OAuth 2.0 Resource Indicators)
- As a result, the correct audience is neither sent by the client nor incorporated into the JWT
- As a workaround, audience validation is disabled in toolbelt

Both MCP servers have disabled audience validation to work around this:
- **monitoring-ai-mcp:** Completely skips JWT validation
- **monitoring-mcp:** Validates JWT but disables audience check ([auth.py#L57](https://slack-github.com/slack/monitoring-mcp/blob/main/src/shared/auth.py#L57))
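For context, an audience check compares the token's `aud` claim, which RFC 7519 allows to be a single string or a list of strings, against the server's own resource URL. The sketch below is a simplified stand-in for what a JWT library does when audience verification is enabled. The `audience_matches` helper and the example claims are hypothetical.

```python
def audience_matches(claims: dict, expected: str) -> bool:
    """True if the expected audience appears in the token's `aud` claim."""
    aud = claims.get("aud")
    if aud is None:
        return False
    if isinstance(aud, str):
        return aud == expected
    return expected in aud


EXPECTED = "https://monitoring-mcp.tinyspeck.com/v1/mcp"

# Hypothetical claims from a client that never sent a resource indicator.
claims_without_aud = {"sub": "some-user"}
# Hypothetical claims once the audience is carried into the token.
claims_with_aud = {"sub": "some-user", "aud": [EXPECTED]}

print(audience_matches(claims_without_aud, EXPECTED))  # False
print(audience_matches(claims_with_aud, EXPECTED))     # True
```

With audience verification on, a token that lacks the right `aud` is rejected even when its signature and issuer are valid, which is the failure the toolbelt TODO describes.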

### 5. Why monitoring-ai-mcp Works

The monitoring-ai-mcp server has JWT validation code, but it is **completely disabled** as a workaround:

**Config:** [monitoring-ai/mcp/gondola.yaml#L82](https://slack-github.com/slack/monitoring-ai/blob/main/mcp/gondola.yaml#L82)
```yaml
MCP_JWT_AUDIENCE: "https://monitoring-ai-mcp-dev.tinyspeck.com/monitoring/v1/mcp"
```

**Implementation:** Despite having `MCP_JWT_AUDIENCE` set, the server accepts any token without validation - it's a temporary workaround documented in the code comments.

### 6. Why monitoring-mcp Doesn't Work

The monitoring-mcp server has proper JWT validation enabled:

**Config:** [monitoring-mcp/gondola.yaml#L63](https://slack-github.com/slack/monitoring-mcp/blob/main/gondola.yaml#L63)
```yaml
MCP_JWT_AUDIENCE: "https://monitoring-mcp.tinyspeck.com/v1/mcp"
```

**Implementation:** Actually validates JWT tokens (signature, issuer, expiration) and rejects invalid tokens.

**Development mode option:** The server CAN skip validation if `MCP_JWT_AUDIENCE` is unset ([auth.py#L34-L36](https://slack-github.com/slack/monitoring-mcp/blob/main/src/shared/auth.py#L34-L36)):
```python
# Skip JWT validation entirely if no audience is configured (development)
if not self.expected_audience:
    logging.info("Skipping JWT validation - development mode (no MCP_JWT_AUDIENCE configured)")
    return await call_next(request)
```
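The effect of that gate on a given environment can be sketched as follows. This assumes `expected_audience` is populated directly from the `MCP_JWT_AUDIENCE` environment variable, which the excerpt implies but does not show. `should_skip_validation` is an illustrative name.

```python
from typing import Mapping


def should_skip_validation(env: Mapping[str, str]) -> bool:
    """Mirror the quoted gate: an unset or empty audience means no JWT checks."""
    return not env.get("MCP_JWT_AUDIENCE")


# Current dev configuration, per the gondola.yaml line quoted above.
current_env = {"MCP_JWT_AUDIENCE": "https://monitoring-mcp.tinyspeck.com/v1/mcp"}
# The same environment with the audience variable removed.
env_without_audience = {}

print(should_skip_validation(current_env))           # False
print(should_skip_validation(env_without_audience))  # True
```

Note that an empty string skips validation just like a missing variable, because the quoted check is a plain truthiness test.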

## Proposed Solutions

### Option 1: Disable JWT Validation in monitoring-mcp (Recommended for Dev)

Match the monitoring-ai-mcp workaround approach for development environments.

**Change:** Update [monitoring-mcp/gondola.yaml#L60-L63](https://slack-github.com/slack/monitoring-mcp/blob/main/gondola.yaml#L60-L63)
```yaml
environment:
  ENVIRONMENT: "development"
  PROMETHEUS_BASE_URL: "http://trickster-dev.internal.ec2.tinyspeck.com:9090"
  # MCP_JWT_AUDIENCE: "https://monitoring-mcp.tinyspeck.com/v1/mcp"  # Disabled for dev - no auth required
```

**Pros:**
- Simple one-line change
- Matches monitoring-ai-mcp behavior
- Works until proper OAuth is implemented

**Cons:**
- No authentication in dev environment
- Need to remember to re-enable for production

### Option 2: Update monitoring-mcp to Skip Validation Like monitoring-ai-mcp

Keep `MCP_JWT_AUDIENCE` set but skip validation in code.

**Change:** Update [monitoring-mcp/src/shared/auth.py#L46-L87](https://slack-github.com/slack/monitoring-mcp/blob/main/src/shared/auth.py#L46-L87) to match the monitoring-ai-mcp implementation:
```python
try:
    # Log token details for debugging (no validation)
    # TODO: Pending MCP spec fix for RFC8707 audience handling
    # https://slack-github.com/slack/toolbelt/blob/main/pkg/toolbelt/middleware/toolbeltauth/toolbeltauth.go#L58-L63
    unverified_payload = jwt.decode(token, options={"verify_signature": False})
    logging.info(f"JWT Debug - Allowing request for user: {unverified_payload.get('sub')} (validation skipped)")
    request.state.jwt_payload = unverified_payload
    request.state.user = unverified_payload.get('sub')
except Exception as e:
    logging.warning(f"JWT decode error: {e}")
    # Still allow the request even if decoding fails
```

**Pros:**
- Consistent behavior across both MCP servers
- Documents the workaround clearly
- Can still extract user info from token

**Cons:**
- Security-wise, accepting any token isn't ideal
- Need to update code in two places when the MCP spec is fixed

### Option 3: Remove Authorization Header from agent_config.yaml

Remove the Authorization header for prometheus and let it fail auth, forcing monitoring-mcp to be updated.

**Change:** Update [monitoring-ai-agent/monitoring-ai-slackbot/temporal/claude-worker/agent_config.yaml#L42-L46](https://slack-github.com/slack/monitoring-ai-agent/blob/main/monitoring-ai-slackbot/temporal/claude-worker/agent_config.yaml#L42-L46):
```yaml
prometheus:
  type: "http"
  url: "http://monitoring-mcp.service.dev-us-east-1.consul:8000/v1/mcp"
  # headers:
  #   Authorization: "Bearer ${PROMETHEUS_TOKEN:-orchestration-agent}"
```

**Pros:**
- Forces the monitoring-mcp team to decide on auth approach
- Makes it clear auth isn't working
- Removes fake token that looks real but isn't

**Cons:**
- Doesn't solve the problem immediately
- Still requires monitoring-mcp changes

## Recommendation

**For immediate fix:** Use **Option 1** (disable JWT validation in monitoring-mcp dev environment)

**For long-term consistency:** Use **Option 2** (match monitoring-ai-mcp's workaround approach)

**For production:** Wait for proper MCP OAuth spec implementation and Claude Code SDK support, then re-enable full JWT validation with real tokens in both services.

## Related Documentation

- [Claude Code MCP Authentication Docs](https://docs.claude.com/en/docs/claude-code/mcp)
- [RFC 8707: OAuth 2.0 Resource Indicators](https://datatracker.ietf.org/doc/html/rfc8707)
- [Toolbelt Authentication Middleware](https://slack-github.com/slack/toolbelt/tree/main/pkg/toolbelt/middleware/toolbeltauth)

## Additional Context

The PR that added nebula firewall rules also added documentation to agent_config.yaml about the requirements:
[monitoring-ai-agent PR #26](https://slack-github.com/slack/monitoring-ai-agent/pull/26)

```yaml
# IMPORTANT: When adding new MCP servers, you must also:
# 1. Update settings.json to allow the MCP tools (e.g., mcp__servername__toolname)
# 2. Update gondola.yaml nebula outbound firewall rules to allow connections to the service's nebula groups
#    - Find the service's nebula groups: kubectl get pods -l app=<service> -o yaml | grep "slack.com/nebula.groups"
#    - Add outbound rule with those groups to gondola.yaml under nebula.firewall.outbound
```