nullren · October 1, 2025 18:03
diff --git a/mcp-auth-investigation.md b/mcp-auth-investigation.md
@@ -0,0 +1,280 @@
+# MCP Authentication Investigation: Why prometheus MCP tools aren't working
+
+## Problem Summary
+
+The prometheus MCP server (`monitoring-mcp`) fails to connect from the claude-worker, while the grafana and alertmanager MCP servers (both served by `monitoring-ai-mcp`) connect fine. The connection status in the logs:
+
+```
+'mcp_servers': [
+  {'name': 'alertmanager', 'status': 'connected'},
+  {'name': 'grafana', 'status': 'connected'},
+  {'name': 'prometheus', 'status': 'failed'}
+]
+```
+
+## Root Cause
+
+The root cause is a **JWT authentication mismatch**: the two MCP servers handle bearer tokens differently.
+
+### monitoring-ai-mcp (Working ✅)
+**Location:** [monitoring-ai/mcp/unified_mcp_server.py#L76-L94](https://slack-github.com/slack/monitoring-ai/blob/main/mcp/unified_mcp_server.py#L76-L94)
+
+```python
+# Log token details for debugging (no validation)
+# todo - pending change to toolbelt
+unverified_payload = jwt.decode(token, options={"verify_signature": False})
+# ...
+# Still allow the request even if decoding fails
+```
+
+**Key behavior:** Accepts ANY token (even the fake `orchestration-agent` string) without validation. This is a workaround for the RFC 8707 audience issue described in section 4.
+
+### monitoring-mcp (Failing ❌)
+**Location:** [monitoring-mcp/src/shared/auth.py#L46-L87](https://slack-github.com/slack/monitoring-mcp/blob/main/src/shared/auth.py#L46-L87)
+
+```python
+try:
+    # Get signing key using PyJWKClient with built-in caching
+    signing_key = self.jwks_client.get_signing_key_from_jwt(token)
+
+    # Decode and validate JWT token
+    payload = jwt.decode(
+        token,
+        signing_key.key,
+        algorithms=["RS256"],
+        issuer=self.expected_issuer,
+        options={
+            "verify_signature": True,
+            "verify_aud": False,  # Disabled per toolbelt reference
+            "verify_iss": True,
+            "verify_exp": True
+        }
+    )
+```
+
+**Key behavior:** Actually validates JWT tokens (signature, issuer, expiration). Rejects the fake `orchestration-agent` token.
+
+## Investigation Details
+
+### 1. Initial Discovery: Nebula Firewall Issue
+
+The first issue was network connectivity: the pod couldn't reach monitoring-mcp because its nebula outbound firewall rule did not cover the service's groups.
+
+**Pod's nebula groups:**
+```json
+["monitoring-ai-slackbot-temporal-claude-worker", "vault:...", "dev", "us-east-1", "az-b"]
+```
+
+**Pod's outbound firewall rule:**
+```yaml
+- port: "8000"
+  proto: tcp
+  groups: [dev, monitoring-ai-mcp]  # WRONG - monitoring-mcp service has different groups
+```
+
+**monitoring-mcp's actual nebula groups:**
+```json
+["monitoring-mcp", "vault:monitoring-mcp-dev", "dev", "us-east-1"]
+```
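+
+The mismatch can be checked mechanically. A minimal sketch, assuming nebula's documented semantics that a rule's `groups` list is AND'd (the remote host must carry every listed group). The helper name `rule_allows` is hypothetical, and the "added" rule is only an assumption about what a rule targeting the `monitoring-mcp` group looks like:
+
+```python
+def rule_allows(rule_groups: list[str], host_groups: list[str]) -> bool:
+    """Return True if the host carries every group the rule lists (AND semantics)."""
+    return set(rule_groups).issubset(host_groups)
+
+
+# Groups taken from the investigation above.
+monitoring_mcp_groups = ["monitoring-mcp", "vault:monitoring-mcp-dev", "dev", "us-east-1"]
+
+original_rule = ["dev", "monitoring-ai-mcp"]
+# Assumption: an outbound rule that targets only the monitoring-mcp group.
+added_rule = ["monitoring-mcp"]
+
+print(rule_allows(original_rule, monitoring_mcp_groups))  # False
+print(rule_allows(added_rule, monitoring_mcp_groups))     # True
+```
+
+The original rule fails because monitoring-mcp carries `dev` but not `monitoring-ai-mcp`.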
+
+**Fix Applied:** [PR #26](https://slack-github.com/slack/monitoring-ai-agent/pull/26) - Added outbound rule for `monitoring-mcp` group
+- [gondola.yaml changes](https://slack-github.com/slack/monitoring-ai-agent/pull/26/files#diff-monitoring-ai-slackbot/gondola.yaml)
+
+### 2. After Network Fix: Authentication Failure
+
+Even after network connectivity was established, the MCP server connection still failed. Testing showed:
+
+```bash
+# Network connectivity works
+$ curl http://monitoring-mcp.service.dev-us-east-1.consul:8000
+{"service":"Monitoring AI MCP Server", "tools":["hello_world","echo","server_info","prometheus_query","get_rules","list_metrics"]}
+
+# But authentication fails
+$ curl -X POST http://monitoring-mcp.service.dev-us-east-1.consul:8000/v1/mcp \
+  -H "Authorization: Bearer orchestration-agent"
+{"error":"invalid_token","error_description":"The access token is malformed or invalid"}
+```
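+
+The `invalid_token` response is expected for this input: `orchestration-agent` is not even structurally a JWT. A minimal standard-library sketch of the structural test a validating server effectively applies before it can look up a signing key. The helper `looks_like_jwt` is hypothetical, performs no signature check, and is not the server's actual code:
+
+```python
+import base64
+import json
+
+
+def looks_like_jwt(token: str) -> bool:
+    """Structural check only: three dot-separated segments with a JSON header."""
+    parts = token.split(".")
+    if len(parts) != 3:
+        return False
+    header_b64 = parts[0]
+    # base64url segments omit padding; restore it before decoding.
+    padded = header_b64 + "=" * (-len(header_b64) % 4)
+    try:
+        header = json.loads(base64.urlsafe_b64decode(padded))
+    except ValueError:
+        return False
+    return isinstance(header, dict) and "alg" in header
+
+
+print(looks_like_jwt("orchestration-agent"))  # False: only one segment
+```
+
+A server that never parses the token, as monitoring-ai-mcp does, has no such failure point, which is why the same string is accepted there.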
+
+### 3. Configuration Analysis
+
+**claude-worker agent_config.yaml:**
+```yaml
+mcp_servers:
+  grafana:
+    type: "http"
+    url: "${MONITORING_AI_MCP_URL}/monitoring/v1/mcp?toolset=grafana"
+    headers:
+      Authorization: "Bearer ${GRAFANA_TOKEN:-orchestration-agent}"
+
+  alertmanager:
+    type: "http"
+    url: "${MONITORING_AI_MCP_URL}/monitoring/v1/mcp?toolset=alertmanager"
+    headers:
+      Authorization: "Bearer ${ALERTMANAGER_TOKEN:-orchestration-agent}"
+
+  prometheus:
+    type: "http"
+    url: "http://monitoring-mcp.service.dev-us-east-1.consul:8000/v1/mcp"
+    headers:
+      Authorization: "Bearer ${PROMETHEUS_TOKEN:-orchestration-agent}"
+```
+
+All three fall back to the same fake token `orchestration-agent` when their token environment variable is unset, but only prometheus fails.
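+
+The fake token is the fallback in the `${VAR:-default}` expressions above. A minimal sketch of that expansion, assuming the config loader follows shell-style default semantics (use the default when the variable is unset or empty). The function `expand_defaults` and the value `real-token` are hypothetical:
+
+```python
+import re
+
+_PATTERN = re.compile(r"\$\{(\w+):-([^}]*)\}")
+
+
+def expand_defaults(value: str, env: dict[str, str]) -> str:
+    """Replace ${VAR:-default} with env[VAR], or default if VAR is unset or empty."""
+    return _PATTERN.sub(lambda m: env.get(m.group(1)) or m.group(2), value)
+
+
+header = "Bearer ${PROMETHEUS_TOKEN:-orchestration-agent}"
+print(expand_defaults(header, {}))  # Bearer orchestration-agent
+print(expand_defaults(header, {"PROMETHEUS_TOKEN": "real-token"}))  # Bearer real-token
+```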
+
+### 4. The Toolbelt TODO Connection
+
+Found related TODO in [toolbelt/pkg/toolbelt/middleware/toolbeltauth/toolbeltauth.go#L58-L63](https://slack-github.com/slack/toolbelt/blob/main/pkg/toolbelt/middleware/toolbeltauth/toolbeltauth.go#L58-L63):
+
+```go
+token, err := jwt.ParseWithClaims(
+    bearer,
+    &toolbeltjwt.AccessTokenClaims{},
+    k.Keyfunc,
+    jwt.WithIssuer(self.AuthorizationServers[0]),
+    // TODO: I forgot that mcp-go doesn't yet properly implement
+    // RFC8707 so the correct audience is not yet being sent and not
+    // being incorporated into the token, causing the integration
+    // test suite to fail.
+    // jwt.WithAudience(selfURL),
+    jwt.WithValidMethods([]string{"RS256"}),
+)
+```
+
+This explains the issue:
+- The MCP Go client doesn't properly implement RFC8707 (OAuth 2.0 Resource Indicators)
+- The correct audience isn't being sent in JWT tokens
+- As a workaround, audience validation is disabled in toolbelt
+
+Both MCP servers have disabled audience validation to work around this:
+- **monitoring-ai-mcp:** Completely skips JWT validation
+- **monitoring-mcp:** Validates JWT but disables audience check ([auth.py#L57](https://slack-github.com/slack/monitoring-mcp/blob/main/src/shared/auth.py#L57))
+
+### 5. Why monitoring-ai-mcp Works
+
+The monitoring-ai-mcp server has JWT validation code, but it is **completely disabled** as a workaround:
+
+**Config:** [monitoring-ai/mcp/gondola.yaml#L82](https://slack-github.com/slack/monitoring-ai/blob/main/mcp/gondola.yaml#L82)
+```yaml
+MCP_JWT_AUDIENCE: "https://monitoring-ai-mcp-dev.tinyspeck.com/monitoring/v1/mcp"
+```
+
+**Implementation:** Despite having `MCP_JWT_AUDIENCE` set, the server accepts any token without validation - it's a temporary workaround documented in the code comments.
+
+### 6. Why monitoring-mcp Doesn't Work
+
+The monitoring-mcp server has proper JWT validation enabled:
+
+**Config:** [monitoring-mcp/gondola.yaml#L63](https://slack-github.com/slack/monitoring-mcp/blob/main/gondola.yaml#L63)
+```yaml
+MCP_JWT_AUDIENCE: "https://monitoring-mcp.tinyspeck.com/v1/mcp"
+```
+
+**Implementation:** Actually validates JWT tokens (signature, issuer, expiration) and rejects invalid tokens.
+
+**Development mode option:** The server CAN skip validation if `MCP_JWT_AUDIENCE` is unset ([auth.py#L34-L36](https://slack-github.com/slack/monitoring-mcp/blob/main/src/shared/auth.py#L34-L36)):
+```python
+# Skip JWT validation entirely if no audience is configured (development)
+if not self.expected_audience:
+    logging.info("Skipping JWT validation - development mode (no MCP_JWT_AUDIENCE configured)")
+    return await call_next(request)
+```
+
+## Proposed Solutions
+
+### Option 1: Disable JWT Validation in monitoring-mcp (Recommended for Dev)
+
+Match the monitoring-ai-mcp workaround approach for development environments.
+
+**Change:** Update [monitoring-mcp/gondola.yaml#L60-L63](https://slack-github.com/slack/monitoring-mcp/blob/main/gondola.yaml#L60-L63)
+```yaml
+environment:
+  ENVIRONMENT: "development"
+  PROMETHEUS_BASE_URL: "http://trickster-dev.internal.ec2.tinyspeck.com:9090"
+  # MCP_JWT_AUDIENCE: "https://monitoring-mcp.tinyspeck.com/v1/mcp"  # Disabled for dev - no auth required
+```
+
+**Pros:**
+- Simple one-line change
+- Matches monitoring-ai-mcp behavior
+- Works until proper OAuth is implemented
+
+**Cons:**
+- No authentication in dev environment
+- Need to remember to re-enable for production
+
+### Option 2: Update monitoring-mcp to Skip Validation Like monitoring-ai-mcp
+
+Keep `MCP_JWT_AUDIENCE` set but skip validation in code.
+
+**Change:** Update [monitoring-mcp/src/shared/auth.py#L46-L87](https://slack-github.com/slack/monitoring-mcp/blob/main/src/shared/auth.py#L46-L87) to match the monitoring-ai-mcp implementation:
+```python
+try:
+    # Log token details for debugging (no validation)
+    # TODO: Pending MCP spec fix for RFC8707 audience handling
+    # https://slack-github.com/slack/toolbelt/blob/main/pkg/toolbelt/middleware/toolbeltauth/toolbeltauth.go#L58-L63
+    unverified_payload = jwt.decode(token, options={"verify_signature": False})
+    logging.info(f"JWT Debug - Allowing request for user: {unverified_payload.get('sub')} (validation skipped)")
+    request.state.jwt_payload = unverified_payload
+    request.state.user = unverified_payload.get('sub')
+except Exception as e:
+    logging.warning(f"JWT decode error: {e}")
+    # Still allow the request even if decoding fails
+```
+
+**Pros:**
+- Consistent behavior across both MCP servers
+- Documents the workaround clearly
+- Can still extract user info from token
+
+**Cons:**
+- Security-wise, accepting any token isn't ideal
+- Need to update code in two places when the MCP spec is fixed
+
+### Option 3: Remove Authorization Header from agent_config.yaml
+
+Remove the Authorization header for prometheus and let it fail auth, forcing monitoring-mcp to be updated.
+
+**Change:** Update [monitoring-ai-agent/monitoring-ai-slackbot/temporal/claude-worker/agent_config.yaml#L42-L46](https://slack-github.com/slack/monitoring-ai-agent/blob/main/monitoring-ai-slackbot/temporal/claude-worker/agent_config.yaml#L42-L46):
+```yaml
+prometheus:
+  type: "http"
+  url: "http://monitoring-mcp.service.dev-us-east-1.consul:8000/v1/mcp"
+  # headers:
+  #   Authorization: "Bearer ${PROMETHEUS_TOKEN:-orchestration-agent}"
+```
+
+**Pros:**
+- Forces the monitoring-mcp team to decide on auth approach
+- Makes it clear auth isn't working
+- Removes fake token that looks real but isn't
+
+**Cons:**
+- Doesn't solve the problem immediately
+- Still requires monitoring-mcp changes
+
+## Recommendation
+
+**For immediate fix:** Use **Option 1** (disable JWT validation in monitoring-mcp dev environment)
+
+**For long-term consistency:** Use **Option 2** (match monitoring-ai-mcp's workaround approach)
+
+**For production:** Wait for proper MCP OAuth spec implementation and Claude Code SDK support, then re-enable full JWT validation with real tokens in both services.
+
+## Related Documentation
+
+- [Claude Code MCP Authentication Docs](https://docs.claude.com/en/docs/claude-code/mcp)
+- [RFC 8707: OAuth 2.0 Resource Indicators](https://datatracker.ietf.org/doc/html/rfc8707)
+- [Toolbelt Authentication Middleware](https://slack-github.com/slack/toolbelt/tree/main/pkg/toolbelt/middleware/toolbeltauth)
+
+## Additional Context
+
+The PR that added nebula firewall rules also added documentation to agent_config.yaml about the requirements:
+[monitoring-ai-agent PR #26](https://slack-github.com/slack/monitoring-ai-agent/pull/26)
+
+```yaml
+# IMPORTANT: When adding new MCP servers, you must also:
+# 1. Update settings.json to allow the MCP tools (e.g., mcp__servername__toolname)
+# 2. Update gondola.yaml nebula outbound firewall rules to allow connections to the service's nebula groups
+#    - Find the service's nebula groups: kubectl get pods -l app=<service> -o yaml | grep "slack.com/nebula.groups"
+#    - Add outbound rule with those groups to gondola.yaml under nebula.firewall.outbound
+```