# MCP Authentication Investigation: Why prometheus MCP tools aren't working

## Problem Summary

The prometheus MCP server (`monitoring-mcp`) is failing to connect from the claude-worker, while the grafana and alertmanager MCP servers (`monitoring-ai-mcp`) are working fine.

The error seen in logs:

```
'mcp_servers': [
    {'name': 'alertmanager', 'status': 'connected'},
    {'name': 'grafana', 'status': 'connected'},
    {'name': 'prometheus', 'status': 'failed'}
]
```

## Root Cause

The issue is a **JWT authentication mismatch** between how the two MCP servers handle authentication:

### monitoring-ai-mcp (Working ✅)

**Location:** [monitoring-ai/mcp/unified_mcp_server.py#L76-L94](https://slack-github.com/slack/monitoring-ai/blob/main/mcp/unified_mcp_server.py#L76-L94)

```python
# Log token details for debugging (no validation)
# todo - pending change to toolbelt
unverified_payload = jwt.decode(token, options={"verify_signature": False})
# ...
# Still allow the request even if decoding fails
```

**Key behavior:** Accepts ANY token (even the fake `orchestration-agent` string) without validation. This is a workaround for the MCP JWT spec issues.

### monitoring-mcp (Failing ❌)

**Location:** [monitoring-mcp/src/shared/auth.py#L46-L87](https://slack-github.com/slack/monitoring-mcp/blob/main/src/shared/auth.py#L46-L87)

```python
try:
    # Get signing key using PyJWKClient with built-in caching
    signing_key = self.jwks_client.get_signing_key_from_jwt(token)

    # Decode and validate JWT token
    payload = jwt.decode(
        token,
        signing_key.key,
        algorithms=["RS256"],
        issuer=self.expected_issuer,
        options={
            "verify_signature": True,
            "verify_aud": False,  # Disabled per toolbelt reference
            "verify_iss": True,
            "verify_exp": True
        }
    )
```

**Key behavior:** Actually validates JWT tokens (signature, issuer, expiration). Rejects the fake `orchestration-agent` token.

## Investigation Details

### 1. Initial Discovery: Nebula Firewall Issue

The first issue was network connectivity: the pod couldn't reach monitoring-mcp because the required nebula firewall rules were missing.

**Pod's nebula groups:**

```json
["monitoring-ai-slackbot-temporal-claude-worker", "vault:...", "dev", "us-east-1", "az-b"]
```

**Pod's outbound firewall rule:**

```yaml
- port: "8000"
  proto: tcp
  groups: [dev, monitoring-ai-mcp]  # WRONG - monitoring-mcp service has different groups
```

**monitoring-mcp's actual nebula groups:**

```json
["monitoring-mcp", "vault:monitoring-mcp-dev", "dev", "us-east-1"]
```

**Fix Applied:** [PR #26](https://slack-github.com/slack/monitoring-ai-agent/pull/26)

- Added outbound rule for `monitoring-mcp` group
- [gondola.yaml changes](https://slack-github.com/slack/monitoring-ai-agent/pull/26/files#diff-monitoring-ai-slackbot/gondola.yaml)

### 2. After Network Fix: Authentication Failure

Even after network connectivity was established, the MCP server connection still failed. Testing showed:

```bash
# Network connectivity works
$ curl http://monitoring-mcp.service.dev-us-east-1.consul:8000
{"service":"Monitoring AI MCP Server", "tools":["hello_world","echo","server_info","prometheus_query","get_rules","list_metrics"]}

# But authentication fails
$ curl -X POST http://monitoring-mcp.service.dev-us-east-1.consul:8000/v1/mcp \
    -H "Authorization: Bearer orchestration-agent"
{"error":"invalid_token","error_description":"The access token is malformed or invalid"}
```

### 3. Configuration Analysis

**claude-worker agent_config.yaml:**

```yaml
mcp_servers:
  grafana:
    type: "http"
    url: "${MONITORING_AI_MCP_URL}/monitoring/v1/mcp?toolset=grafana"
    headers:
      Authorization: "Bearer ${GRAFANA_TOKEN:-orchestration-agent}"
  alertmanager:
    type: "http"
    url: "${MONITORING_AI_MCP_URL}/monitoring/v1/mcp?toolset=alertmanager"
    headers:
      Authorization: "Bearer ${ALERTMANAGER_TOKEN:-orchestration-agent}"
  prometheus:
    type: "http"
    url: "http://monitoring-mcp.service.dev-us-east-1.consul:8000/v1/mcp"
    headers:
      Authorization: "Bearer ${PROMETHEUS_TOKEN:-orchestration-agent}"
```

All three use the same fake token `orchestration-agent`, but only prometheus fails.

### 4. The Toolbelt TODO Connection

Found a related TODO in [toolbelt/pkg/toolbelt/middleware/toolbeltauth/toolbeltauth.go#L58-L63](https://slack-github.com/slack/toolbelt/blob/main/pkg/toolbelt/middleware/toolbeltauth/toolbeltauth.go#L58-L63):

```go
token, err := jwt.ParseWithClaims(
	bearer,
	&toolbeltjwt.AccessTokenClaims{},
	k.Keyfunc,
	jwt.WithIssuer(self.AuthorizationServers[0]),
	// TODO: I forgot that mcp-go doesn't yet properly implement
	// RFC8707 so the correct audience is not yet being sent and not
	// being incorporated into the token, causing the integration
	// test suite to fail.
	// jwt.WithAudience(selfURL),
	jwt.WithValidMethods([]string{"RS256"}),
)
```

This explains the issue:

- The MCP Go client doesn't properly implement RFC8707 (OAuth 2.0 Resource Indicators)
- The correct audience isn't being sent in JWT tokens
- As a workaround, audience validation is disabled in toolbelt

Both MCP servers have disabled audience validation to work around this:

- **monitoring-ai-mcp:** Completely skips JWT validation
- **monitoring-mcp:** Validates JWT but disables audience check ([auth.py#L57](https://slack-github.com/slack/monitoring-mcp/blob/main/src/shared/auth.py#L57))

### 5. Why monitoring-ai-mcp Works

The monitoring-ai-mcp server has JWT validation configured, but it is **completely disabled** as a workaround:

**Config:** [monitoring-ai/mcp/gondola.yaml#L82](https://slack-github.com/slack/monitoring-ai/blob/main/mcp/gondola.yaml#L82)

```yaml
MCP_JWT_AUDIENCE: "https://monitoring-ai-mcp-dev.tinyspeck.com/monitoring/v1/mcp"
```

**Implementation:** Despite having `MCP_JWT_AUDIENCE` set, the server accepts any token without validation. This is a temporary workaround documented in the code comments.

### 6. Why monitoring-mcp Doesn't Work

The monitoring-mcp server has proper JWT validation enabled:

**Config:** [monitoring-mcp/gondola.yaml#L63](https://slack-github.com/slack/monitoring-mcp/blob/main/gondola.yaml#L63)

```yaml
MCP_JWT_AUDIENCE: "https://monitoring-mcp.tinyspeck.com/v1/mcp"
```

**Implementation:** Actually validates JWT tokens (signature, issuer, expiration) and rejects invalid tokens.

**Development mode option:** The server CAN skip validation if `MCP_JWT_AUDIENCE` is unset ([auth.py#L34-L36](https://slack-github.com/slack/monitoring-mcp/blob/main/src/shared/auth.py#L34-L36)):

```python
# Skip JWT validation entirely if no audience is configured (development)
if not self.expected_audience:
    logging.info("Skipping JWT validation - development mode (no MCP_JWT_AUDIENCE configured)")
    return await call_next(request)
```

## Proposed Solutions

### Option 1: Disable JWT Validation in monitoring-mcp (Recommended for Dev)

Match the monitoring-ai-mcp workaround approach for development environments.
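
The effect of this option follows from the development-mode branch quoted above: with `MCP_JWT_AUDIENCE` unset, the middleware returns before any token check. Below is a minimal, stdlib-only sketch of that decision, not the actual `auth.py` code. The names `auth_mode`, `looks_like_jwt`, and `outcome` are hypothetical, and the structural check is only a stand-in for the real signature, issuer, and expiry validation, which needs PyJWT and the JWKS endpoint.

```python
import base64
import json


def auth_mode(env):
    """Return "skip" when MCP_JWT_AUDIENCE is unset or empty, else "validate"."""
    return "validate" if env.get("MCP_JWT_AUDIENCE") else "skip"


def looks_like_jwt(token):
    """Structural check only: three dot-separated parts with a JSON object header."""
    parts = token.split(".")
    if len(parts) != 3:
        return False
    try:
        padded = parts[0] + "=" * (-len(parts[0]) % 4)
        header = json.loads(base64.urlsafe_b64decode(padded))
    except ValueError:
        return False
    return isinstance(header, dict)


def outcome(token, env):
    """Model the middleware result for a bearer token (simplified assumption)."""
    if auth_mode(env) == "skip":
        return "accepted"  # development mode: no token check at all
    if not looks_like_jwt(token):
        return "rejected"  # fails before any signature check could run
    return "needs full validation"  # signature, issuer, expiry (not modelled)


dev_env = {"MCP_JWT_AUDIENCE": "https://monitoring-mcp.tinyspeck.com/v1/mcp"}
print(outcome("orchestration-agent", dev_env))  # current config
print(outcome("orchestration-agent", {}))       # Option 1 applied
```

Under these assumptions, the placeholder token is rejected while the audience is configured and accepted once it is removed, which matches the behavior observed in the `curl` test.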

**Change:** Update [monitoring-mcp/gondola.yaml#L60-L63](https://slack-github.com/slack/monitoring-mcp/blob/main/gondola.yaml#L60-L63)

```yaml
environment:
  ENVIRONMENT: "development"
  PROMETHEUS_BASE_URL: "http://trickster-dev.internal.ec2.tinyspeck.com:9090"
  # MCP_JWT_AUDIENCE: "https://monitoring-mcp.tinyspeck.com/v1/mcp"  # Disabled for dev - no auth required
```

**Pros:**

- Simple one-line change
- Matches monitoring-ai-mcp behavior
- Works until proper OAuth is implemented

**Cons:**

- No authentication in dev environment
- Need to remember to re-enable for production

### Option 2: Update monitoring-mcp to Skip Validation Like monitoring-ai-mcp

Keep `MCP_JWT_AUDIENCE` set but skip validation in code.

**Change:** Update [monitoring-mcp/src/shared/auth.py#L46-L87](https://slack-github.com/slack/monitoring-mcp/blob/main/src/shared/auth.py#L46-L87) to match the monitoring-ai-mcp implementation:

```python
try:
    # Log token details for debugging (no validation)
    # TODO: Pending MCP spec fix for RFC8707 audience handling
    # https://slack-github.com/slack/toolbelt/blob/main/pkg/toolbelt/middleware/toolbeltauth/toolbeltauth.go#L58-L63
    unverified_payload = jwt.decode(token, options={"verify_signature": False})
    logging.info(f"JWT Debug - Allowing request for user: {unverified_payload.get('sub')} (validation skipped)")
    request.state.jwt_payload = unverified_payload
    request.state.user = unverified_payload.get('sub')
except Exception as e:
    logging.warning(f"JWT decode error: {e}")
    # Still allow the request even if decoding fails
```

**Pros:**

- Consistent behavior across both MCP servers
- Documents the workaround clearly
- Can still extract user info from token

**Cons:**

- Security-wise, accepting any token isn't ideal
- Need to update code in two places when the MCP spec is fixed

### Option 3: Remove Authorization Header from agent_config.yaml

Remove the Authorization header for prometheus and let it fail auth, forcing monitoring-mcp to be updated.
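
As context for what this option removes: the `:-` syntax in the header value suggests a shell-style default, so with `PROMETHEUS_TOKEN` unset the worker would send the literal string `orchestration-agent`. The sketch below illustrates that fallback. It assumes the config loader applies shell-style `${VAR:-default}` expansion, which this investigation did not confirm; `expand` is a hypothetical helper, not part of the worker, and `real-token` is a placeholder value.

```python
import re

# Matches ${VAR} and ${VAR:-default}
_PATTERN = re.compile(r"\$\{(\w+)(?::-([^}]*))?\}")


def expand(value, env):
    """Expand ${VAR} and ${VAR:-default}, treating empty values as unset."""
    def substitute(match):
        name, default = match.group(1), match.group(2)
        return env.get(name) or (default or "")
    return _PATTERN.sub(substitute, value)


header = "Bearer ${PROMETHEUS_TOKEN:-orchestration-agent}"
print(expand(header, {}))                                  # variable unset
print(expand(header, {"PROMETHEUS_TOKEN": "real-token"}))  # variable set
```

If the loader behaves this way, the first call yields the placeholder bearer value that monitoring-mcp rejects, and the second shows that supplying a real token through the environment would need no config change.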

**Change:** Update [monitoring-ai-agent/monitoring-ai-slackbot/temporal/claude-worker/agent_config.yaml#L42-L46](https://slack-github.com/slack/monitoring-ai-agent/blob/main/monitoring-ai-slackbot/temporal/claude-worker/agent_config.yaml#L42-L46):

```yaml
prometheus:
  type: "http"
  url: "http://monitoring-mcp.service.dev-us-east-1.consul:8000/v1/mcp"
  # headers:
  #   Authorization: "Bearer ${PROMETHEUS_TOKEN:-orchestration-agent}"
```

**Pros:**

- Forces the monitoring-mcp team to decide on auth approach
- Makes it clear auth isn't working
- Removes fake token that looks real but isn't

**Cons:**

- Doesn't solve the problem immediately
- Still requires monitoring-mcp changes

## Recommendation

**For immediate fix:** Use **Option 1** (disable JWT validation in monitoring-mcp dev environment)

**For long-term consistency:** Use **Option 2** (match monitoring-ai-mcp's workaround approach)

**For production:** Wait for proper MCP OAuth spec implementation and Claude Code SDK support, then re-enable full JWT validation with real tokens in both services.

## Related Documentation

- [Claude Code MCP Authentication Docs](https://docs.claude.com/en/docs/claude-code/mcp)
- [RFC 8707: OAuth 2.0 Resource Indicators](https://datatracker.ietf.org/doc/html/rfc8707)
- [Toolbelt Authentication Middleware](https://slack-github.com/slack/toolbelt/tree/main/pkg/toolbelt/middleware/toolbeltauth)

## Additional Context

The PR that added nebula firewall rules also added documentation to agent_config.yaml about the requirements: [monitoring-ai-agent PR #26](https://slack-github.com/slack/monitoring-ai-agent/pull/26)

```yaml
# IMPORTANT: When adding new MCP servers, you must also:
# 1. Update settings.json to allow the MCP tools (e.g., mcp__servername__toolname)
# 2. Update gondola.yaml nebula outbound firewall rules to allow connections to the service's nebula groups
#    - Find the service's nebula groups: kubectl get pods -l app= -o yaml | grep "slack.com/nebula.groups"
#    - Add outbound rule with those groups to gondola.yaml under nebula.firewall.outbound
```
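
The group check described in that comment can be sketched in a few lines. This assumes Nebula's documented behavior that a rule's `groups` list is ANDed, so the remote host must carry every listed group. `rule_matches` is a hypothetical helper, and `new_rule` is a guess at the shape of the rule added in PR #26, not a copy of it.

```python
def rule_matches(rule_groups, remote_groups):
    """True when the remote host carries every group the rule lists (AND semantics)."""
    return set(rule_groups) <= set(remote_groups)


# Groups reported for the monitoring-mcp service in this investigation
monitoring_mcp_groups = ["monitoring-mcp", "vault:monitoring-mcp-dev", "dev", "us-east-1"]

old_rule = ["dev", "monitoring-ai-mcp"]  # original outbound rule
new_rule = ["dev", "monitoring-mcp"]     # assumed shape of the fixed rule

print(rule_matches(old_rule, monitoring_mcp_groups))
print(rule_matches(new_rule, monitoring_mcp_groups))
```

Under that assumption the original rule cannot match monitoring-mcp, because the service has no `monitoring-ai-mcp` group, which is consistent with the connectivity failure seen before the firewall fix.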