Created
October 1, 2025 18:03
nullren created this gist
Oct 1, 2025
# MCP Authentication Investigation: Why prometheus MCP tools aren't working

## Problem Summary

The prometheus MCP server (`monitoring-mcp`) is failing to connect from the claude-worker, while grafana and alertmanager MCP servers (`monitoring-ai-mcp`) are working fine.

The error seen in logs:

```
'mcp_servers': [
    {'name': 'alertmanager', 'status': 'connected'},
    {'name': 'grafana', 'status': 'connected'},
    {'name': 'prometheus', 'status': 'failed'}
]
```

## Root Cause

The issue is a **JWT authentication mismatch** between how the two MCP servers handle authentication:

### monitoring-ai-mcp (Working ✅)

**Location:** [monitoring-ai/mcp/unified_mcp_server.py#L76-L94](https://slack-github.com/slack/monitoring-ai/blob/main/mcp/unified_mcp_server.py#L76-L94)

```python
# Log token details for debugging (no validation)
# todo - pending change to toolbelt
unverified_payload = jwt.decode(token, options={"verify_signature": False})
# ...
# Still allow the request even if decoding fails
```

**Key behavior:** Accepts ANY token (even the fake `orchestration-agent` string) without validation. This is a workaround for the MCP JWT spec issues.

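To make the "accepts ANY token" behavior concrete, here is a minimal standard-library sketch. It is not code from either repository, and the helper names `decode_unverified` and `permissive_check` are hypothetical. It illustrates that `orchestration-agent` is not even structurally a JWT (it has no dot-separated segments), so an unverified decode fails and only the "still allow the request" fallback lets it through.

```python
import base64
import json


def decode_unverified(token: str) -> dict:
    """Return the JWT payload without checking the signature.

    Raises ValueError if the token is not three dot-separated segments
    or the payload is not a base64url-encoded JSON object.
    """
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("not enough segments")
    payload_b64 = parts[1]
    # base64url segments in JWTs omit padding; restore it before decoding.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    payload = json.loads(base64.urlsafe_b64decode(padded))
    if not isinstance(payload, dict):
        raise ValueError("payload is not a JSON object")
    return payload


def permissive_check(token: str) -> bool:
    """Mimic a 'log, then allow anyway' middleware (hypothetical sketch)."""
    try:
        payload = decode_unverified(token)
        print(f"allowing request for sub={payload.get('sub')}")
    except ValueError as exc:
        # Decoding failed, but the request is still allowed.
        print(f"decode error ({exc}); allowing request anyway")
    return True


permissive_check("orchestration-agent")
```

A server that performs real validation has no such fallback, which is why the same placeholder string is rejected there.
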
### monitoring-mcp (Failing ❌)

**Location:** [monitoring-mcp/src/shared/auth.py#L46-L87](https://slack-github.com/slack/monitoring-mcp/blob/main/src/shared/auth.py#L46-L87)

```python
try:
    # Get signing key using PyJWKClient with built-in caching
    signing_key = self.jwks_client.get_signing_key_from_jwt(token)

    # Decode and validate JWT token
    payload = jwt.decode(
        token,
        signing_key.key,
        algorithms=["RS256"],
        issuer=self.expected_issuer,
        options={
            "verify_signature": True,
            "verify_aud": False,  # Disabled per toolbelt reference
            "verify_iss": True,
            "verify_exp": True
        }
    )
```

**Key behavior:** Actually validates JWT tokens (signature, issuer, expiration), so it rejects the fake `orchestration-agent` token.

## Investigation Details

### 1. Initial Discovery: Nebula Firewall Issue

The first issue was network connectivity: the pod couldn't reach monitoring-mcp because of missing nebula firewall rules.

**Pod's nebula groups:**

```json
["monitoring-ai-slackbot-temporal-claude-worker", "vault:...", "dev", "us-east-1", "az-b"]
```

**Pod's outbound firewall rule:**

```yaml
- port: "8000"
  proto: tcp
  groups: [dev, monitoring-ai-mcp]  # WRONG - monitoring-mcp service has different groups
```

**monitoring-mcp's actual nebula groups:**

```json
["monitoring-mcp", "vault:monitoring-mcp-dev", "dev", "us-east-1"]
```

**Fix Applied:** [PR #26](https://slack-github.com/slack/monitoring-ai-agent/pull/26)
- Added outbound rule for `monitoring-mcp` group
- [gondola.yaml changes](https://slack-github.com/slack/monitoring-ai-agent/pull/26/files#diff-monitoring-ai-slackbot/gondola.yaml)

### 2. After Network Fix: Authentication Failure

Even after network connectivity was established, the MCP server connection still failed.
Testing showed:

```bash
# Network connectivity works
$ curl http://monitoring-mcp.service.dev-us-east-1.consul:8000
{"service":"Monitoring AI MCP Server", "tools":["hello_world","echo","server_info","prometheus_query","get_rules","list_metrics"]}

# But authentication fails
$ curl -X POST http://monitoring-mcp.service.dev-us-east-1.consul:8000/v1/mcp \
  -H "Authorization: Bearer orchestration-agent"
{"error":"invalid_token","error_description":"The access token is malformed or invalid"}
```

### 3. Configuration Analysis

**claude-worker agent_config.yaml:**

```yaml
mcp_servers:
  grafana:
    type: "http"
    url: "${MONITORING_AI_MCP_URL}/monitoring/v1/mcp?toolset=grafana"
    headers:
      Authorization: "Bearer ${GRAFANA_TOKEN:-orchestration-agent}"
  alertmanager:
    type: "http"
    url: "${MONITORING_AI_MCP_URL}/monitoring/v1/mcp?toolset=alertmanager"
    headers:
      Authorization: "Bearer ${ALERTMANAGER_TOKEN:-orchestration-agent}"
  prometheus:
    type: "http"
    url: "http://monitoring-mcp.service.dev-us-east-1.consul:8000/v1/mcp"
    headers:
      Authorization: "Bearer ${PROMETHEUS_TOKEN:-orchestration-agent}"
```

All three use the same fake token `orchestration-agent`, but only prometheus fails.

### 4. The Toolbelt TODO Connection

Found a related TODO in [toolbelt/pkg/toolbelt/middleware/toolbeltauth/toolbeltauth.go#L58-L63](https://slack-github.com/slack/toolbelt/blob/main/pkg/toolbelt/middleware/toolbeltauth/toolbeltauth.go#L58-L63):

```go
token, err := jwt.ParseWithClaims(
	bearer,
	&toolbeltjwt.AccessTokenClaims{},
	k.Keyfunc,
	jwt.WithIssuer(self.AuthorizationServers[0]),
	// TODO: I forgot that mcp-go doesn't yet properly implement
	// RFC8707 so the correct audience is not yet being sent and not
	// being incorporated into the token, causing the integration
	// test suite to fail.
	// jwt.WithAudience(selfURL),
	jwt.WithValidMethods([]string{"RS256"}),
)
```

This explains the issue:
- The MCP Go client doesn't properly implement RFC8707 (OAuth 2.0 Resource Indicators)
- The correct audience isn't being sent in JWT tokens
- As a workaround, audience validation is disabled in toolbelt

Both MCP servers have disabled audience validation to work around this:
- **monitoring-ai-mcp:** Completely skips JWT validation
- **monitoring-mcp:** Validates JWT but disables audience check ([auth.py#L57](https://slack-github.com/slack/monitoring-mcp/blob/main/src/shared/auth.py#L57))

### 5. Why monitoring-ai-mcp Works

The monitoring-ai-mcp server has JWT validation code, but it is **completely disabled** as a workaround:

**Config:** [monitoring-ai/mcp/gondola.yaml#L82](https://slack-github.com/slack/monitoring-ai/blob/main/mcp/gondola.yaml#L82)

```yaml
MCP_JWT_AUDIENCE: "https://monitoring-ai-mcp-dev.tinyspeck.com/monitoring/v1/mcp"
```

**Implementation:** Despite having `MCP_JWT_AUDIENCE` set, the server accepts any token without validation. This is a temporary workaround documented in the code comments.

### 6. Why monitoring-mcp Doesn't Work

The monitoring-mcp server has proper JWT validation enabled:

**Config:** [monitoring-mcp/gondola.yaml#L63](https://slack-github.com/slack/monitoring-mcp/blob/main/gondola.yaml#L63)

```yaml
MCP_JWT_AUDIENCE: "https://monitoring-mcp.tinyspeck.com/v1/mcp"
```

**Implementation:** Actually validates JWT tokens (signature, issuer, expiration) and rejects invalid tokens.
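
For context on the audience check that both servers currently leave disabled: the `aud` claim of a JWT may be a single string or a list of strings (RFC 7519), and the check passes only when the expected value is present. The sketch below is illustrative and not taken from either repository. The helper name `audience_matches` and the sample payloads are assumptions. It shows why a token issued without the resource's audience would be rejected if the check were turned on.

```python
from typing import Any, Mapping


def audience_matches(payload: Mapping[str, Any], expected: str) -> bool:
    """Return True if the expected audience appears in the token's `aud` claim."""
    aud = payload.get("aud")
    if isinstance(aud, str):
        return aud == expected
    if isinstance(aud, (list, tuple)):
        return expected in aud
    # Missing or malformed `aud` never matches.
    return False


EXPECTED = "https://monitoring-mcp.tinyspeck.com/v1/mcp"

# Hypothetical token payload without an audience, as when the client
# does not send a resource indicator.
print(audience_matches({"sub": "agent"}, EXPECTED))  # False

# Hypothetical payload that carries the expected audience.
print(audience_matches({"sub": "agent", "aud": [EXPECTED]}, EXPECTED))  # True
```
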

**Development mode option:** The server CAN skip validation if `MCP_JWT_AUDIENCE` is unset ([auth.py#L34-L36](https://slack-github.com/slack/monitoring-mcp/blob/main/src/shared/auth.py#L34-L36)):

```python
# Skip JWT validation entirely if no audience is configured (development)
if not self.expected_audience:
    logging.info("Skipping JWT validation - development mode (no MCP_JWT_AUDIENCE configured)")
    return await call_next(request)
```

## Proposed Solutions

### Option 1: Disable JWT Validation in monitoring-mcp (Recommended for Dev)

Match the monitoring-ai-mcp workaround approach for development environments.

**Change:** Update [monitoring-mcp/gondola.yaml#L60-L63](https://slack-github.com/slack/monitoring-mcp/blob/main/gondola.yaml#L60-L63)

```yaml
environment:
  ENVIRONMENT: "development"
  PROMETHEUS_BASE_URL: "http://trickster-dev.internal.ec2.tinyspeck.com:9090"
  # MCP_JWT_AUDIENCE: "https://monitoring-mcp.tinyspeck.com/v1/mcp"  # Disabled for dev - no auth required
```

**Pros:**
- Simple one-line change
- Matches monitoring-ai-mcp behavior
- Works until proper OAuth is implemented

**Cons:**
- No authentication in dev environment
- Need to remember to re-enable for production

### Option 2: Update monitoring-mcp to Skip Validation Like monitoring-ai-mcp

Keep `MCP_JWT_AUDIENCE` set but skip validation in code.

**Change:** Update [monitoring-mcp/src/shared/auth.py#L46-L87](https://slack-github.com/slack/monitoring-mcp/blob/main/src/shared/auth.py#L46-L87) to match the monitoring-ai-mcp implementation:

```python
try:
    # Log token details for debugging (no validation)
    # TODO: Pending MCP spec fix for RFC8707 audience handling
    # https://slack-github.com/slack/toolbelt/blob/main/pkg/toolbelt/middleware/toolbeltauth/toolbeltauth.go#L58-L63
    unverified_payload = jwt.decode(token, options={"verify_signature": False})
    logging.info(f"JWT Debug - Allowing request for user: {unverified_payload.get('sub')} (validation skipped)")
    request.state.jwt_payload = unverified_payload
    request.state.user = unverified_payload.get('sub')
except Exception as e:
    logging.warning(f"JWT decode error: {e}")
    # Still allow the request even if decoding fails
```

**Pros:**
- Consistent behavior across both MCP servers
- Documents the workaround clearly
- Can still extract user info from token

**Cons:**
- Security-wise, accepting any token isn't ideal
- Need to update code in two places when the MCP spec is fixed

### Option 3: Remove Authorization Header from agent_config.yaml

Remove the Authorization header for prometheus and let it fail auth, forcing monitoring-mcp to be updated.

**Change:** Update [monitoring-ai-agent/monitoring-ai-slackbot/temporal/claude-worker/agent_config.yaml#L42-L46](https://slack-github.com/slack/monitoring-ai-agent/blob/main/monitoring-ai-slackbot/temporal/claude-worker/agent_config.yaml#L42-L46):

```yaml
prometheus:
  type: "http"
  url: "http://monitoring-mcp.service.dev-us-east-1.consul:8000/v1/mcp"
  # headers:
  #   Authorization: "Bearer ${PROMETHEUS_TOKEN:-orchestration-agent}"
```

**Pros:**
- Forces the monitoring-mcp team to decide on auth approach
- Makes it clear auth isn't working
- Removes fake token that looks real but isn't

**Cons:**
- Doesn't solve the problem immediately
- Still requires monitoring-mcp changes

## Recommendation

**For immediate fix:** Use **Option 1** (disable JWT validation in monitoring-mcp dev environment)

**For long-term consistency:** Use **Option 2** (match monitoring-ai-mcp's workaround approach)

**For production:** Wait for proper MCP OAuth spec implementation and Claude Code SDK support, then re-enable full JWT validation with real tokens in both services.

## Related Documentation

- [Claude Code MCP Authentication Docs](https://docs.claude.com/en/docs/claude-code/mcp)
- [RFC 8707: OAuth 2.0 Resource Indicators](https://datatracker.ietf.org/doc/html/rfc8707)
- [Toolbelt Authentication Middleware](https://slack-github.com/slack/toolbelt/tree/main/pkg/toolbelt/middleware/toolbeltauth)

## Additional Context

The PR that added nebula firewall rules also added documentation to agent_config.yaml about the requirements: [monitoring-ai-agent PR #26](https://slack-github.com/slack/monitoring-ai-agent/pull/26)

```yaml
# IMPORTANT: When adding new MCP servers, you must also:
# 1. Update settings.json to allow the MCP tools (e.g., mcp__servername__toolname)
# 2. Update gondola.yaml nebula outbound firewall rules to allow connections to the service's nebula groups
#    - Find the service's nebula groups: kubectl get pods -l app=<service> -o yaml | grep "slack.com/nebula.groups"
#    - Add outbound rule with those groups to gondola.yaml under nebula.firewall.outbound
```
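
The group mismatch found during the firewall investigation can be checked before deploying. The sketch below is a minimal illustration, assuming Nebula's behavior that a rule's `groups` list matches only peers holding every listed group. The function name `rule_allows` is hypothetical, and the shape of the new rule is an assumption based on the PR description rather than the actual gondola.yaml contents.

```python
def rule_allows(rule_groups: list[str], peer_groups: list[str]) -> bool:
    """True if the peer holds every group the rule requires (AND semantics)."""
    return set(rule_groups) <= set(peer_groups)


# Groups reported for the monitoring-mcp service in this investigation.
monitoring_mcp_groups = ["monitoring-mcp", "vault:monitoring-mcp-dev", "dev", "us-east-1"]

old_rule = ["dev", "monitoring-ai-mcp"]  # the original outbound rule
new_rule = ["monitoring-mcp"]            # assumed shape of the rule added in PR #26

print(rule_allows(old_rule, monitoring_mcp_groups))  # False
print(rule_allows(new_rule, monitoring_mcp_groups))  # True
```

The old rule fails because monitoring-mcp does not carry the `monitoring-ai-mcp` group, even though it does carry `dev`.
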