@nullren
Created October 1, 2025 18:03

    # MCP Authentication Investigation: Why prometheus MCP tools aren't working

    ## Problem Summary

    The prometheus MCP server (`monitoring-mcp`) fails to connect from the claude-worker, while the grafana and alertmanager MCP servers (both served by `monitoring-ai-mcp`) connect fine. The connection status seen in the logs:

    ```
    'mcp_servers': [
        {'name': 'alertmanager', 'status': 'connected'},
        {'name': 'grafana', 'status': 'connected'},
        {'name': 'prometheus', 'status': 'failed'}
    ]
    ```
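A quick way to pull the failing servers out of a status payload like this one. This is a minimal sketch: it assumes the log fragment is a Python-literal dict (the outer braces are added here), and the `failed_servers` helper is hypothetical, not part of the worker.

```python
import ast


def failed_servers(log_fragment: str) -> list[str]:
    """Return the names of MCP servers whose status is not 'connected'."""
    status = ast.literal_eval(log_fragment)
    return [s["name"] for s in status["mcp_servers"] if s["status"] != "connected"]


# Same content as the log excerpt, wrapped in braces so it parses as a dict.
fragment = """{
    'mcp_servers': [
        {'name': 'alertmanager', 'status': 'connected'},
        {'name': 'grafana', 'status': 'connected'},
        {'name': 'prometheus', 'status': 'failed'}
    ]
}"""

print(failed_servers(fragment))  # ['prometheus']
```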

    ## Root Cause

    The root cause is a **JWT authentication mismatch**: the two MCP servers treat the bearer token differently.

    ### monitoring-ai-mcp (Working ✅)
    **Location:** [monitoring-ai/mcp/unified_mcp_server.py#L76-L94](https://slack-github.com/slack/monitoring-ai/blob/main/mcp/unified_mcp_server.py#L76-L94)

    ```python
    # Log token details for debugging (no validation)
    # todo - pending change to toolbelt
    unverified_payload = jwt.decode(token, options={"verify_signature": False})
    # ...
    # Still allow the request even if decoding fails
    ```

    **Key behavior:** Accepts ANY token (even the fake `orchestration-agent` string) without validating it; the decode is only used for logging. This is a workaround for the RFC 8707 audience problem described in "The Toolbelt TODO Connection" below.

    ### monitoring-mcp (Failing ❌)
    **Location:** [monitoring-mcp/src/shared/auth.py#L46-L87](https://slack-github.com/slack/monitoring-mcp/blob/main/src/shared/auth.py#L46-L87)

    ```python
    try:
        # Get signing key using PyJWKClient with built-in caching
        signing_key = self.jwks_client.get_signing_key_from_jwt(token)

        # Decode and validate JWT token
        payload = jwt.decode(
            token,
            signing_key.key,
            algorithms=["RS256"],
            issuer=self.expected_issuer,
            options={
                "verify_signature": True,
                "verify_aud": False,  # Disabled per toolbelt reference
                "verify_iss": True,
                "verify_exp": True
            }
        )
    # ... (exception handling omitted from this excerpt)
    ```

    **Key behavior:** Actually validates JWT tokens (signature, issuer, expiration). Rejects the fake `orchestration-agent` token.
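To make the rejection concrete: `orchestration-agent` is not even shaped like a JWT, so any validating server fails before it gets to the signature. The check below is a stdlib-only sketch of that structural test. It is not the code in `auth.py`, and the helper names are hypothetical.

```python
import base64
import json


def decode_segment(segment: str) -> object:
    """Decode one base64url-encoded JWT segment as JSON (raises ValueError on bad input)."""
    padded = segment + "=" * (-len(segment) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def is_structurally_valid_jwt(token: str) -> bool:
    """True if the token has three dot-separated parts with JSON-object header and payload."""
    parts = token.split(".")
    if len(parts) != 3:
        return False
    try:
        header = decode_segment(parts[0])
        payload = decode_segment(parts[1])
    except ValueError:  # covers base64, JSON, and UTF-8 decoding errors
        return False
    return isinstance(header, dict) and isinstance(payload, dict)


print(is_structurally_valid_jwt("orchestration-agent"))  # False
```

This also suggests why the monitoring-ai-mcp excerpt needs its "still allow the request even if decoding fails" path: decoding a string with no dot-separated segments would most likely raise even with signature verification turned off.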

    ## Investigation Details

    ### 1. Initial Discovery: Nebula Firewall Issue

    The first issue was network connectivity: the pod couldn't reach monitoring-mcp because a nebula firewall rule was missing.

    **Pod's nebula groups:**
    ```json
    ["monitoring-ai-slackbot-temporal-claude-worker", "vault:...", "dev", "us-east-1", "az-b"]
    ```

    **Pod's outbound firewall rule:**
    ```yaml
    - port: "8000"
      proto: tcp
      groups: [dev, monitoring-ai-mcp]  # WRONG - monitoring-mcp service has different groups
    ```

    **monitoring-mcp's actual nebula groups:**
    ```json
    ["monitoring-mcp", "vault:monitoring-mcp-dev", "dev", "us-east-1"]
    ```

    **Fix Applied:** [PR #26](https://slack-github.com/slack/monitoring-ai-agent/pull/26) - Added outbound rule for `monitoring-mcp` group
    - [gondola.yaml changes](https://slack-github.com/slack/monitoring-ai-agent/pull/26/files#diff-monitoring-ai-slackbot/gondola.yaml)
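The mismatch can be checked mechanically. The sketch below assumes Nebula ANDs the entries of a rule's `groups` list, so the peer must hold every listed group, which is how its firewall config is documented. `rule_matches` is a hypothetical helper, and the single-group rule is only an assumed shape for what PR #26 added.

```python
def rule_matches(rule_groups: list[str], peer_groups: list[str]) -> bool:
    """A rule's `groups` list matches only if the peer holds every listed group."""
    return set(rule_groups) <= set(peer_groups)


monitoring_mcp_groups = ["monitoring-mcp", "vault:monitoring-mcp-dev", "dev", "us-east-1"]

original_rule = ["dev", "monitoring-ai-mcp"]  # the outbound rule shown above
added_rule = ["monitoring-mcp"]               # assumed shape of the rule added by the fix

print(rule_matches(original_rule, monitoring_mcp_groups))  # False: no monitoring-ai-mcp group
print(rule_matches(added_rule, monitoring_mcp_groups))     # True
```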

    ### 2. After Network Fix: Authentication Failure

    Even after network connectivity was established, the MCP server connection still failed. Testing showed:

    ```bash
    # Network connectivity works
    $ curl http://monitoring-mcp.service.dev-us-east-1.consul:8000
    {"service":"Monitoring AI MCP Server", "tools":["hello_world","echo","server_info","prometheus_query","get_rules","list_metrics"]}

    # But authentication fails
    $ curl -X POST http://monitoring-mcp.service.dev-us-east-1.consul:8000/v1/mcp \
        -H "Authorization: Bearer orchestration-agent"
    {"error":"invalid_token","error_description":"The access token is malformed or invalid"}
    ```
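When scripting this kind of probe, it helps to separate "reachable" from "auth rejected" by looking at the response body. The classifier below is a sketch keyed only to the two bodies shown above. The function name and category labels are hypothetical, and other response shapes are reported as unknown rather than guessed at.

```python
import json


def classify_probe(body: str) -> str:
    """Classify a probe response body as 'reachable', 'auth_rejected', or 'unknown'."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return "unknown"
    if not isinstance(data, dict):
        return "unknown"
    if data.get("error") == "invalid_token":
        return "auth_rejected"
    if "service" in data:
        return "reachable"
    return "unknown"


root_body = '{"service":"Monitoring AI MCP Server", "tools":["hello_world","echo"]}'
mcp_body = '{"error":"invalid_token","error_description":"The access token is malformed or invalid"}'

print(classify_probe(root_body))  # reachable
print(classify_probe(mcp_body))   # auth_rejected
```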

    ### 3. Configuration Analysis

    **claude-worker agent_config.yaml:**
    ```yaml
    mcp_servers:
      grafana:
        type: "http"
        url: "${MONITORING_AI_MCP_URL}/monitoring/v1/mcp?toolset=grafana"
        headers:
          Authorization: "Bearer ${GRAFANA_TOKEN:-orchestration-agent}"

      alertmanager:
        type: "http"
        url: "${MONITORING_AI_MCP_URL}/monitoring/v1/mcp?toolset=alertmanager"
        headers:
          Authorization: "Bearer ${ALERTMANAGER_TOKEN:-orchestration-agent}"

      prometheus:
        type: "http"
        url: "http://monitoring-mcp.service.dev-us-east-1.consul:8000/v1/mcp"
        headers:
          Authorization: "Bearer ${PROMETHEUS_TOKEN:-orchestration-agent}"
    ```

    All three fall back to the same placeholder token `orchestration-agent` when their `*_TOKEN` variable is unset, but only prometheus fails.
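The fallback can be illustrated with a small expander. This sketch assumes the config loader follows shell-style `${VAR:-default}` semantics, where the default applies when the variable is unset or empty. The real loader's behavior was not verified, and `expand` is a hypothetical helper.

```python
import re

# Matches ${NAME} and ${NAME:-default}
_PLACEHOLDER = re.compile(r"\$\{(\w+)(?::-([^}]*))?\}")


def expand(value: str, env: dict[str, str]) -> str:
    """Expand ${VAR} and ${VAR:-default}; the default applies if VAR is unset or empty."""
    def replace(match: re.Match) -> str:
        name, default = match.group(1), match.group(2)
        return env.get(name) or (default if default is not None else "")
    return _PLACEHOLDER.sub(replace, value)


header = "Bearer ${PROMETHEUS_TOKEN:-orchestration-agent}"
print(expand(header, {}))                             # Bearer orchestration-agent
print(expand(header, {"PROMETHEUS_TOKEN": "abc123"}))  # Bearer abc123
```

With none of the token variables set, every server receives `Bearer orchestration-agent`, so the difference in outcome has to come from the servers themselves.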

    ### 4. The Toolbelt TODO Connection

    Found related TODO in [toolbelt/pkg/toolbelt/middleware/toolbeltauth/toolbeltauth.go#L58-L63](https://slack-github.com/slack/toolbelt/blob/main/pkg/toolbelt/middleware/toolbeltauth/toolbeltauth.go#L58-L63):

    ```go
    token, err := jwt.ParseWithClaims(
        bearer,
        &toolbeltjwt.AccessTokenClaims{},
        k.Keyfunc,
        jwt.WithIssuer(self.AuthorizationServers[0]),
        // TODO: I forgot that mcp-go doesn't yet properly implement
        // RFC8707 so the correct audience is not yet being sent and not
        // being incorporated into the token, causing the integration
        // test suite to fail.
        // jwt.WithAudience(selfURL),
        jwt.WithValidMethods([]string{"RS256"}),
    )
    ```

    This explains why audience checks are disabled:
    - mcp-go (the MCP Go client) doesn't yet properly implement RFC 8707 (OAuth 2.0 Resource Indicators)
    - As a result, the correct audience isn't sent and doesn't end up in the JWT
    - As a workaround, audience validation is disabled in toolbelt

    Both MCP servers have disabled audience validation to work around this:
    - **monitoring-ai-mcp:** Completely skips JWT validation
    - **monitoring-mcp:** Validates JWT but disables audience check ([auth.py#L57](https://slack-github.com/slack/monitoring-mcp/blob/main/src/shared/auth.py#L57))
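For reference, this is roughly what the disabled audience check would do. It is a stdlib-only sketch, not toolbelt's or PyJWT's implementation: the `aud` claim may be a string or a list, and the server's own URL must appear in it. The `audience_matches` helper and the sample claims are hypothetical.

```python
def audience_matches(claims: dict, expected_audience: str) -> bool:
    """Check the `aud` claim, which may be a single string or a list of strings."""
    aud = claims.get("aud")
    if aud is None:
        return False
    audiences = [aud] if isinstance(aud, str) else list(aud)
    return expected_audience in audiences


expected = "https://monitoring-mcp.tinyspeck.com/v1/mcp"

# A token issued without a resource indicator carries no matching audience.
print(audience_matches({"sub": "someone"}, expected))                    # False
print(audience_matches({"sub": "someone", "aud": [expected]}, expected))  # True
```

Until the client sends the resource indicator, otherwise valid tokens would fail this check, which is why both servers leave it off.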

    ### 5. Why monitoring-ai-mcp Works

    The monitoring-ai-mcp server has JWT validation code, but it is **completely disabled** as a workaround:

    **Config:** [monitoring-ai/mcp/gondola.yaml#L82](https://slack-github.com/slack/monitoring-ai/blob/main/mcp/gondola.yaml#L82)
    ```yaml
    MCP_JWT_AUDIENCE: "https://monitoring-ai-mcp-dev.tinyspeck.com/monitoring/v1/mcp"
    ```

    **Implementation:** Despite having `MCP_JWT_AUDIENCE` set, the server accepts any token without validation - it's a temporary workaround documented in the code comments.

    ### 6. Why monitoring-mcp Doesn't Work

    The monitoring-mcp server has proper JWT validation enabled:

    **Config:** [monitoring-mcp/gondola.yaml#L63](https://slack-github.com/slack/monitoring-mcp/blob/main/gondola.yaml#L63)
    ```yaml
    MCP_JWT_AUDIENCE: "https://monitoring-mcp.tinyspeck.com/v1/mcp"
    ```

    **Implementation:** Actually validates JWT tokens (signature, issuer, expiration) and rejects invalid tokens.

    **Development mode option:** The server CAN skip validation if `MCP_JWT_AUDIENCE` is unset ([auth.py#L34-L36](https://slack-github.com/slack/monitoring-mcp/blob/main/src/shared/auth.py#L34-L36)):
    ```python
    # Skip JWT validation entirely if no audience is configured (development)
    if not self.expected_audience:
        logging.info("Skipping JWT validation - development mode (no MCP_JWT_AUDIENCE configured)")
        return await call_next(request)
    ```
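The switch reduces to a truthiness test on the configured audience. The sketch below assumes `expected_audience` is read straight from the `MCP_JWT_AUDIENCE` environment variable, as the log message suggests; `validation_enabled` is a hypothetical helper.

```python
def validation_enabled(env: dict[str, str]) -> bool:
    """JWT validation runs only when MCP_JWT_AUDIENCE is set to a non-empty value."""
    return bool(env.get("MCP_JWT_AUDIENCE"))


current_dev_env = {"MCP_JWT_AUDIENCE": "https://monitoring-mcp.tinyspeck.com/v1/mcp"}
without_audience = {"ENVIRONMENT": "development"}

print(validation_enabled(current_dev_env))   # True: the placeholder token is rejected
print(validation_enabled(without_audience))  # False: every request is passed through
```

Note that an empty string counts as unset under `if not self.expected_audience`, so blanking the variable has the same effect as removing it.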

    ## Proposed Solutions

    ### Option 1: Disable JWT Validation in monitoring-mcp (Recommended for Dev)

    Match the monitoring-ai-mcp workaround approach for development environments.

    **Change:** Update [monitoring-mcp/gondola.yaml#L60-L63](https://slack-github.com/slack/monitoring-mcp/blob/main/gondola.yaml#L60-L63)
    ```yaml
    environment:
      ENVIRONMENT: "development"
      PROMETHEUS_BASE_URL: "http://trickster-dev.internal.ec2.tinyspeck.com:9090"
      # MCP_JWT_AUDIENCE: "https://monitoring-mcp.tinyspeck.com/v1/mcp"  # Disabled for dev - no auth required
    ```

    **Pros:**
    - Simple one-line change
    - Matches monitoring-ai-mcp behavior
    - Works until proper OAuth is implemented

    **Cons:**
    - No authentication in dev environment
    - Need to remember to re-enable for production
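One way to avoid relying on memory is a startup guard that refuses to run unauthenticated outside development. This is only a sketch: the `"production"` value for `ENVIRONMENT` is an assumption (only `"development"` appears in the config above), and `check_auth_config` is a hypothetical helper, not existing monitoring-mcp code.

```python
def check_auth_config(env: dict[str, str]) -> None:
    """Raise if JWT validation would be skipped in a production environment."""
    environment = env.get("ENVIRONMENT", "")
    if environment == "production" and not env.get("MCP_JWT_AUDIENCE"):
        raise RuntimeError(
            "MCP_JWT_AUDIENCE must be set in production; "
            "refusing to start with JWT validation disabled"
        )


# Dev without an audience is allowed (Option 1); production without one is not.
check_auth_config({"ENVIRONMENT": "development"})
try:
    check_auth_config({"ENVIRONMENT": "production"})
except RuntimeError as exc:
    print(f"startup blocked: {exc}")
```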

    ### Option 2: Update monitoring-mcp to Skip Validation Like monitoring-ai-mcp

    Keep `MCP_JWT_AUDIENCE` set but skip validation in code.

    **Change:** Update [monitoring-mcp/src/shared/auth.py#L46-L87](https://slack-github.com/slack/monitoring-mcp/blob/main/src/shared/auth.py#L46-L87) to match the monitoring-ai-mcp implementation:
    ```python
    try:
        # Log token details for debugging (no validation)
        # TODO: Pending MCP spec fix for RFC8707 audience handling
        # https://slack-github.com/slack/toolbelt/blob/main/pkg/toolbelt/middleware/toolbeltauth/toolbeltauth.go#L58-L63
        unverified_payload = jwt.decode(token, options={"verify_signature": False})
        logging.info(f"JWT Debug - Allowing request for user: {unverified_payload.get('sub')} (validation skipped)")
        request.state.jwt_payload = unverified_payload
        request.state.user = unverified_payload.get('sub')
    except Exception as e:
        logging.warning(f"JWT decode error: {e}")
        # Still allow the request even if decoding fails
    ```

    **Pros:**
    - Consistent behavior across both MCP servers
    - Documents the workaround clearly
    - Can still extract user info from token

    **Cons:**
    - Security-wise, accepting any token isn't ideal
    - Need to update code in two places when the MCP spec is fixed

    ### Option 3: Remove Authorization Header from agent_config.yaml

    Remove the Authorization header for prometheus and let it fail auth, forcing monitoring-mcp to be updated.

    **Change:** Update [monitoring-ai-agent/monitoring-ai-slackbot/temporal/claude-worker/agent_config.yaml#L42-L46](https://slack-github.com/slack/monitoring-ai-agent/blob/main/monitoring-ai-slackbot/temporal/claude-worker/agent_config.yaml#L42-L46):
    ```yaml
    prometheus:
      type: "http"
      url: "http://monitoring-mcp.service.dev-us-east-1.consul:8000/v1/mcp"
      # headers:
      #   Authorization: "Bearer ${PROMETHEUS_TOKEN:-orchestration-agent}"
    ```

    **Pros:**
    - Forces the monitoring-mcp team to decide on auth approach
    - Makes it clear auth isn't working
    - Removes fake token that looks real but isn't

    **Cons:**
    - Doesn't solve the problem immediately
    - Still requires monitoring-mcp changes

    ## Recommendation

    **For immediate fix:** Use **Option 1** (disable JWT validation in monitoring-mcp dev environment)

    **For long-term consistency:** Use **Option 2** (match monitoring-ai-mcp's workaround approach)

    **For production:** Wait for proper MCP OAuth spec implementation and Claude Code SDK support, then re-enable full JWT validation with real tokens in both services.

    ## Related Documentation

    - [Claude Code MCP Authentication Docs](https://docs.claude.com/en/docs/claude-code/mcp)
    - [RFC 8707: OAuth 2.0 Resource Indicators](https://datatracker.ietf.org/doc/html/rfc8707)
    - [Toolbelt Authentication Middleware](https://slack-github.com/slack/toolbelt/tree/main/pkg/toolbelt/middleware/toolbeltauth)

    ## Additional Context

    The PR that added nebula firewall rules also added documentation to agent_config.yaml about the requirements:
    [monitoring-ai-agent PR #26](https://slack-github.com/slack/monitoring-ai-agent/pull/26)

    ```yaml
    # IMPORTANT: When adding new MCP servers, you must also:
    # 1. Update settings.json to allow the MCP tools (e.g., mcp__servername__toolname)
    # 2. Update gondola.yaml nebula outbound firewall rules to allow connections to the service's nebula groups
    #    - Find the service's nebula groups: kubectl get pods -l app=<service> -o yaml | grep "slack.com/nebula.groups"
    #    - Add outbound rule with those groups to gondola.yaml under nebula.firewall.outbound
    ```
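For step 1, the tool names follow the `mcp__servername__toolname` pattern from the comment above. The sketch below generates them for the prometheus tools listed in the earlier curl output. It assumes the server name is the config key `prometheus`; `mcp_tool_permission` is a hypothetical helper, and where the entries go inside settings.json is not shown here.

```python
def mcp_tool_permission(server: str, tool: str) -> str:
    """Build a tool name in the mcp__servername__toolname form."""
    return f"mcp__{server}__{tool}"


# Prometheus-related tools advertised by monitoring-mcp in the curl output.
prometheus_tools = ["prometheus_query", "get_rules", "list_metrics"]

entries = [mcp_tool_permission("prometheus", tool) for tool in prometheus_tools]
print(entries)
```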