The prometheus MCP server (monitoring-mcp) is failing to connect from the claude-worker, while grafana and alertmanager MCP servers (monitoring-ai-mcp) are working fine. The error seen in logs:
    'mcp_servers': [
        {'name': 'alertmanager', 'status': 'connected'},
        {'name': 'grafana', 'status': 'connected'},
        {'name': 'prometheus', 'status': 'failed'}
    ]
The root cause is a mismatch in how the two MCP servers handle JWT authentication:
Location: monitoring-ai/mcp/unified_mcp_server.py#L76-L94
    # Log token details for debugging (no validation)
    # todo - pending change to toolbelt
    unverified_payload = jwt.decode(token, options={"verify_signature": False})
    # ...
    # Still allow the request even if decoding fails
Key behavior: Accepts ANY token (even the fake orchestration-agent string) without validation. This is a workaround for the MCP JWT spec issues.
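The "log but never reject" behavior can be sketched with the standard library alone. This is a minimal illustration, not the code in unified_mcp_server.py: the helper names `peek_unverified_payload` and `lenient_auth` are hypothetical, and the payload is decoded by hand rather than with PyJWT. It shows that the string orchestration-agent is not even structurally a JWT, yet the request is still allowed.

```python
import base64
import json
import logging
from typing import Optional


def peek_unverified_payload(token: str) -> Optional[dict]:
    """Return the JWT payload without verifying anything, or None if the
    string is not structurally a JWT (hypothetical helper)."""
    parts = token.split(".")
    if len(parts) != 3:
        return None
    try:
        # JWT segments are base64url without padding; restore it first.
        padded = parts[1] + "=" * (-len(parts[1]) % 4)
        payload = json.loads(base64.urlsafe_b64decode(padded))
    except ValueError:
        # Covers bad base64, invalid UTF-8 and invalid JSON.
        return None
    return payload if isinstance(payload, dict) else None


def lenient_auth(token: str) -> bool:
    """Sketch of the workaround: log what can be decoded, never reject."""
    payload = peek_unverified_payload(token)
    if payload is None:
        logging.warning("JWT decode error: token is not a well-formed JWT")
    else:
        logging.info("JWT Debug - sub=%s (validation skipped)", payload.get("sub"))
    return True  # the request is allowed either way
```

Under these assumptions, `lenient_auth("orchestration-agent")` logs a warning and still returns `True`, which is consistent with grafana and alertmanager connecting with the fake token.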
Location: monitoring-mcp/src/shared/auth.py#L46-L87
    try:
        # Get signing key using PyJWKClient with built-in caching
        signing_key = self.jwks_client.get_signing_key_from_jwt(token)
        # Decode and validate JWT token
        payload = jwt.decode(
            token,
            signing_key.key,
            algorithms=["RS256"],
            issuer=self.expected_issuer,
            options={
                "verify_signature": True,
                "verify_aud": False,  # Disabled per toolbelt reference
                "verify_iss": True,
                "verify_exp": True
            }
        )
Key behavior: Actually validates JWT tokens (signature, issuer, expiration). Rejects the fake orchestration-agent token.
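To make the enabled and disabled checks concrete, the claim checks can be sketched as below. This is an illustration only: `check_claims` is a hypothetical helper, signature verification is assumed to have happened already (the standard library cannot verify RS256), and unlike PyJWT's default this sketch treats a missing `exp` as invalid. The `aud` claim is deliberately ignored, mirroring `"verify_aud": False`.

```python
import time
from typing import Optional


def check_claims(payload: dict, expected_issuer: str,
                 now: Optional[float] = None) -> None:
    """Raise ValueError if the issuer or expiry is unacceptable.

    Assumes the signature was verified beforehand. The 'aud' claim is
    intentionally not inspected, matching the disabled audience check.
    """
    now = time.time() if now is None else now
    if payload.get("iss") != expected_issuer:
        raise ValueError("invalid issuer")
    exp = payload.get("exp")
    # Stricter than PyJWT's default: a missing 'exp' is rejected here.
    if not isinstance(exp, (int, float)) or exp <= now:
        raise ValueError("token expired or missing exp")
```

A token with any `aud` value passes as long as `iss` and `exp` are acceptable. The fake orchestration-agent string never reaches this stage, because it fails earlier when the token is parsed.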
The first issue was network connectivity: the pod couldn't reach monitoring-mcp because of missing nebula firewall rules.
Pod's nebula groups:
    ["monitoring-ai-slackbot-temporal-claude-worker", "vault:...", "dev", "us-east-1", "az-b"]
Pod's outbound firewall rule:
    - port: "8000"
      proto: tcp
      groups: [dev, monitoring-ai-mcp]  # WRONG - monitoring-mcp service has different groups
monitoring-mcp's actual nebula groups:
    ["monitoring-mcp", "vault:monitoring-mcp-dev", "dev", "us-east-1"]
Fix Applied: PR #26 - Added outbound rule for monitoring-mcp group
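The mismatch can be checked mechanically. The sketch below assumes that a nebula rule's `groups` list is AND-ed, meaning the peer must carry every listed group. The function name `rule_allows` and the shape of the fixed rule are assumptions for illustration; the actual rule added in PR #26 is not reproduced here.

```python
from typing import Iterable


def rule_allows(rule_groups: Iterable[str], peer_groups: Iterable[str]) -> bool:
    """True if the peer carries every group the rule lists (AND semantics,
    assumed here for nebula's `groups` field)."""
    return set(rule_groups) <= set(peer_groups)


# Groups of the monitoring-mcp service, as listed above.
monitoring_mcp_groups = ["monitoring-mcp", "vault:monitoring-mcp-dev", "dev", "us-east-1"]

# Original outbound rule: requires a group monitoring-mcp does not have.
old_rule = ["dev", "monitoring-ai-mcp"]

# Hypothetical shape of the added rule (assumption, not copied from the PR).
new_rule = ["monitoring-mcp"]

print(rule_allows(old_rule, monitoring_mcp_groups))
print(rule_allows(new_rule, monitoring_mcp_groups))
```

With these inputs the old rule does not match and the hypothetical new rule does, which matches the observed behavior before and after the fix.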
Even after network connectivity was established, the MCP server connection still failed. Testing showed:
    # Network connectivity works
    $ curl http://monitoring-mcp.service.dev-us-east-1.consul:8000
    {"service":"Monitoring AI MCP Server", "tools":["hello_world","echo","server_info","prometheus_query","get_rules","list_metrics"]}
    # But authentication fails
    $ curl -X POST http://monitoring-mcp.service.dev-us-east-1.consul:8000/v1/mcp \
        -H "Authorization: Bearer orchestration-agent"
    {"error":"invalid_token","error_description":"The access token is malformed or invalid"}
claude-worker agent_config.yaml:
    mcp_servers:
      grafana:
        type: "http"
        url: "${MONITORING_AI_MCP_URL}/monitoring/v1/mcp?toolset=grafana"
        headers:
          Authorization: "Bearer ${GRAFANA_TOKEN:-orchestration-agent}"
      alertmanager:
        type: "http"
        url: "${MONITORING_AI_MCP_URL}/monitoring/v1/mcp?toolset=alertmanager"
        headers:
          Authorization: "Bearer ${ALERTMANAGER_TOKEN:-orchestration-agent}"
      prometheus:
        type: "http"
        url: "http://monitoring-mcp.service.dev-us-east-1.consul:8000/v1/mcp"
        headers:
          Authorization: "Bearer ${PROMETHEUS_TOKEN:-orchestration-agent}"
All three use the same fake token orchestration-agent, but only prometheus fails.
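The `${VAR:-default}` placeholders explain why all three servers receive the same token. The sketch below assumes the config loader follows shell-style semantics, where the default applies when the variable is unset or empty. That is not confirmed for the actual loader, and `expand` is a hypothetical helper.

```python
import re
from typing import Mapping

# Matches ${NAME} and ${NAME:-default}.
_PLACEHOLDER = re.compile(
    r"\$\{(?P<name>[A-Za-z_][A-Za-z0-9_]*)(?::-(?P<default>[^}]*))?\}"
)


def expand(template: str, env: Mapping[str, str]) -> str:
    """Expand shell-style placeholders using `env` (assumed semantics)."""
    def _replace(match: "re.Match[str]") -> str:
        value = env.get(match.group("name"))
        if value:  # ':-' falls back when the variable is unset OR empty
            return value
        return match.group("default") or ""
    return _PLACEHOLDER.sub(_replace, template)


header = "Bearer ${PROMETHEUS_TOKEN:-orchestration-agent}"
print(expand(header, {}))                                # fallback token
print(expand(header, {"PROMETHEUS_TOKEN": "real-jwt"}))  # real token wins
```

If none of `GRAFANA_TOKEN`, `ALERTMANAGER_TOKEN`, or `PROMETHEUS_TOKEN` is set in the worker's environment, every header resolves to `Bearer orchestration-agent`.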
Found related TODO in toolbelt/pkg/toolbelt/middleware/toolbeltauth/toolbeltauth.go#L58-L63:
    token, err := jwt.ParseWithClaims(
        bearer,
        &toolbeltjwt.AccessTokenClaims{},
        k.Keyfunc,
        jwt.WithIssuer(self.AuthorizationServers[0]),
        // TODO: I forgot that mcp-go doesn't yet properly implement
        // RFC8707 so the correct audience is not yet being sent and not
        // being incorporated into the token, causing the integration
        // test suite to fail.
        // jwt.WithAudience(selfURL),
        jwt.WithValidMethods([]string{"RS256"}),
    )
This explains the issue:
- The MCP Go client doesn't properly implement RFC8707 (OAuth 2.0 Resource Indicators)
- The correct audience isn't being sent in JWT tokens
- As a workaround, audience validation is disabled in toolbelt
Both MCP servers have disabled audience validation to work around this:
- monitoring-ai-mcp: Completely skips JWT validation
- monitoring-mcp: Validates JWT but disables audience check (auth.py#L57)
The monitoring-ai-mcp server has JWT validation configured, but it is completely disabled as a workaround:
Config: monitoring-ai/mcp/gondola.yaml#L82
    MCP_JWT_AUDIENCE: "https://monitoring-ai-mcp-dev.tinyspeck.com/monitoring/v1/mcp"
Implementation: Despite having MCP_JWT_AUDIENCE set, the server accepts any token without validation. This is a temporary workaround documented in the code comments.
The monitoring-mcp server has proper JWT validation enabled:
Config: monitoring-mcp/gondola.yaml#L63
    MCP_JWT_AUDIENCE: "https://monitoring-mcp.tinyspeck.com/v1/mcp"
Implementation: Actually validates JWT tokens (signature, issuer, expiration) and rejects invalid tokens.
Development mode option: The server CAN skip validation if MCP_JWT_AUDIENCE is unset (auth.py#L34-L36):
    # Skip JWT validation entirely if no audience is configured (development)
    if not self.expected_audience:
        logging.info("Skipping JWT validation - development mode (no MCP_JWT_AUDIENCE configured)")
        return await call_next(request)
Option 1: Match the monitoring-ai-mcp workaround approach for development environments.
Change: Update monitoring-mcp/gondola.yaml#L60-L63
    environment:
      ENVIRONMENT: "development"
      PROMETHEUS_BASE_URL: "http://trickster-dev.internal.ec2.tinyspeck.com:9090"
      # MCP_JWT_AUDIENCE: "https://monitoring-mcp.tinyspeck.com/v1/mcp"  # Disabled for dev - no auth required
Pros:
- Simple one-line change
- Matches monitoring-ai-mcp behavior
- Works until proper OAuth is implemented
Cons:
- No authentication in dev environment
- Need to remember to re-enable for production
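One way to reduce the risk of forgetting to re-enable validation is a startup guard that refuses to run without an audience outside development. The sketch below is hypothetical: `auth_mode` is an invented helper, and the `"production"` value for `ENVIRONMENT` is an assumption (only `"development"` appears in the config above).

```python
from typing import Mapping


def auth_mode(env: Mapping[str, str]) -> str:
    """Return "validate" or "skip", failing fast on an unsafe combination.

    Assumes ENVIRONMENT is "production" in production; the source only
    shows the "development" value.
    """
    audience = env.get("MCP_JWT_AUDIENCE", "").strip()
    environment = env.get("ENVIRONMENT", "").strip().lower()
    if audience:
        return "validate"
    if environment == "production":
        raise RuntimeError("MCP_JWT_AUDIENCE must be set in production")
    return "skip"


print(auth_mode({"ENVIRONMENT": "development"}))
```

Calling this once at service start would turn a forgotten setting into a deployment failure instead of a silently unauthenticated endpoint.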
Option 2: Keep MCP_JWT_AUDIENCE set but skip validation in code.
Change: Update monitoring-mcp/src/shared/auth.py#L46-L87 to match the monitoring-ai-mcp implementation:
    try:
        # Log token details for debugging (no validation)
        # TODO: Pending MCP spec fix for RFC8707 audience handling
        # https://slack-github.com/slack/toolbelt/blob/main/pkg/toolbelt/middleware/toolbeltauth/toolbeltauth.go#L58-L63
        unverified_payload = jwt.decode(token, options={"verify_signature": False})
        logging.info(f"JWT Debug - Allowing request for user: {unverified_payload.get('sub')} (validation skipped)")
        request.state.jwt_payload = unverified_payload
        request.state.user = unverified_payload.get('sub')
    except Exception as e:
        logging.warning(f"JWT decode error: {e}")
        # Still allow the request even if decoding fails
Pros:
- Consistent behavior across both MCP servers
- Documents the workaround clearly
- Can still extract user info from token
Cons:
- Security-wise, accepting any token isn't ideal
- Need to update code in two places when the MCP spec is fixed
Option 3: Remove the Authorization header for prometheus and let authentication fail, forcing monitoring-mcp to be updated.
Change: Update monitoring-ai-agent/monitoring-ai-slackbot/temporal/claude-worker/agent_config.yaml#L42-L46:
    prometheus:
      type: "http"
      url: "http://monitoring-mcp.service.dev-us-east-1.consul:8000/v1/mcp"
      # headers:
      #   Authorization: "Bearer ${PROMETHEUS_TOKEN:-orchestration-agent}"
Pros:
- Forces the monitoring-mcp team to decide on auth approach
- Makes it clear auth isn't working
- Removes fake token that looks real but isn't
Cons:
- Doesn't solve the problem immediately
- Still requires monitoring-mcp changes
For immediate fix: Use Option 1 (disable JWT validation in monitoring-mcp dev environment)
For long-term consistency: Use Option 2 (match monitoring-ai-mcp's workaround approach)
For production: Wait for proper MCP OAuth spec implementation and Claude Code SDK support, then re-enable full JWT validation with real tokens in both services.
- Claude Code MCP Authentication Docs
- RFC 8707: OAuth 2.0 Resource Indicators
- Toolbelt Authentication Middleware
The PR that added nebula firewall rules also added documentation to agent_config.yaml about the requirements: monitoring-ai-agent PR #26
# IMPORTANT: When adding new MCP servers, you must also:
# 1. Update settings.json to allow the MCP tools (e.g., mcp__servername__toolname)
# 2. Update gondola.yaml nebula outbound firewall rules to allow connections to the service's nebula groups
# - Find the service's nebula groups: kubectl get pods -l app=<service> -o yaml | grep "slack.com/nebula.groups"
# - Add outbound rule with those groups to gondola.yaml under nebula.firewall.outbound