Best Practices and Troubleshooting¶
Integration with Logging and Tracing¶
Metrics, logs, and traces work together for comprehensive observability:
Correlating Metrics with Logs¶
- Correlation IDs: Gateway components include request IDs in both logs and metrics
- Find Logs from Metrics: Use metric labels to filter logs (e.g., API name, route)
- Find Metrics from Logs: Copy request IDs from logs and query metrics with labels
Example: If you see errors in logs for a specific API, query metrics for that API:
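One way to run such a query from the command line is through the Prometheus HTTP API. The sketch below assumes Prometheus is reachable from the host at http://localhost:9092 (the address used in the troubleshooting steps) and that the operations counter carries a label named api_name. That label name and the my-api value are hypothetical, so substitute the labels your deployment actually exposes.

```shell
# Assumption: Prometheus is published on localhost:9092.
PROM_URL="${PROM_URL:-http://localhost:9092}"

# Build a PromQL expression for the per-second operation rate of one API.
# "api_name" is a hypothetical label name used for illustration only.
api_rate_query() {
  printf 'sum(rate(gateway_controller_api_operations_total{api_name="%s"}[5m]))' "$1"
}

# Run an instant query against the Prometheus HTTP API and print the JSON reply.
prom_query() {
  curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}

# Usage:
#   prom_query "$(api_rate_query my-api)"
```

Keeping the expression builder separate from the HTTP call makes it easy to paste the same expression into the Prometheus UI or a Grafana panel.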
Correlating Metrics with Traces¶
Metrics and traces share labels for correlation:
- Trace IDs are included in log entries
- You can search traces by API name or route
- Span attributes include metric labels
Using All Three Stacks¶
Enabling all observability profiles provides:
- Metrics: Quantitative measurements and alerting
- Traces: Request flow and performance debugging
- Logs: Detailed event information and error context
Best Practices¶
Development¶
- Use the default scrape interval (15s) for reasonable granularity
- Keep short retention (7-15 days) to save disk space
- Enable debug logging for troubleshooting
- Use Grafana dashboards for real-time monitoring
Production¶
- Adjust scrape intervals based on traffic:
  - Low traffic (<100 req/s): 15s interval
  - Medium traffic (100-1000 req/s): 10s interval
  - High traffic (>1000 req/s): 5s interval
- Configure appropriate retention:
  - Short-term (hot): 7-30 days
  - Medium-term (warm): 90 days
  - Long-term (cold): 1+ years (use Thanos or remote write)
- Set up alerts for critical metrics:
  - Error rate > 5%
  - 95th percentile latency > 1s
  - Memory usage > 80%
  - Active streams approaching limit
- Use recording rules for frequently queried metrics
- Monitor Prometheus itself:
  - Scrape duration
  - Rule evaluation time
  - Storage usage
  - Query performance
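As an illustration of the alerting and recording-rule items above, the following sketch writes a small Prometheus rules file and validates it with promtool when that tool is installed. The file name gateway-rules.yml, the group name, and both rule names are hypothetical. The latency alert also assumes the duration metric is a histogram that exposes a gateway_controller_http_request_duration_seconds_bucket series, which this guide does not confirm.

```shell
# Hypothetical output path; override with RULES_FILE=... if needed.
RULES_FILE="${RULES_FILE:-gateway-rules.yml}"

cat > "$RULES_FILE" <<'EOF'
groups:
  - name: gateway-example
    rules:
      # Recording rule: precompute the 5m operation rate per job.
      - record: job:gateway_controller_api_operations:rate5m
        expr: sum by (job) (rate(gateway_controller_api_operations_total[5m]))
      # Alert: 95th percentile latency above 1s, sustained for 5 minutes.
      # Assumes a histogram with _bucket series exists for this metric.
      - alert: GatewayHighLatencyP95
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(gateway_controller_http_request_duration_seconds_bucket[5m]))
          ) > 1
        for: 5m
        labels:
          severity: warning
EOF

# Validate the syntax only when promtool is available on this machine.
if command -v promtool >/dev/null 2>&1; then
  promtool check rules "$RULES_FILE"
fi
```

The file still has to be listed under rule_files in the Prometheus configuration before the rules are evaluated.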
Security¶
- Restrict metrics endpoints in production
- Enable authentication for Grafana
- Use TLS for metrics endpoints (if exposed externally)
- Sanitize sensitive data from metrics
- Implement access controls for dashboards
- Regularly audit dashboard and alert permissions
Performance¶
- Optimize PromQL queries:
- Use rate() for counters over time ranges
- Use histogram_quantile() for percentiles
- Avoid high-cardinality labels (e.g., user IDs)
- Use recording rules for expensive queries
- Keep dashboard refresh intervals at 30s or longer
- Prune unused metrics to reduce cardinality
- Keep metric names concise to reduce index size
Metric Cardinality¶
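Avoid high-cardinality labels (millions of unique values), because each unique combination of label values creates a separate time series:
Good (low cardinality): labels with a small, bounded set of values, such as API name or route.
Bad (high cardinality): labels with per-user or per-session values, such as user IDs or session IDs.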
Query Optimization¶
Use time ranges:
# Bad: No time range (Prometheus returns the raw cumulative counter value)
gateway_controller_api_operations_total
# Good: Explicit rate over 5 minutes
rate(gateway_controller_api_operations_total[5m])
Use subqueries efficiently:
# Bad: rate() applied to a subquery over rate() (computes a rate of a rate)
rate(rate(gateway_controller_http_request_duration_seconds_sum[5m])[10m:1m])
# Good: Single rate call
rate(gateway_controller_http_request_duration_seconds_sum[5m])
Troubleshooting¶
Metrics Not Appearing in Grafana¶
1. Verify metrics are enabled in configuration: the metrics setting should read enabled = true.
2. Check Prometheus is running:
3. Verify Prometheus configuration:
4. Check Prometheus targets:
   - Navigate to http://localhost:9092/targets
   - Verify all endpoints are "UP" (green)
   - If endpoints are "DOWN", check:
     - Container is running
     - Port is accessible from Prometheus container
     - Metrics endpoint is responding
5. Test metrics endpoint directly:
curl http://localhost:9091/metrics | head -20 # Gateway Controller
curl http://localhost:9003/metrics | head -20 # Policy Engine
6. Check Grafana data source:
   - Navigate to http://localhost:3000/connections/datasources
   - Verify Prometheus data source is configured
   - Test connection should succeed
7. Verify network connectivity:
docker exec prometheus wget -O- gateway-controller:9091/metrics
docker exec prometheus wget -O- policy-engine:9003/metrics
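The container and target checks in steps 2 and 4 can also be scripted. This is a minimal sketch that assumes the container is named prometheus (as in the docker exec commands above) and that Prometheus is published on localhost:9092. The count_targets helper is a hypothetical name, and it counts matching strings in the compact JSON that Prometheus returns instead of parsing the JSON properly.

```shell
PROM_URL="${PROM_URL:-http://localhost:9092}"

# Show the status of the Prometheus container, if it is running.
check_prometheus_running() {
  docker ps --filter "name=prometheus" --format '{{.Names}}: {{.Status}}'
}

# Count targets with the given health ("up" or "down") in the
# /api/v1/targets response read from stdin.
count_targets() {
  grep -o "\"health\":\"$1\"" | wc -l | tr -d ' '
}

# Usage:
#   check_prometheus_running
#   curl -s "$PROM_URL/api/v1/targets" | count_targets down
```

A non-zero count for down points back to the checks listed in step 4.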
High Cardinality Metrics¶
Symptoms:
- Prometheus memory usage constantly increasing
- Slow query performance
- Many unique label value combinations
Diagnosis:
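A common way to find the offending metrics is to count active series per metric name. The sketch below builds that PromQL expression and shows how it might be sent to the Prometheus HTTP API. The localhost:9092 address is taken from the steps earlier in this guide, and top_series_query is a hypothetical helper name.

```shell
PROM_URL="${PROM_URL:-http://localhost:9092}"

# PromQL: the N metric names with the most active series (default 10).
top_series_query() {
  printf 'topk(%s, count by (__name__) ({__name__=~".+"}))' "${1:-10}"
}

# Usage:
#   curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$(top_series_query 10)"
#   curl -s "$PROM_URL/api/v1/status/tsdb"   # cardinality statistics for the head block
```

The selector matches every series, so this query is itself expensive on a large server and is better run occasionally than placed on a dashboard.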
Solutions:
- Remove high-cardinality labels (user IDs, session IDs, etc.)
- Use histogram buckets instead of labels
- Aggregate before labeling
Missing Metrics¶
1. Check whether the metric name changed after a component update
2. Verify metrics are being scraped
3. Check component logs for metrics errors:
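For the log check in step 3, a sketch along these lines may help. It assumes the containers are named gateway-controller and policy-engine, matching the hostnames used in the connectivity checks. Actual container names can differ depending on how the stack was started, and both helper names are hypothetical.

```shell
# Keep only log lines that mention metrics (case-insensitive).
filter_metric_lines() {
  grep -i 'metric'
}

# Print recent metric-related log lines for one container.
component_metric_logs() {
  docker logs --tail 500 "$1" 2>&1 | filter_metric_lines
}

# Usage:
#   component_metric_logs gateway-controller
#   component_metric_logs policy-engine
```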
Grafana Dashboards Not Loading¶
1. Verify Grafana is running:
2. Check Grafana logs:
3. Verify data source configuration:
   - Navigate to http://localhost:3000/connections/datasources
   - Check Prometheus URL: http://prometheus:9090
   - Test connection
4. Clear browser cache and reload dashboard
5. Re-import dashboards:
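The first two steps can be run from a shell. This sketch assumes the Grafana container is named grafana, which this guide does not confirm, and uses Grafana's /api/health endpoint on the localhost:3000 address shown above. The grafana_db_ok helper is a hypothetical name.

```shell
GRAFANA_URL="${GRAFANA_URL:-http://localhost:3000}"

# Succeeds when a /api/health response on stdin reports a healthy database.
grafana_db_ok() {
  grep -Eq '"database"[[:space:]]*:[[:space:]]*"ok"'
}

# Usage (the container name "grafana" is an assumption):
#   docker ps --filter "name=grafana"
#   docker logs --tail 100 grafana
#   curl -s "$GRAFANA_URL/api/health" | grafana_db_ok && echo "Grafana is healthy"
```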
High Memory Usage¶
1. Check Prometheus memory usage:
2. Review retention settings:
3. Check metric cardinality:
4. Reduce retention:
5. Use Thanos or remote write for long-term storage
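The memory and retention checks above might be scripted as follows. The sketch assumes the container is named prometheus and that Prometheus is published on localhost:9092. It reads the retention from the flags endpoint of the Prometheus HTTP API, and retention_from_flags is a hypothetical helper that extracts one value with grep and cut instead of parsing the JSON.

```shell
PROM_URL="${PROM_URL:-http://localhost:9092}"

# Extract the configured retention period from the
# /api/v1/status/flags response read from stdin.
retention_from_flags() {
  grep -o '"storage.tsdb.retention.time":"[^"]*"' | cut -d'"' -f4
}

# Usage:
#   docker stats --no-stream prometheus
#   curl -s "$PROM_URL/api/v1/status/flags" | retention_from_flags
#   curl -s "$PROM_URL/api/v1/status/tsdb"
#   # To reduce retention, start Prometheus with for example:
#   #   --storage.tsdb.retention.time=7d
```

The 7d value is only an example. Choose a period that matches the retention guidance in the Production section.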
Slow Queries¶
1. Identify slow queries:
# Check query duration in Prometheus UI
# Navigate to http://localhost:9092/graph
# Run query and check execution time
2. Optimize queries:
   - Use rate() instead of raw counters
   - Use proper time ranges
   - Avoid high-cardinality labels
   - Use recording rules for common queries
3. Increase Prometheus resources:
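Before raising resource limits, it can help to measure how long a query actually takes. The sketch below times one instant query with curl and flags it when it exceeds a threshold. It assumes Prometheus on localhost:9092 and a container named prometheus. The helper names, the 1s default threshold, and the 2g memory value are illustrative only.

```shell
PROM_URL="${PROM_URL:-http://localhost:9092}"

# Print the total time, in seconds, that one instant query takes.
time_query() {
  curl -s -o /dev/null -w '%{time_total}\n' -G "$PROM_URL/api/v1/query" \
    --data-urlencode "query=$1"
}

# Succeeds when a duration in seconds exceeds the limit (default 1).
is_slow() {
  awk -v t="$1" -v limit="${2:-1}" 'BEGIN { exit !(t > limit) }'
}

# Usage:
#   t=$(time_query 'rate(gateway_controller_api_operations_total[5m])')
#   is_slow "$t" && echo "slow query: ${t}s"
#   # Raise the container memory limit (example value):
#   docker update --memory 2g --memory-swap 2g prometheus
```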
Metric Values Not Updating¶
1. Check whether a counter is being queried without rate():
# Counter without rate (shows cumulative total)
gateway_controller_api_operations_total
# Counter with rate (shows rate of change)
rate(gateway_controller_api_operations_total[5m])
2. Verify scrape configuration:
3. Check component is receiving traffic:
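Steps 2 and 3 can be checked through the Prometheus HTTP API. This sketch assumes Prometheus on localhost:9092, and recent_traffic_query is a hypothetical helper name. A result of zero over the window suggests the component received no traffic, so the counter had nothing to record.

```shell
PROM_URL="${PROM_URL:-http://localhost:9092}"

# PromQL: total operations recorded over a lookback window (default 5m).
recent_traffic_query() {
  printf 'sum(increase(gateway_controller_api_operations_total[%s]))' "${1:-5m}"
}

# Usage:
#   curl -s "$PROM_URL/api/v1/status/config"   # scrape configuration currently loaded
#   curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$(recent_traffic_query 10m)"
```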