Best Practices and Troubleshooting¶
Best Practices¶
Development¶
- Use 100% sampling rate (sampling_rate: 1.0)
- Enable debug output in OTLP collector
- Use Jaeger for quick trace visualization
- Keep trace data for 1-7 days
Production¶
- Use managed services (Datadog, New Relic, etc.) to reduce operational overhead
- Implement appropriate sampling (1-10% depending on traffic volume)
- Enable TLS for OTLP connections
- Set resource limits on OTLP collector
- Monitor collector health and performance
- Implement trace retention policies based on compliance and storage costs
- Use tail-based sampling to keep important traces (errors, slow requests)
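To make the tail-based sampling idea concrete, here is a minimal sketch of the decision such a policy makes once a trace is complete: keep every trace that contains an error or a slow span, and keep a small fraction of the rest. The `Span` type, the 500 ms threshold, and the 5% baseline are hypothetical values chosen for illustration. In a real deployment this decision is typically made inside the collector, not in application code.

```python
import random
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical, minimal span record used only for this sketch."""
    name: str
    duration_ms: float
    is_error: bool = False


def keep_trace(spans, slow_ms=500.0, baseline_rate=0.05, rng=random.random):
    """Decide whether to keep a completed trace (tail-based decision).

    Traces with an error span or a slow span are always kept; all other
    traces are kept with probability `baseline_rate`.
    """
    if any(span.is_error for span in spans):
        return True
    if any(span.duration_ms >= slow_ms for span in spans):
        return True
    # `rng` is injectable so the probabilistic branch can be tested.
    return rng() < baseline_rate
```

Because the decision is made after the whole trace has been seen, error and slow traces survive even when the baseline rate is very low.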
Security¶
- Enable TLS for trace transmission
- Sanitize sensitive data from trace attributes
- Implement proper access controls for trace viewing
- Regularly audit who accesses trace data
- Consider data residency requirements
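As one possible way to sanitize trace attributes before export, the sketch below masks values whose key ends in a sensitive name. The key list, the mask string, and the function name are assumptions made for illustration; they are not part of the gateway or of any tracing library.

```python
# Hypothetical deny-list; adjust to the attributes your components emit.
SENSITIVE_KEYS = {"authorization", "cookie", "set-cookie", "password", "api_key"}


def sanitize_attributes(attributes, sensitive_keys=SENSITIVE_KEYS, mask="[REDACTED]"):
    """Return a copy of `attributes` with sensitive values replaced by `mask`."""
    cleaned = {}
    for key, value in attributes.items():
        # Compare on the last dotted segment, case-insensitively, so that
        # "http.request.header.authorization" matches "authorization".
        leaf = key.rsplit(".", 1)[-1].lower()
        cleaned[key] = mask if leaf in sensitive_keys else value
    return cleaned
```

The function returns a new dictionary and leaves the input untouched, so it can be applied just before attributes are attached to a span.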
Performance¶
- Use appropriate sampling rates to balance visibility and overhead
- Configure batch settings to optimize network usage
- Monitor gateway component overhead from tracing
- Use asynchronous trace export (default with OTLP)
- Consider using tail-based sampling for high-volume environments
Sampling Strategy¶
Choose sampling based on traffic volume:
| Traffic Volume | Sampling Rate | Use Case |
|---|---|---|
| < 100 req/s | 100% (1.0) | Full visibility, low overhead |
| 100-1000 req/s | 10-50% (0.1-0.5) | Balanced visibility and cost |
| 1000-10000 req/s | 1-10% (0.01-0.1) | Cost-effective, statistical sampling |
| > 10000 req/s | 0.1-1% (0.001-0.01) | Minimal overhead, error sampling |
Note: Use tail-based sampling to keep 100% of error traces, regardless of the sampling rate chosen above.
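The table above can be expressed as a small lookup, which may be handy when generating configuration for several environments. This is a sketch only: the function name is hypothetical, and because the table's ranges share their boundary values, the choice to place exactly 100 and 1000 req/s in the higher-traffic row is an assumption.

```python
def sampling_rate_range(requests_per_second):
    """Return the (low, high) sampling-rate range suggested by the table."""
    if requests_per_second < 0:
        raise ValueError("requests_per_second must be non-negative")
    if requests_per_second < 100:
        return (1.0, 1.0)      # full visibility, low overhead
    if requests_per_second < 1000:
        return (0.1, 0.5)      # balanced visibility and cost
    if requests_per_second <= 10000:
        return (0.01, 0.1)     # cost-effective, statistical sampling
    return (0.001, 0.01)       # minimal overhead, rely on error sampling
```

For example, a service handling about 500 req/s would start somewhere between 0.1 and 0.5 and adjust from there based on storage cost.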
Troubleshooting¶
Traces Not Appearing in Jaeger¶
1. Verify tracing is enabled in configuration:
Ensure enabled = true.
2. Check OTLP Collector is running:
3. View OTLP Collector logs:
Look for connection errors or export failures.
4. Check Jaeger is running:
5. Verify network connectivity:
6. Check gateway component logs for trace export errors:
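The network connectivity check in the steps above can also be scripted. The following sketch attempts a plain TCP connection and reports whether it succeeded; it does not verify that the service behind the port speaks OTLP. The function name is hypothetical.

```python
import socket


def port_is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

A typical use would be `port_is_reachable("otel-collector", 4317)` and `port_is_reachable("jaeger", 16686)`, run from inside the gateway's network. The host names here are placeholders, and 4317 is the conventional OTLP/gRPC port, which your deployment may have changed.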
Traces Are Incomplete or Missing Spans¶
1. Check sampling rate - ensure it's not too low
2. Verify all components are configured to export traces
3. Check for trace context propagation issues - ensure headers are preserved
4. Look for timeout errors in OTLP collector logs
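When checking whether propagation headers are preserved, it can help to validate the header value a downstream component actually receives. The sketch below assumes context is propagated with the W3C Trace Context `traceparent` header, which this guide does not state explicitly, and it handles only version `00` of that format. The function name is hypothetical.

```python
import re

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header_value):
    """Return (trace_id, parent_span_id, sampled), or None if the value is invalid."""
    match = _TRACEPARENT_RE.fullmatch(header_value.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero trace or span IDs are invalid in the W3C format.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    sampled = bool(int(flags, 16) & 0x01)
    return trace_id, span_id, sampled
```

If a hop returns `None` or a different trace ID from the one sent upstream, the header is probably being dropped or rewritten at that hop.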
High Trace Export Overhead¶
1. Reduce sampling rate:
2. Increase batch size:
3. Use tail-based sampling in OTLP collector to sample only important traces
Traces Have Incorrect Timing¶
- Ensure system clocks are synchronized across all hosts running the containers (use NTP)
- Check for clock skew in trace timeline view
- Verify trace context propagation is working correctly
Cannot Access Jaeger UI¶
1. Verify Jaeger is running:
2. Check Jaeger logs:
3. Ensure port 16686 is not blocked:
Integration with Logging¶
Traces and logs work together for comprehensive observability:
Correlating Traces and Logs¶
- Trace ID in Logs: Gateway components include trace IDs in log entries
- Find Trace from Log: Copy trace ID from log entry and search in Jaeger
- Find Logs from Trace: Copy trace ID from Jaeger and search in log viewer
Example log entry with trace ID:
{
  "level": "info",
  "ts": "2025-12-19T10:30:45.456Z",
  "msg": "Policy executed",
  "trace_id": "0af7651916cd43dd8448eb211c80319c",
  "span_id": "b7ad6b7169203331",
  "policy": "modify-headers"
}
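Finding logs for a given trace can be scripted when no log viewer is at hand. The sketch below assumes logs are written as one JSON object per line with a `trace_id` field, as in the entry above; the function name is hypothetical, and lines that are not valid JSON are skipped.

```python
import json


def logs_for_trace(log_lines, trace_id):
    """Yield parsed log entries whose `trace_id` field equals `trace_id`."""
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip plain-text or truncated lines
        if isinstance(entry, dict) and entry.get("trace_id") == trace_id:
            yield entry
```

Passing an open file object as `log_lines` works, since iterating over a file yields one line at a time.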
Using Both Stacks¶
Enable both logging and tracing profiles:
This provides complete observability:
- Traces: Request flow and performance
- Logs: Detailed event information and debugging