Incident Overview¶
This section explains how Choreo automatically detects, analyzes, and helps you manage component incidents. The incident management feature provides automated root cause analysis to help you quickly understand and resolve issues affecting your components.
Tip
Incidents are automatically created when critical system events are detected in your components. You don't need to configure anything. Choreo monitors your components and creates incidents when issues occur.
What are Incidents?¶
Incidents are automatically generated alerts that indicate your component has experienced a critical issue affecting its stability or availability. When an incident occurs, Choreo automatically:
- Creates an incident record with detailed information
- Collects relevant logs and metrics from before and during the incident
- Analyzes recent deployment source code and configuration changes
This helps you quickly identify what went wrong and how to fix it.
Incident Types¶
Choreo automatically detects and tracks the following types of incidents:
OOMKilled Incidents¶
OOMKilled incidents occur when your component runs out of memory and is terminated by the system.
Common causes:
- Memory leaks in your application
- Memory allocation too low for your workload's needs
- Unexpected traffic spikes causing memory pressure
- Large data processing without proper resource management
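The memory-leak cause above often takes the form of an unbounded in-memory cache that grows until the container hits its limit. The sketch below is a hypothetical illustration (not part of Choreo) of replacing such a cache with a size-bounded, LRU-style one; the class name and capacity are assumptions for the example.

```python
from collections import OrderedDict

class BoundedCache:
    """An LRU-style cache with a fixed capacity. Unlike a plain dict that
    grows without bound, this evicts old entries and keeps memory flat."""

    def __init__(self, max_entries=1000):
        self.max_entries = max_entries
        self._store = OrderedDict()

    def put(self, key, value):
        if key in self._store:
            self._store.move_to_end(key)  # refresh recency on overwrite
        self._store[key] = value
        # Evict the least recently used entry once the cap is exceeded.
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)

    def get(self, key):
        if key in self._store:
            self._store.move_to_end(key)  # mark as recently used
            return self._store[key]
        return None
```

Capping data structures like this keeps memory usage proportional to the cap rather than to total traffic, which is what a steadily climbing memory graph in the Metrics section typically indicates is missing.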
CrashLoopBackOff Incidents¶
CrashLoopBackOff incidents occur when your component repeatedly crashes and restarts.
Common causes:
- Application fails to start properly
- Missing or incorrect configuration values
- Unable to connect to required dependencies (databases, APIs, etc.)
- Code errors that cause the application to crash immediately on startup
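Missing or incorrect configuration, the second cause above, is easier to diagnose when the application validates its configuration at startup and exits with a clear message instead of crashing mid-initialization. A minimal sketch, assuming hypothetical variable names (`DATABASE_URL`, `API_KEY`) that are not part of Choreo:

```python
import os
import sys

# Hypothetical required settings for this example application.
REQUIRED_VARS = ["DATABASE_URL", "API_KEY"]

def validate_config(environ=os.environ):
    """Return the names of required variables that are missing or empty.
    An empty list means the configuration is complete."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]

def main():
    missing = validate_config()
    if missing:
        # A single explicit error line in the logs beats a crash loop.
        print(f"Missing required configuration: {', '.join(missing)}",
              file=sys.stderr)
        sys.exit(1)
    # ... start the application here ...

if __name__ == "__main__":
    main()
```

With this pattern, the Logs section of an incident shows one unambiguous "Missing required configuration" line rather than a stack trace from deep inside startup code.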
View Incidents¶
Accessing the Incidents Page¶
1. Navigate to the component you want to monitor.
2. In the left navigation menu, click Observability and then click Incidents.

    Note

    You can view incidents at either the project level or the component level.

3. The incidents page displays all detected incidents for your project or component, depending on the level at which you view them.
Filtering Incidents¶
Use the filters at the top of the incidents page to find specific incidents:
- Time Range: Select a date range to view incidents from a specific period
- Environment: View incidents from specific environments (e.g., Development, Production)
Understanding Incident Details¶
Click on any incident to view comprehensive diagnostic information.
Incident Summary¶
At the top of the incident details page, you'll see:
- Incident Type: What kind of issue occurred (OOMKilled or CrashLoopBackOff)
- Incident ID: Unique identifier for the incident
- Time of Failure: Exact date and time of the incident
Incident Analysis Sections¶
Once you open an incident, you'll find four key sections that help you understand and resolve the issue:
| Section | Description |
|---|---|
| Compare Source Code | Analyzes code changes between the incident version and the previous stable state. |
| Compare Configurations | Highlights changes in environment variables or resource allocations. |
| Logs | Displays filtered logs and events captured at the time of the incident. |
| Metrics | Shows resource usage and performance metrics leading up to the incident. |
1. Compare Source Code
Analyzes code changes between the incident version and the previous stable deployment.
What you'll see:
- Side-by-side code diff showing what changed between versions
- Specific files and lines that were modified
- A link to view the commit diff in the respective Git provider
Note
If the commit diff exceeds 5,000 characters, only the commit diff link is shown.
Tip
If the incident occurred shortly after a deployment, carefully review the code changes; they often reveal the root cause.
2. Compare Configurations
Highlights changes in environment variables, secrets, and resource allocations (CPU/Memory).
What you'll see:
- Configuration differences between the current and previous deployment
- Changes in environment variables
- Resource allocation modifications (CPU and memory limits)
- Secret and ConfigMap changes
Important
For OOMKilled incidents, check if memory limits were reduced. For CrashLoopBackOff, verify that all required environment variables are correctly set.
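The comparison this section performs can be pictured as a simple three-way diff of key-value mappings. The sketch below is an illustrative stand-in, not Choreo's implementation; the function name and sample variables are assumptions for the example.

```python
def diff_config(previous, current):
    """Summarize added, removed, and changed keys between two
    environment-variable (or resource-setting) mappings."""
    added = {k: current[k] for k in current.keys() - previous.keys()}
    removed = {k: previous[k] for k in previous.keys() - current.keys()}
    changed = {k: (previous[k], current[k])
               for k in previous.keys() & current.keys()
               if previous[k] != current[k]}
    return {"added": added, "removed": removed, "changed": changed}
```

For example, a reduced memory limit between deployments would surface as a `changed` entry such as `{"MEMORY_LIMIT": ("512Mi", "256Mi")}`, which is exactly the kind of change worth checking for an OOMKilled incident.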
3. Logs
Displays filtered logs and events captured 5 minutes before and 20 seconds after the incident occurred.
What you'll see:
- Application Logs: your application's output and error logs
- System Logs: container lifecycle events (restarts, crashes, OOMKilled events)
- Gateway Logs: API gateway access and error logs (if applicable)
- Up to 200 log entries per log type, helping you pinpoint exactly what happened
Note
Logs are automatically collected from 5 minutes before the incident to 20 seconds after, ensuring you have context before and during the failure.
4. Metrics
Shows resource usage and performance metrics leading up to the incident.
What you'll see:
- Memory and CPU Usage: trends showing resource consumption over time
- Request Rates: number of requests per minute
- Response Times: API latency and performance metrics
- Error Rates: HTTP error status codes and failure rates

Use metrics to:
- Identify memory leaks (steadily increasing memory usage)
- Spot CPU spikes that may have caused issues
- Correlate traffic spikes with the incident
- Understand performance degradation patterns
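"Steadily increasing memory usage" can be made concrete with a least-squares trend over the sampled values: a persistently positive slope across many samples suggests a leak rather than a one-off spike. This is a hypothetical helper for your own analysis, not a Choreo API:

```python
def memory_trend(samples):
    """Least-squares slope of a series of memory samples, in units of
    memory per sample interval. A persistently positive slope across a
    long window is a leak signal; a flat or oscillating series is not."""
    n = len(samples)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den
```

A series like `[100, 110, 120, 130]` MiB yields a slope of 10 MiB per interval, while a stable workload hovers around zero.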
Best Practices¶
Prevent OOMKilled Incidents¶
- Set appropriate memory limits based on your component's actual usage patterns
- Monitor memory trends regularly to catch gradual increases
- Implement proper memory management in your code
- Add alerts for high memory usage before it reaches the limit
Prevent CrashLoopBackOff Incidents¶
- Test deployments in non-production environments first
- Validate configuration before deploying
- Implement health checks in your application
- Ensure dependencies are available before deploying
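The last two practices above, health checks and dependency readiness, can be as simple as confirming that a required service accepts TCP connections before the application reports itself ready. A minimal sketch (the function name and timeout are assumptions, not a Choreo feature):

```python
import socket

def dependency_ready(host, port, timeout=2.0):
    """Return True if a TCP dependency (e.g. a database) accepts
    connections within the timeout; False otherwise. Checking this in a
    readiness probe avoids crash loops caused by unreachable services."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Wiring a check like this into a readiness endpoint means the platform holds traffic until dependencies are reachable, instead of letting the application crash on its first failed connection.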
General Best Practices¶
- Review incidents regularly: Don't wait for critical issues—proactively review and address incidents
- Follow recommendations: The root cause analysis provides actionable steps—follow them
- Track patterns: If similar incidents occur repeatedly, investigate deeper systemic issues
- Update resource limits: Adjust CPU and memory limits based on incident insights
- Keep deployments small: Smaller, incremental deployments make it easier to identify what changed when incidents occur