
Langfuse Incident Response Plan

Declaring an incident

Any Langfuse or ClickHouse team member can declare an incident at any time; don't wait for certainty. When in doubt, page.

Langfuse incidents are handled in the ClickHouse incident.io account.

  1. Use /inc in Slack or declare the incident from incident.io.
  2. Make it clear in the title and description that this is a Langfuse incident.
  3. Include the affected service(s), region, customer impact, and useful links to errors, dashboards, traces, or support tickets.
  4. Select the highest plausible severity; this is not published externally.
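As an illustration of steps 2 and 3, the declaration text can be drafted before it is pasted into /inc or incident.io. This is a minimal sketch: `IncidentDraft`, the `[Langfuse]` title prefix, and the example values are assumptions made for illustration, and nothing here calls the incident.io API.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentDraft:
    """Hypothetical helper that drafts the text of an incident declaration."""

    summary: str
    services: list[str]
    region: str
    customer_impact: str
    links: list[str] = field(default_factory=list)

    def title(self) -> str:
        # Step 2: make it obvious in the title that this is a Langfuse incident.
        # The "[Langfuse]" prefix is an assumed convention, not a documented one.
        return f"[Langfuse] {self.summary}"

    def description(self) -> str:
        # Step 3: affected service(s), region, customer impact, and useful links.
        lines = [
            "Langfuse incident",
            f"Affected service(s): {', '.join(self.services)}",
            f"Region: {self.region}",
            f"Customer impact: {self.customer_impact}",
        ]
        if self.links:
            lines.append("Links:")
            lines.extend(f"- {link}" for link in self.links)
        return "\n".join(lines)


# Example values are placeholders only.
draft = IncidentDraft(
    summary="Ingestion delays",
    services=["ingestion"],
    region="EU",
    customer_impact="New traces appear late in the UI",
    links=["https://example.com/dashboard"],
)
print(draft.title())
print(draft.description())
```

The draft deliberately leaves out severity: step 4 asks for the highest plausible severity, which is a judgment call made when declaring.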

For urgent issues, use /inc escalate to page Langfuse directly. You can also tag @langfuse-oncall in Slack. ClickHouse teammates can find the team in #langfuse; if the issue is related to ClickHouse instances, also involve #ch-sre-team.

incident.io will automatically create a Slack channel and route the escalation to the Langfuse on-call team. Salesforce support escalations should use the same Langfuse on-call flow.

When to page: platform outages, ClickHouse instance issues affecting Langfuse, security issues, elevated errors, a broken core product flow, or a customer seeing another customer's data.

Response

The first engineer to join the incident channel is the Incident Lead. Assign yourself the role in incident.io. Pull in DRIs of affected components, ClickHouse SREs, or Max if needed. For customer-facing incidents, pull someone from the business side to monitor Slack channels and support tickets.

  1. Triage: Collect evidence (screenshots, metrics, logs) and publish a status page update (see below). For critical incidents, enable the product announcement banner.
  2. Mitigate: Restore the system first: rollback, scale up, feature-flag, hotfix. Root cause comes later.
  3. Stabilize: Mark as mitigated in incident.io, update the status page, monitor for 15–30 min, then dissolve the call.

War room call

Keep all incident communication in the incident.io war room call so remote teammates can join quickly and we have a transcript for the post-mortem.

  1. Open the incident Slack channel created by incident.io.
  2. Click ☎️ Join the call in the incident.io message.

Status page

Status pages are extremely important: they are our mechanism to show transparency to users, which builds trust. When in doubt, always set up a status page.

The following should always have a status page:

  • Eval execution delays
  • Ingestion delays
  • Errors/latencies on public APIs
  • Login issues
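The rule above can be written as a small check. This is only a sketch: the function name and the lowercase category strings are assumptions, and a `False` result means the issue is not on the always-publish list, not that a status page is unnecessary.

```python
# Categories from the list above, lowercased for comparison.
ALWAYS_PUBLISH = {
    "eval execution delays",
    "ingestion delays",
    "errors/latencies on public apis",
    "login issues",
}


def should_publish_status_page(issue: str, in_doubt: bool = False) -> bool:
    """Return True when a status page should definitely be set up.

    Hypothetical helper: listed categories always qualify, and so does
    any case where the responder is in doubt.
    """
    return in_doubt or issue.strip().lower() in ALWAYS_PUBLISH


print(should_publish_status_page("Ingestion delays"))
print(should_publish_status_page("Slow internal dashboard", in_doubt=True))
```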

To publish: go to "Status Pages" in incident.io, select our public status page, and hit "Publish Incident". Declaring an incident via /inc in Slack does not automatically create a public status page update; you must publish it separately.

The incident lead keeps the status page up to date with concise and accurate information throughout the incident.

Post-mortem

After mitigation, find and fix the root cause. Complete the post-mortem in Linear using the auto-generated timeline, covering: summary, impact, root cause, contributing factors, and action items with owners. Track follow-ups in the Linear ticket. Share in #team-engineering.

