Slow platform due to high latency
Incident Report for ZapSign
Postmortem

Date: 30/04/2024

Authors: André Chaves, Eduardo Milhomen, Douglas Ferreira

Status: Solved

Impact: Partial latency on document creation

Summary: On April 30, 2024, at 08:43 (UTC-3), our team detected significant latency on the document creation endpoint. As a result, some requests kept processing well beyond the expected time frame.

Root Causes:

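The primary cause of the latency was that requests to the document creation endpoint were held open for an extended period instead of being processed promptly. Held requests kept capacity occupied, which slowed the requests that followed.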

Trigger:

The issue was first identified through a DataDog alert, which reported excessive latency affecting the Itau endpoints.

Resolution:

To resolve the issue, we scaled up the number of pods handling the traffic and implemented a strict time limit on the endpoint. Now, no request is held for more than 60 seconds. This adjustment has stabilized the document creation process and resolved the latency issues.
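
As a rough illustration of the time limit described above, here is a minimal sketch in Python. It is not ZapSign's actual implementation: the handler names, the 504 status code, and the use of `asyncio` are assumptions made for the example. Only the 60-second limit comes from this report.

```python
import asyncio

# The 60-second limit is the value stated in this report.
REQUEST_TIMEOUT_SECONDS = 60


async def run_with_deadline(handler, timeout=REQUEST_TIMEOUT_SECONDS):
    """Run a handler coroutine and return (status, body).

    If the handler does not finish within `timeout` seconds, it is
    cancelled and a 504 response is returned instead of holding the
    request open. The status codes are assumptions for this sketch.
    """
    try:
        body = await asyncio.wait_for(handler(), timeout=timeout)
        return 200, body
    except asyncio.TimeoutError:
        # wait_for cancels the handler task, so it stops consuming capacity.
        return 504, "document creation timed out"


async def fast_create_document():
    # Hypothetical handler that completes immediately.
    return "created"


async def slow_create_document():
    # Hypothetical handler that takes longer than the test deadline.
    await asyncio.sleep(0.2)
    return "created"


if __name__ == "__main__":
    print(asyncio.run(run_with_deadline(fast_create_document)))
    print(asyncio.run(run_with_deadline(slow_create_document, timeout=0.05)))
```

The point of the deadline is that a slow request fails fast with a clear error rather than occupying a worker indefinitely, which keeps the remaining capacity available for other requests.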

Lessons Learned

What went well

The monitoring application alerted us to the latency promptly.

What went wrong

We took more time than expected to identify the root cause of the problem.

Where we got lucky

Despite the significant latency, the impact was confined to partial delays on document creation rather than a complete system outage.

Timeline

  • 30/04/2024 08:43 (UTC-3): High latency detected on some endpoints of the application
  • 30/04/2024 09:00 (UTC-3): Scalability changes applied to the application
  • 30/04/2024 09:40 (UTC-3): Second occurrence of high latency
  • 30/04/2024 09:44 (UTC-3): Change applied to the endpoint
Posted Apr 30, 2024 - 13:02 GMT-03:00

Resolved
On April 30, 2024, at 08:43 (UTC-3), our team detected significant latency on the document creation endpoint. As a result, some requests kept processing well beyond the expected time frame.
Posted Apr 30, 2024 - 13:01 GMT-03:00
This incident affected: API.