Downtime

Under DDoS

Feb 05 at 10:12am UTC
Affected services
Licensing API
Webhooks
Dashboard

Status Report Update: Resolved
Feb 06 at 08:04pm UTC

We located the source of the problem on our end that was limiting our ability to scale under certain loads. We've deployed a fix and are now accepting all traffic. If there's a silver lining, it's that we fixed a long-standing performance problem that reared its head sporadically, and now we know why. We're going to write up a complete postmortem soon. Thanks so much for your patience, and I sincerely apologize for the problems this outage has caused all of your businesses over the last 36 hours. It's been a nightmare, but we came out stronger.

Status Report Update: Updated
Feb 06 at 12:14pm UTC

Services have stabilized, but we are continuing to shed load from the DDoS traffic. We are continuing to monitor the incident and scale up infrastructure to handle the next traffic spike without downtime.

Status Report Update: Updated
Feb 06 at 10:06am UTC

We're aware of the same issue happening again. We're working on restoring services.

Status Report Update: Updated
Feb 05 at 09:01pm UTC

Today, we experienced an unintentional DDoS. It was not an attack, as we originally reported (we were unaware of its nature at the time). It was largely due to an unexpected increase in API volume, which triggered a cascading series of events that ultimately caused an outage lasting over 5 hours. This is unacceptable, and I apologize. The timestamps in this root cause analysis are in CST.

At 12:30am CST, we experienced a large, sudden surge in legitimate traffic. This caused our autoscaler to kick in and scale up infrastructure. But because the surge was legitimate traffic, we were unable to rate limit effectively. By the time our infrastructure was scaled up enough to sustain the traffic, multiple systems had begun failing due to a backup in job processing. To make matters worse, the source of the traffic had an aggressive retry algorithm configured: instead of backing off exponentially, it sped up, which effectively DDoS'd us with the volume of traffic.
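
For contrast, a well-behaved retry policy backs off exponentially and adds jitter so that retries spread out instead of arriving in synchronized waves. Here's a minimal sketch of that idea (illustrative only; this is not the customer's actual client code):

    import random
    import time

    def retry_with_backoff(send_request, max_attempts=8, base_delay=0.5, max_delay=60.0):
        """Retry a callable with exponential backoff and full jitter."""
        for attempt in range(max_attempts):
            try:
                return send_request()  # any API call that raises on failure
            except Exception:
                if attempt == max_attempts - 1:
                    raise
                # Double the delay ceiling each attempt, then sleep a random
                # amount under it so many clients don't retry in lockstep.
                delay = min(max_delay, base_delay * (2 ** attempt))
                time.sleep(random.uniform(0, delay))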

Three hours elapsed before we were made aware of the incident. At 4am CST, we were able to page an engineer to investigate. This was a terrible response time, and we'll cover what went wrong here at the end. But by 4am, we were aware of the incident, unraveling what had happened, and formulating a response.

We noticed some outlier IPs with respect to volume, and we immediately began blacklisting the IP addresses that were sending higher-than-normal API volume. But that wasn't enough, and there were too many IPs. The volume was distributed, and as far as we could tell, it was legitimate traffic. This made things complicated.
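
For reference, spotting those outliers amounts to counting requests per client IP over a window and flagging the heaviest senders. A rough sketch, assuming a common access-log format where the client IP is the first field (the threshold and parsing here are illustrative):

    from collections import Counter

    def top_talkers(log_lines, threshold=10_000):
        """Count requests per client IP and return the heaviest senders."""
        # Assumes the client IP is the first whitespace-delimited field,
        # as in nginx/Apache combined logs; adjust for your log format.
        counts = Counter(line.split()[0] for line in log_lines if line.strip())
        return [(ip, n) for ip, n in counts.most_common() if n >= threshold]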

To prevent further strain on our systems, we enabled Cloudflare's "Under Attack" mode, which effectively blocked the majority of API requests (since an API client cannot solve a Cloudflare "challenge response"). We wished there was another way, but Cloudflare's WAF is not very friendly to API-only products. We ultimately did this so that our systems could recover, and to give us time to determine next steps. Our Redis instances were at capacity due to the number of jobs to process, and this "pause" gave our systems time to recover.

Around 5am, once our systems were sufficiently recovered, we disabled Cloudflare's "Under Attack" mode. But the DDoS was still ongoing, and we determined that the traffic was originating from a single account. We began serving challenges to requests matching that account, effectively blocking its traffic until we could deal with it in a more controlled fashion.

After getting in contact with the account owner, we determined that the traffic was indeed legitimate. We began letting in a percentage of the traffic until their applications were able to stabilize and cease retries (which was effectively the source of the DDoS).
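
Letting in a percentage of traffic can be done by deterministically sampling on a stable request attribute and raising the percentage as downstream queues drain. A rough sketch of that approach (the attribute names are hypothetical; this is not the exact mechanism we used):

    import zlib

    def admit(account_id: str, request_id: str, admit_percent: int) -> bool:
        """Admit roughly admit_percent% of an account's requests."""
        # Hashing a stable attribute keeps the decision consistent per request,
        # so raising admit_percent ramps traffic back up gradually.
        bucket = zlib.crc32(f"{account_id}:{request_id}".encode()) % 100
        return bucket < admit_percent

    # e.g. start by admitting 10% of the account's traffic, then increase
    # the percentage as the job backlog recovers.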

We then began receiving reports that customers were experiencing intermittent signature verification failures. After investigation, this was due to Cloudflare modifying the Date header we send, which is a part of a response's cryptographic signature. It was intermittent because it was dependent upon the current time.

For example, let's assume a request has a timestamp of:

2024-02-04T23:01:10.993239

This results in us sending the following Date header:

Date: Sun, 04 Feb 2024 23:01:10 GMT

Unfortunately, the Date header is rewritten by Cloudflare to the following:

Date: Sun, 04 Feb 2024 23:01:11 GMT

Note the difference in the seconds portion. By the time Cloudflare sent the response, the Date had rolled over to the next second, because the original timestamp was already at .993239 seconds. This change in Date is what was causing the signature verification failures, since the Date header is included in the response signature to prevent replay attacks.
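
To make the failure mode concrete, here's a small Python sketch (purely illustrative) showing how a sub-second difference in when the Date header is generated changes the serialized header, and therefore the string that gets signed:

    from datetime import datetime, timedelta, timezone
    from email.utils import format_datetime

    # The response was generated at 23:01:10.993239 UTC...
    origin_time = datetime(2024, 2, 4, 23, 1, 10, 993239, tzinfo=timezone.utc)
    # ...but by the time the proxy emitted its own Date header, the clock
    # had ticked into the next second.
    proxy_time = origin_time + timedelta(milliseconds=10)

    print(format_datetime(origin_time, usegmt=True))  # Sun, 04 Feb 2024 23:01:10 GMT
    print(format_datetime(proxy_time, usegmt=True))   # Sun, 04 Feb 2024 23:01:11 GMT

    # If the Date header is part of the signed string, these two values
    # produce different signing inputs, so verification fails.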

At 5am, service was restored for all accounts, and API volume was back to a normal level. At this point, we disabled Cloudflare's CDN to resolve the signature verification issue. Our request logs and event logs were delayed for another hour before normalizing.

It's unacceptable that it took over 3 hours to acknowledge the incident, and that we ultimately had over 5 hours of downtime. I apologize for that.

I was the on-call engineer, and I failed that duty. A few things went wrong:

  1. Our pager service was not configured to send a phone call due to an oversight in the on-call configuration.
  2. The on-call engineer's phone was placed on DnD mode (do-not-disturb) at 12am CST.
  3. The pager service's text notifications were not whitelisted from DnD mode.
  4. The email alerts were not seen due to DnD mode being enabled (I was asleep).
  5. There were no escalation policies for paging more aggressively.

I sincerely apologize for this incident. I feel terrible. I feel defeated. I haven't slept. It's been a day of putting out fires in the aftermath, as I'm sure it has been for you. A lot of things went wrong, and I'm still trying to piece things together. But the above is what I do know now.

I'm looking at ways to prevent this from happening again, from scaling up infrastructure, to increasing datastore capacity, to finding a new WAF provider that is compatible with our API. I'm in talks with multiple providers.

I understand licensing is a vital part of a software business, and we regret this happened today. I'm sorry.

Status Report Update: Updated
Feb 05 at 03:44pm UTC

API availability has returned to normal, and response signatures should now be working as expected. The bulk of the traffic has been mitigated, but we are continuing to monitor the situation as overall request volume is still elevated. If another surge occurs, we will update this report. We have a lot of data in our backlog, so request/event logs and webhooks may be delayed.

Status Report Update: Updated
Feb 05 at 01:50pm UTC

We've employed Cloudflare to mitigate the issue, but this has had an effect on signature verifications. Due to the way Cloudflare operates, the Date header it overwrites may not match the Date header Keygen sends, resulting in signature verification errors. If possible, use the Keygen-Date header for signature verification purposes, instead of Date, to mitigate this until we're able to resolve the issue with Cloudflare.
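
In practice, the client-side change is small: wherever your verification code reads the Date header, prefer Keygen-Date and fall back to Date. A minimal sketch (the verification routine itself is whatever you already use; any helper names below are hypothetical stand-ins):

    def signing_date(headers: dict) -> str:
        """Return the date value to use when verifying a response signature."""
        # Keygen-Date reflects the value Keygen actually signed, whereas Date
        # may have been rewritten by Cloudflare in transit.
        return headers.get("Keygen-Date") or headers["Date"]

    # e.g. pass signing_date(headers) wherever your existing verification
    # code previously used headers["Date"].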

Status Report Update: Updated
Feb 05 at 11:27am UTC

We are still under an active DDoS attack, but we have put in place mitigations. Most API traffic has returned to normal, but we are still working with our infrastructure providers on a full resolution. Some accounts may still be inaccessible due to the nature of the attack. Thank you for your patience.

Status Report Update: Created
Feb 05 at 10:12am UTC

As of 23:48:02, we have been experiencing a major DDoS attack. We're working with our infrastructure provider on a resolution. Thank you for your patience.