502 Error USA Data Center
Incident Report for AskNicely
Postmortem

The 502 error

Today a number of customers may have experienced a 502 error and were unable to access the AskNicely platform.

We are super proud of the platform we have built, and when we let our customers down it really hurts; we know we need to do a better job. We are sorry you were not able to access our platform. Very sorry. We have a fantastic engineering team, and over the next week we will be focusing on our infrastructure to help minimise outages like the one you may have seen today.

What went wrong

AskNicely is built on AWS (Amazon), an amazing platform that allows us to scale our solution very easily. Today we hit an issue with extremely heavy load on our USA database server (RDS). The symptoms we saw:

  • Elevated 502 error rates.
  • Load balancer errors: 'unhealthy web server in load balancer pool'.
  • Database load in RDS going from under 5% to 100% in a matter of seconds. Very abnormal.
  • Our 502 error page did not tell our customers what was happening, nor link to our status page. Bad.

What went right

We have extensive monitoring on AskNicely, and some fantastic services that we love kicked in as soon as they detected something abnormal. The services we use today:

  • PagerDuty.com: We love PagerDuty for alerting via the mobile app, email, SMS and automated phone calls, with automatic escalation policies to other team members.
  • Datadog.com: Provides us with detailed metrics around our application performance and servers. We send a massive amount of data back to Datadog, and it's a valuable asset that we use for real-time monitoring and debugging (a small example follows this list).
  • Loggly.com: All our log files and error logs are managed in Loggly. We can easily visualise and quantify requests from customers in seconds using its powerful log query tool.
  • NewRelic.com: Provides incredibly detailed analysis of which parts of our application are used the most, how well that code is performing and which parts of the code are the slowest. It also monitors how long our application takes to load for our customers. We absolutely love New Relic; it is our litmus test for whether our code changes have resolved an issue.
  • Slack.com: Makes it so easy for our team to stay on the same page and communicate instantly, no matter where we are in the world.
  • Statuspage.io: You can find a link to our status page from the www.asknicely.com homepage and our 404 pages.
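
As an illustration of the Datadog point above, here is a minimal sketch of emitting a custom application metric through the DogStatsD client in the datadog Python package. The metric names, tags and agent address are hypothetical examples, not our actual instrumentation:

    # Minimal sketch: sending a custom metric to a local Datadog agent via DogStatsD.
    # Metric names, tags and agent address are illustrative, not our real instrumentation.
    from datadog import initialize, statsd

    initialize(statsd_host="127.0.0.1", statsd_port=8125)

    def record_api_request(endpoint: str, status_code: int, duration_ms: float) -> None:
        """Send a request count and a latency sample for one API call."""
        tags = [f"endpoint:{endpoint}", f"status:{status_code}"]
        statsd.increment("asknicely.api.requests", tags=tags)
        statsd.histogram("asknicely.api.request_ms", duration_ms, tags=tags)

    record_api_request("/api/v1/responses", 502, 1840.0)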

What we discovered

During this time, we came under very heavy API load from one customer. Normally our API rate limiter would kick in and prevent any single customer from causing an outage. But due to the size of this customer's dataset, our API was too slow to respond to all their requests, causing massive congestion. Our API rate limiter is tuned for the number of requests, not the time it takes to process a request.
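
To make that distinction concrete, here is a minimal sketch of a cost-aware limiter that spends a per-customer budget of processing time rather than counting requests. The class, budget and window size are hypothetical and simplified; this is not our production code:

    import time
    from collections import defaultdict

    class CostAwareRateLimiter:
        """Sketch: each customer gets a budget of processing milliseconds per minute."""

        def __init__(self, budget_ms_per_minute: float = 10_000.0):
            self.budget = budget_ms_per_minute
            self.spent = defaultdict(float)         # customer_id -> ms used this window
            self.window_start = defaultdict(float)  # customer_id -> window start time

        def allow(self, customer_id: str) -> bool:
            now = time.time()
            if now - self.window_start[customer_id] >= 60:
                self.window_start[customer_id] = now
                self.spent[customer_id] = 0.0
            return self.spent[customer_id] < self.budget

        def record(self, customer_id: str, duration_ms: float) -> None:
            # A slow request consumes far more of the budget than a fast one,
            # unlike a simple requests-per-minute counter.
            self.spent[customer_id] += duration_ms

A plain requests-per-minute counter treats a 10 ms request and a 10,000 ms request identically; spending the budget in milliseconds is one way to stop a handful of very slow requests from congesting the API.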

What we did

We have a number of strategies that we use to scale our platform. One strategy allows us to move a single customer from one database host (RDS instance) to another. Once we isolated the issue, this customer was moved to their own database instance. The AskNicely application instantly became responsive and all our server metrics returned to what we would consider normal parameters.
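
For the curious, the idea behind that strategy is simple tenant-to-database routing: each customer maps to a database host, and once their data has been copied, moving them is a matter of updating that mapping. The sketch below is an illustration with hypothetical hostnames and a hypothetical routing table, not our actual implementation:

    # Sketch of tenant-to-database routing. Hostnames and the table are hypothetical.
    TENANT_DB_MAP = {
        "customer-a": "rds-usa-shared-1.example.internal",
        "customer-b": "rds-usa-shared-1.example.internal",
        "customer-c": "rds-usa-dedicated-1.example.internal",  # moved to its own instance
    }
    DEFAULT_DB_HOST = "rds-usa-shared-1.example.internal"

    def database_host_for(customer_id: str) -> str:
        """Return the database host that serves this customer's data."""
        return TENANT_DB_MAP.get(customer_id, DEFAULT_DB_HOST)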

We have also worked on several bottlenecks, including:

  • We have scaled up our primary USA database server, tripling its capacity in both size and dedicated IOPS.
  • We have scaled our Redis instance, which provides a powerful and fast caching service for parts of the application, to 6x its previous capacity (see the caching sketch after this list).
  • We have changed several variables on our RDS instance to allow higher loads.
  • We have added another application server to the server pool.
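
To show what that Redis layer does for us, here is a minimal cache-aside sketch using the standard redis-py client. The connection settings, key names, TTL and the stubbed database query are all illustrative:

    import json
    import redis

    r = redis.Redis(host="localhost", port=6379)  # illustrative connection settings

    def load_stats_from_database(customer_id: str) -> dict:
        # Stub standing in for an expensive database query.
        return {"responses": 0}

    def get_survey_stats(customer_id: str) -> dict:
        """Cache-aside: serve from Redis when possible, fall back to the database otherwise."""
        key = f"stats:{customer_id}"
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)
        stats = load_stats_from_database(customer_id)
        r.setex(key, 300, json.dumps(stats))  # cache for five minutes
        return stats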

What we are planning to do

  • Add detailed API monitoring: time, frequency, tenant and database (see the sketch after this list).
  • Improve our API rate limiter.
  • Refactor our API code that caused us issues and most likely refactor a particular query that caused the heavy load on our database.
  • Provide a way to gracefully degrade AskNicely so that core/key services are not affected.
  • Improve our 502 error page to link to our StatusPage so we can give our customers more timely updates.
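
To give a sense of the first item above, this is a minimal sketch of per-request measurement that records time taken, tenant and database host for every API call. The decorator, field names and stubbed host lookup are hypothetical, not a description of our codebase:

    import functools
    import logging
    import time

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("api.monitor")

    def database_host_for(customer_id: str) -> str:
        return "rds-usa-shared-1.example.internal"  # stub; see the routing sketch above

    def monitor_api(endpoint: str):
        """Log time, frequency, tenant and database host for each API request (illustrative)."""
        def decorator(handler):
            @functools.wraps(handler)
            def wrapper(customer_id, *args, **kwargs):
                start = time.time()
                try:
                    return handler(customer_id, *args, **kwargs)
                finally:
                    elapsed_ms = (time.time() - start) * 1000
                    log.info("endpoint=%s tenant=%s db=%s duration_ms=%.1f",
                             endpoint, customer_id, database_host_for(customer_id), elapsed_ms)
            return wrapper
        return decorator

    @monitor_api("/api/v1/responses")
    def list_responses(customer_id):
        return []  # placeholder handler

    list_responses("customer-a")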

Again we are sorry, and we are working hard to rectify these issues.

John // CTO and co-founder AskNicely

Posted Oct 09, 2018 - 20:40 NZDT

Resolved
This issue is now resolved. We have identified the root cause and made several changes to rectify it. We will continue to monitor over the next several days.
Posted Oct 09, 2018 - 12:18 NZDT
Update
We are continuing to monitor. We have made a significant change that appears to have rectified the issue.
Again, we are monitoring this and we will do a debrief today.
Posted Oct 09, 2018 - 09:44 NZDT
Monitoring
We have identified an issue and are now monitoring.
Posted Oct 09, 2018 - 04:25 NZDT
Investigating
We are investigating a 502 error in the US data center. We have several engineers looking into the issue.
Posted Oct 09, 2018 - 04:13 NZDT
This incident affected: AskNicely Application.