Xchart Internal API is down
Resolved
Aug 11 at 02:15pm PDT
Hey everyone, as some of you noticed, Xchart experienced some downtime today.
You can see it as a red blip here on our official status page. Access to our API was unavailable to some of our customers for up to an hour. It took a total of 1 hour and 7 minutes to fully resolve according to our status monitoring.
Today's outage is the single longest outage we've experienced to date.
For this, we are truly sorry.
The last time a core part of the service was unavailable in the past 90 days was a 5-minute stretch during which servers in Asia, Europe, and Australia were unable to reach our API.
I wanted to take the time to write up an explanation of what happened, because frankly, it's not what we want our customers to experience.
What was the problem?
It was caused by a missing DNS record.
The DNS system is how a browser looks up a domain name like xchart.com and figures out which server to request data from.
So the domain for our internal data API wasn't resolving, because there was no record there to be found. If you were in the affected group, your computer wasn't getting anything back from our data API because it couldn't reach it at all. Our servers were running just fine, but the request from your computer never even made it to them.
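For the technically curious, here's roughly what that failure looks like from the client side. This is only an illustrative Python sketch with a placeholder hostname, not our actual code or domain:

```python
import socket

def can_resolve(hostname: str) -> bool:
    """Return True if DNS can map the hostname to an IP address."""
    try:
        # Ask the operating system's resolver for an address record.
        socket.getaddrinfo(hostname, 443)
        return True
    except socket.gaierror:
        # No record found: there is no server address to send a request to,
        # so the request never even leaves your computer.
        return False

# Placeholder name for illustration, not our real internal API host.
print(can_resolve("internal-api.example.com"))
```

In other words, when the record is missing, the lookup itself fails, so no request ever reaches a server.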
Why was the DNS record missing?
This is the part that gets frustrating to us.
We made what should have been an inconsequential change.
We were trying to edit the DNS record for our internal API to have a shorter cache time. This was a small prerequisite step in preparation for the actual change, which was planned to happen over the weekend. It seemed like a small enough change that we felt comfortable doing it during normal business hours. Plus, it's something you want to do before the actual change so that old caches with the old time have a chance to clear out.
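For anyone curious what that cache time looks like in practice, here's a small illustrative sketch using the third-party dnspython library, with a placeholder domain rather than our real internal API:

```python
# Requires the third-party dnspython package: pip install dnspython
import dns.resolver

def remaining_ttl(hostname: str) -> int:
    """Return the cache time (TTL, in seconds) reported for the A record."""
    answer = dns.resolver.resolve(hostname, "A")
    return answer.rrset.ttl

# Placeholder domain for illustration.
print(remaining_ttl("example.com"))
```

The idea is to lower that number well ahead of the real change, so any resolver still holding a cached copy of the old record expires it quickly once the record actually changes.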
The web interface we use to make these changes doesn't let us edit a record in place. Instead, you have to create a new one and delete the old one.
We didn't want to risk a moment where no record existed, so we created the new duplicate record first, then deleted the old one.
We made the change and everything seemed fine: no big deal, no problem, it appeared to have worked exactly as expected.
But then our monitoring tools started sending alerts that the system was unreachable.
We immediately knew something bad had happened, because this should've been a totally harmless change, and according to the screens we were looking at, a record had remained in place the entire time.
We tried undoing the change and putting the old record back, and we checked with various tools to see whether the record was properly in place, but several of them kept saying it was still not there.
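That kind of checking boils down to asking several independent public resolvers whether the record is visible at all. Here's an illustrative sketch of the idea, again with a placeholder domain and the third-party dnspython library, not our exact commands:

```python
# Requires the third-party dnspython package: pip install dnspython
import dns.exception
import dns.resolver

# A few well-known public resolvers to cross-check against.
PUBLIC_RESOLVERS = {"Google": "8.8.8.8", "Cloudflare": "1.1.1.1", "Quad9": "9.9.9.9"}

def check_everywhere(hostname: str) -> None:
    """Ask several independent resolvers whether the record exists."""
    for name, ip in PUBLIC_RESOLVERS.items():
        resolver = dns.resolver.Resolver()
        resolver.nameservers = [ip]
        try:
            answer = resolver.resolve(hostname, "A")
            print(f"{name}: {[record.address for record in answer]}")
        except dns.exception.DNSException:
            print(f"{name}: record not found (or lookup failed)")

# Placeholder domain for illustration.
check_everywhere("example.com")
```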
At this point, we desperately tried to contact support at this vendor. Thankfully, they got back to us reasonably fast with this simple response:
"Hi, This should now be resolved. Please re-try."
We pointed out that the issue was still intermittently happening, and the explanation was:
"It should eventually stabilise. Since the change has been made from our end just now, it could take some time."
and when we asked what the cause was, we got this:
"The DNS record was saved in our database, but not on our DNS provider. I added it there."
It turns out the web interface we were using to make this change was lying to us. It was showing that the change had been successful when in reality it wasn't.
Wow.
Once it was actually in place, we spent the next half hour confirming from various sources that things were in fact working, and notifying the customers who had reached out to us to "please try it again."
What we learned, and what we're going to change
What worked well:
- Having a status page that auto-updates based on its own tests against our production systems (a rough sketch of that kind of check is just below).
- Our monitoring alerted us to the issue long before any customer did.
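For context, the checks behind both of those bullets boil down to something simple: periodically request a known endpoint and flag the service as down when the request fails or times out. This is only a rough illustrative sketch with a placeholder URL, not our actual monitoring code:

```python
import urllib.request

# Placeholder endpoint; real checks would hit the production API.
HEALTH_URL = "https://api.example.com/health"

def service_is_up(url: str = HEALTH_URL, timeout: float = 5.0) -> bool:
    """One probe: True if the endpoint answers with a 2xx status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False

if __name__ == "__main__":
    print("UP" if service_is_up() else "DOWN")
```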
What failed:
- The DNS service provider we use showed us incorrect information. The record was missing, but it said it was in place.
- Despite the seemingly innocuous nature of the change, we likely should not have been making DNS changes during normal operating hours.
What we'll do differently:
- DNS changes, no matter how innocuous, should be reserved for weekends or at least off-hours.
- We will switch to a different DNS service provider. Most likely we'll move this to Google Cloud as well; that's where the majority of our infrastructure already runs, and we have been extremely impressed with the overall reliability of the other Google Cloud services we use.
A Personal Apology
As CEO, the buck stops with me. I'm sorry for what happened today. If you're going to blame anyone, blame me.
We're grateful for your trust and continued business. We will continue to strive to build the best anesthesia charting system we know how to build.
Stay Paperless,
Henrik Joreteg, CEO Xchart, Inc.
Affected services
Xchart Internal API
Updated
Aug 11 at 12:00pm PDT
Xchart Internal API recovered.
Affected services
Xchart Internal API
Created
Aug 11 at 10:50am PDT
Xchart Internal API went down.
Affected services
Xchart Internal API