Database service disruption today
Resolved
Oct 03 at 01:44pm PDT
Hello everyone. Unfortunately, we had a period of about 4 minutes today during which our main database was unreachable. It occurred between 12:19 and 12:23 PDT.
Even though the database was unreachable for only a short time, the impact was significant, especially if you happened to be in the middle of a case.
What was the impact?
If you were not signed in during the incident, there is no known impact to your account or your data.
If you were signed in, you likely found yourself suddenly redirected to the sign-in page, and during that period you would not have been able to sign back in right away.
Additionally, even though the database was only "off" for a short period, it does take these systems a bit of time to spin up again. So it could have been a bit more than 4 minutes during which you would not have been able to log in.
This is not the type of experience we want our customers to have with our product.
We're sorry for the disruption.
How did it happen?
Even though each organization's clinical database is kept in its own data store, there is a shared database for handling user accounts, etc. That shared database is what became unreachable.
- We were shutting down an old, unused cloud server.
- We didn't know that Google Cloud used to link certain types of servers to certain database services, and we didn't know that our setup fit those criteria.
- When we disabled the old/unused server, unbeknownst to us, Google Cloud also automatically disabled the entire database.
What did we do to fix it?
As soon as we realized something was wrong after making the change, we immediately re-enabled the service that we had just shut down.
This restored functionality. The problem was that anyone who had been signed in would have been automatically signed out the next time their session was renewed (which is every 3 minutes).
Why did we want to shut it down? And why do it during the day while there are surgeries going on?
We shut it down because it was a housekeeping item raised during our recent security audit.
We shut it down, even in the middle of the work day like this, because it was unused and we were not aware of this hidden link.
What could we have done to prevent this?
The fact that disabling this old service would have any impact on our database was hard to predict.
But that doesn't change the fact that it is our responsibility to keep things running as best as we can.
We will take steps to carefully unlink these systems before we attempt to remove the old service again, and when we do, it will be during off-hours.
I take full responsibility for the incident.
My personal apologies to those who were affected.
If you were negatively affected or you'd like to discuss this further, you can email me directly at henrik@xchart.com and I'll do my best to help however I can.
Henrik Joreteg
CEO Xchart, Inc.
Affected services
Xchart Internal API
Case Editor
Case Manager