Shift Database Outage

Main Thread 4 min read

It was 1:28pm. I received the following notification:

SQLSTATE[42S02]: Base table or view not found: 1146 Table 'orders' doesn't exist

I ignored it at first as sometimes the Shift workers drop the database connection. Then I received another. And another. And another.

I went to laravelshift.com. 500.

I tailed the logs and this error was streaming in. Given the frequency I figured MySQL locked up. So I killed the web server to stop requests coming in and allow MySQL to catch up.

I checked the server stats and everything was normal. I quickly made a database backup incase MySQL did crash. Then I restarted the web server.

Same error.

I killed the web server again and looked at database. The tables were indeed missing.

I went to restore the database from the recent backup. But, I had overwritten it in my initial assumption MySQL had locked up. This forced me to restore from the weekly backup.

Given the recent release of the Laravel 7, this is high season for Shift. So that meant a data loss of roughly 600 Shifts.

Now all these Shifts were completed. So the data loss wasn’t an immediate issue. But it was a poor experience. Users like seeing recent Shifts in the dashboard, as well as being able to create invoices.

So I spent most of Wednesday evening using Stripe data and the Shift run log to rebuild as many of the orders as possible.

Thursday morning I sent out an email to users letting them know if they were missing a Shift, to reach out to support to help get it restored.

At this point it seems like there’s only a few dozen orders missing.

In the end, not a terrible outage. But I learned a few things. This is my postmortem.

Don’t panic

I made assumptions. While these were based on experience, I also got sucked into the moment.

The information was in front of me - the orders table was missing. Yet this was one of the last things I checked.

I was focused on getting the site back up. Shift is my livelihood. If it's down, I have no income. So that was my top priority.

That's fair. But I went rogue. I started throwing commands at the problem instead of following a recovery plan. In doing so, I made the situation worse.

Root Cause Analysis

As I shared in my postmortem on the shifty email bug, it's important to ask why.

It was clear the initial cause was missing database tables. But why were they missing?

I had a pretty good idea that this was from creating a new Shift worker.

Part of the process of creating that worker is for it to build its own version of the Shift application using artisan migrate:fresh.

That would cause the tables to be missing. But, it didn’t completely explain why. For instance, why did migrate:fresh run against the production database and not its own database?

That was the real question. If I didn’t answer that, it would only be a matter of time until this happened again.

It took a little reflection, but I remembered I ran into an issue with environment variables when building Shift for Docker. Environment variable were being inherited. I remember it seemed odd in the moment, but then making sense from a system process perspective.

Since Shift spawns a new worker process, the worker initially receives the environment variables from the parent process. In this case, the web application.

As such, early commands have access to these environment variables. Even though I use the --env option, variables were still being merged.

What’s interesting, is that I didn’t notice this for the other worker. It turns out this only happens in a fresh environment with uncached configuration.

That is if I cache the Laravel configuration with artisan config:cache, this behavior does not occur.

I didn't want to rely on this behavior though. So the ultimate solution was to ensure that I spawned a fresh process. One that did not inheriting the environment variables from the parent process.

I could do this a few ways, but the easiest was to pass the third argument to Symfony's Process object.

$process = Process::fromShellCommandline(
    $command,
    null,
    Automation::ignoredLaravelEnvVariables()
);

Support and Transparency

Even though I recovered from this outage quickly and did root cause analysis, I still felt bad about the data loss.

Some might have said, "Oh well, their Shift already ran. It's fine."

That didn’t feel right to me. I pride myself on support. I believe Shift has a high value. I didn't want this outage might hurt its value.

That's why I spent time writing scripts to rebuild the missing orders. I sent out an email inviting users who were still missing orders to reply.

I still wasn't able to recover everything. But I felt better taking actions to resolve the matter as best as I could.

It also turned out to be something users really appreciated. Several responded to the email with understanding and praise.

In the end, I think the transparency went a long way to actually add value and build trust.