Working in a tech start-up can be exhausting. It’s very rewarding, but exhausting. With a plethora of day-to-day tasks, it’s easy to get lost in the details. We are still very much discovering what our ideal product looks like, so it’s important to keep the ability to dream and define our grand vision for the product. To be honest, I had lost, or at least suppressed, this ability in recent months. A disconnected holiday break was very much needed to recharge the batteries and refill on élan.
At the moment, I’m the only person involved with tech day to day at our company, which relies heavily on our platform to generate monthly revenue. We’re paid almost exclusively based on our performance, so we have to perform each month if we want salaries the following month. Hence, it’s critical that our platform stays operational.
In this article, I want to explore the methods that let me take a break in which I could disappear for two weeks, disconnect (almost entirely) and trust I would still have a company to return to.
Prevention
When it comes to the stability of production software, I’m very much a fan of saying: “defence is the best offence”. I strongly believe clean, intelligent design and testing can prevent most disasters. Not all of them, as there will always be cases nobody predicted, especially as the software scales from processing thousands to millions of rows of data per hour. Over-testing and over-stressing can also significantly slow down innovation, so I have to be mindful not to overdo it: the law of diminishing returns definitely applies here.
Good engineering practices: strong type system and automatic testing
I’ve already written an article about how we chose our tech stack and why, so I don’t intend to repeat myself at length here. As noted there, a strong type system, linters, tests, and automated build and deployment pipelines are among the crucial components of production stability.
Try to break it
Computer security is another of my interests. Learning how to break (hack) software is a skill that helps enormously when building fault-tolerant applications. I try to build a habit into my development process of thinking about how I could break the code if I needed to. This helps me uncover edge cases and underlying assumptions, and lets me stress test them deliberately.
Automatic retries
I don’t fool myself into thinking errors will never happen. With web applications, there are a million things that can fail, some of which are entirely out of our control. A network connection might be dropped, a power outage might occur, etc. Therefore, I think it’s vital that we accept this uncertainty in advance and design for it, for example by automatically retrying operations that fail for transient reasons.
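To make the idea concrete, here is a minimal sketch of a retry helper with exponential backoff and jitter. It is an illustration rather than our production code: the names retry and flaky_request, the default delays, and the choice of which exceptions count as transient are all assumptions made for the example.

```python
import random
import time


def retry(operation, attempts=5, base_delay=0.5, max_delay=30.0,
          retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation() until it succeeds or the attempts run out.

    Hypothetical helper: only exceptions listed in retry_on are treated
    as transient; anything else propagates immediately.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the error surface and get logged
            # Exponential backoff (0.5s, 1s, 2s, ...) capped at max_delay,
            # with jitter so many clients do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))


# Usage: a made-up operation that fails twice, then succeeds.
calls = {"count": 0}


def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("connection dropped")
    return "ok"


# sleep is swapped for a no-op so the demo does not actually wait.
result = retry(flaky_request, sleep=lambda seconds: None)
```

Retrying only makes sense for operations that are safe to repeat, so the retried call should be idempotent or guarded against duplicates.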
Ability to restart and continue
Our application is very much stateful and relies heavily on the state stored in the database. It also depends on asynchronous scheduled tasks, which perform most of the logic: from sending emails to updating the database.
The scheduling and execution of tasks is quite reliable and predictable, but there can be downtime. I make sure that tasks are designed with no hidden assumptions about when they were last executed. When a task starts running, it checks the state in the database and continues where it last finished.
An example from our production: we schedule the emails we’ll be sending a day in advance. Each scheduled email has a “wait_until” field holding the datetime at which we want to send it. The sending task sends every pending email whose “wait_until” is earlier than the current time. Normally, the task is scheduled and executed so that the actual sending time is close to the “wait_until” value. If there is downtime, however, the task automatically catches up the next time it runs successfully.
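The catch-up behaviour can be sketched as follows. This is a simplified illustration using an in-memory SQLite database, not our actual implementation: the table name scheduled_email, the sent_at column, and the send_due_emails function are assumptions made for the example; only the “wait_until” field comes from the description above.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def send_due_emails(conn, send, now=None):
    """Send every unsent email whose wait_until has passed; return the count.

    The task makes no assumption about when it last ran: it only looks
    at the state stored in the database.
    """
    now = now or datetime.now(timezone.utc)
    due = conn.execute(
        "SELECT id, recipient FROM scheduled_email "
        "WHERE sent_at IS NULL AND wait_until <= ? ORDER BY wait_until",
        (now.timestamp(),),
    ).fetchall()
    for email_id, recipient in due:
        send(recipient)
        # Mark each email as sent right away, so a crash halfway through
        # does not cause already-sent emails to be sent again next run.
        conn.execute(
            "UPDATE scheduled_email SET sent_at = ? WHERE id = ?",
            (now.timestamp(), email_id),
        )
        conn.commit()
    return len(due)


# Demo with hypothetical data; timestamps are stored as Unix seconds.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scheduled_email ("
    "id INTEGER PRIMARY KEY, recipient TEXT, wait_until REAL, sent_at REAL)"
)
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
rows = [
    ("missed@example.com", now - timedelta(hours=5)),  # missed during downtime
    ("due@example.com", now - timedelta(minutes=1)),
    ("future@example.com", now + timedelta(hours=3)),  # not due yet
]
conn.executemany(
    "INSERT INTO scheduled_email (recipient, wait_until) VALUES (?, ?)",
    [(recipient, when.timestamp()) for recipient, when in rows],
)

outbox = []
first_run = send_due_emails(conn, outbox.append, now=now)
second_run = send_due_emails(conn, outbox.append, now=now)  # nothing left to do
```

Because the query depends only on stored state, running the task late, or twice in a row, gives the same end result as running it on schedule.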
Have you tried turning it off and on again?
Even the most reliable software and infrastructure sometimes fail. For infrastructure failures, we have redundancy in our cluster so that the control plane can reshuffle containers onto a working node. We also make sure to always implement meaningful health and liveness checks for our containers in Kubernetes. This works exceptionally well and auto-magically resolves a lot of problems without any manual intervention.
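As a rough sketch of what a “meaningful” check can look like, the endpoint below reports healthy only when every dependency check passes. Everything here is illustrative: the /healthz path, the port, and the check names are assumptions, and the example uses only the Python standard library rather than any particular web framework. An HTTP liveness probe in Kubernetes treats a status code outside the 200 to 399 range as a failure and restarts the container after repeated failures.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def health_status(checks):
    """Run each named check; return (HTTP status code, JSON body)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False  # a crashing check counts as unhealthy
    status = 200 if all(results.values()) else 503
    return status, json.dumps(results, sort_keys=True)


def make_handler(checks):
    """Build a request handler class bound to the given checks."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/healthz":
                self.send_error(404)
                return
            status, body = health_status(checks)
            payload = body.encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return HealthHandler


def serve(checks, port=8080):
    """Blocking call; a probe would request GET /healthz on this port."""
    ThreadingHTTPServer(("", port), make_handler(checks)).serve_forever()


# Demo without starting a server: one passing and one crashing check.
status, body = health_status({"database": lambda: True, "queue": lambda: 1 / 0})
```

The point of the design is that the check exercises something the application actually depends on, instead of returning 200 unconditionally.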
Error logging and notifications
This one is quite straightforward: it’s important to have easy visibility into what is happening in production. On the most basic level, this means at least error logging. We have two log monitoring systems in production: one using Sentry and the other based on Promtail, Loki, and Grafana. I became a huge fan of Sentry and would definitely recommend trying it out, if you haven’t already. It has excellent out-of-the-box integrations with most of our tech stack and requires almost no effort to set up an outstanding error logging system.
Set up notifications
I have connected Sentry with Slack to get instant notifications about errors in production. I know this might sound counter to my mission to disconnect, but it’s actually a critical part. I still needed to ensure everything was working while I was away, and notifications meant I didn’t have to log in and manually check if everything was running okay. If something happened, I knew I would be notified.
The Result
We had two minor downtimes during my holiday, both caused by too many connections being left open to our database. In both cases, I was notified and managed to resolve the issue (with the help of Nejc) within 15 minutes, which is not bad, considering I was at the beach both times. Altogether, I spent less than 45 minutes of my break actually working. The only thing I had to do was keep a phone on me in case something happened. I managed to disconnect completely on 12 of the 14 days I was away.