Sergio Ammirata, Ph.D., founder and chief scientist at SipRadius
We have always been proponents of self-hosting everything in-house, as it gives us full control and enhanced security. By everything, we mean email services, websites, general-use internal tools like Jira and Confluence, file repositories, and communication applications.
When self-hosting, disaster recovery is a critical consideration: Do we have recent backups for data, software, and hardware? What happens if our primary site goes down? Is there a secondary one ready to take over? If so, is it fully active and connected, with a load-balancing mechanism in place?
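Questions like these lend themselves to automation, so they get answered continuously rather than only during an audit. As a rough illustration, the Python sketch below checks the first two items: that the most recent backup is fresh, and that a secondary site is reachable. It is a sketch only; the paths, hostname, port, and thresholds are hypothetical placeholders, not a description of our actual tooling.

```python
#!/usr/bin/env python3
"""Minimal disaster-recovery checklist as code (illustrative sketch)."""
from datetime import datetime, timedelta, timezone
from pathlib import Path
import socket

BACKUP_DIR = Path("/srv/backups")             # hypothetical backup location
MAX_BACKUP_AGE = timedelta(hours=26)          # daily backups plus some slack
SECONDARY_SITE = ("standby.example.net", 22)  # hypothetical secondary host/port


def newest_backup_age(backup_dir: Path) -> timedelta | None:
    """Return the age of the most recently modified file under backup_dir."""
    files = [p for p in backup_dir.rglob("*") if p.is_file()]
    if not files:
        return None
    newest = max(p.stat().st_mtime for p in files)
    return datetime.now(timezone.utc) - datetime.fromtimestamp(newest, timezone.utc)


def secondary_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Check that the secondary site answers on a known TCP port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    age = newest_backup_age(BACKUP_DIR)
    if age is None or age > MAX_BACKUP_AGE:
        print(f"WARNING: latest backup is stale or missing (age: {age})")
    else:
        print(f"OK: latest backup is {age} old")

    host, port = SECONDARY_SITE
    status = "reachable" if secondary_reachable(host, port) else "UNREACHABLE"
    print(f"Secondary site {host}:{port} is {status}")
```

Run from cron or a monitoring agent, a check like this turns the checklist above into an alert rather than a question someone remembers to ask.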
Maintaining a real-time replica of all hardware, software, and data at a secondary site is a significant challenge for a small company, both financially and logistically. Our solution was a state-of-the-art Tier-3 datacenter just a 30-minute drive from our office. The facility was equipped with giant generators ready to power the entire center at a moment's notice, with enough fuel to ride out any storm, even a Category 5 hurricane. It provided extraordinary redundancy in air conditioning, battery backups, and internet connectivity, giving us confidence that our infrastructure was well protected.
With this robust setup, we established a daily backup schedule within our datacenter and offloaded these backups to our office once per week. Recently, we even downsized our office infrastructure and retained only a subset of critical servers to keep costs down. We were sure that our primary site, with its 10Gbps internet connection and redundant power systems, was ready to face any challenge.
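A schedule like this can be driven by a short cron-invoked script. The sketch below shows the shape of it in Python, wrapping rsync for the daily in-rack copy and the weekly push to the office. The directories, remote host, and chosen weekday are assumptions for illustration, not our production configuration.

```python
#!/usr/bin/env python3
"""Sketch of a daily-backup / weekly-offload schedule, run from cron."""
import datetime
import subprocess

DATA_DIR = "/srv/data/"                # hypothetical source of truth
LOCAL_BACKUPS = "/srv/backups/daily/"  # hypothetical in-rack backup target
OFFICE_TARGET = "backup@office.example.net:/backups/weekly/"  # hypothetical office host
OFFLOAD_WEEKDAY = 6                    # Sunday (Monday == 0)


def rsync(src: str, dst: str) -> None:
    """Mirror src into dst, preserving attributes and pruning deleted files."""
    subprocess.run(["rsync", "-a", "--delete", src, dst], check=True)


if __name__ == "__main__":
    # Daily: keep an in-datacenter copy of the data set.
    rsync(DATA_DIR, LOCAL_BACKUPS)

    # Weekly: push the same copy across the WAN to the office site.
    if datetime.date.today().weekday() == OFFLOAD_WEEKDAY:
        rsync(LOCAL_BACKUPS, OFFICE_TARGET)
```

The point of the weekday check is simply that the same daily job handles both cadences; the weekly offsite copy is what kept our data safe regardless of what happened to the rack itself.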
The Challenges We Faced During Hurricane Helene
Then came Hurricane Helene. Although the area was hit only by the outer bands of the storm and suffered minimal damage, an electrical panel feeding power to the datacenter caught fire. The fire department ordered electricity to the entire facility shut off, and soon the UPS batteries drained, as expected, leaving our rack, and the entire building, without power. This happened at 5:30 AM.
Attempts to reach the datacenter's Network Operations Center were unsuccessful, as it too had lost power. The facility's security systems, designed to keep the building safe, now kept everyone out: the automatic door locks, which were supposed to unlock in exactly this situation, a fire, failed to open because of the power loss. The situation felt surreal: no one could access their servers because the security systems were doing their job a little too well.
By 9 AM, a crowd of frustrated customers, myself included, had gathered outside the building, seeking answers or, at the very least, access to our equipment. Finally, around 10:30 AM, power was restored and the doors were unlocked. But to our dismay, our racks still had no power: the UPS units required a manual reset, and no one on-site knew how to perform it. The datacenter had to call in outside experts, with no clear timeline on how long that would take.
Faced with prolonged downtime, we decided to bring most of our servers back to our office and restore services from there. By the time the UPS units were fixed an hour and a half later, we had already racked the servers at the office and restored our infrastructure.
Rethinking Our Disaster Recovery Approach
This experience was a wake-up call. We had designed our disaster recovery strategy around the assumption that only an extreme event could bring down the datacenter. In reality, the weak link turned out to be power distribution. Most datacenters do not have redundant power circuits within their facilities, and redundant grid connections are even rarer. In the United States, datacenters are obligated to follow fire department orders, even if this means disconnecting power and leaving equipment offline.
Our strategy needed to change. Moving forward, our plan is to make our office the primary site, housing all the core hardware, while setting up smaller replicas in remote datacenters. These replicas will be primarily virtualized to reduce dependency on additional hardware. We realize now that having direct control over our main infrastructure is key, allowing us to respond quickly in case of an issue. Remote replicas will serve as backups, kept in sync with the primary site.
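Keeping those replicas in sync does not have to be elaborate. The sketch below shows one simple shape of it in Python: push the primary data set to each remote replica and record the last successful sync so staleness is visible. The hosts, paths, and state file are hypothetical, and a real deployment would more likely lean on hypervisor-level replication or filesystem snapshots; this is only meant to illustrate the direction we are taking.

```python
#!/usr/bin/env python3
"""Sketch of keeping remote, virtualized replicas in sync with a primary site."""
import datetime
import json
import subprocess
from pathlib import Path

PRIMARY_DATA = "/srv/data/"                               # hypothetical primary data set
REPLICAS = [                                              # hypothetical remote replica hosts
    "replica@dc-east.example.net:/srv/replica/",
    "replica@dc-west.example.net:/srv/replica/",
]
STATE_FILE = Path("/var/lib/replication/last_sync.json")  # hypothetical sync record


def sync_replica(target: str) -> bool:
    """Push the primary data set to one replica; return True on success."""
    result = subprocess.run(["rsync", "-az", "--delete", PRIMARY_DATA, target])
    return result.returncode == 0


if __name__ == "__main__":
    state = {}
    if STATE_FILE.exists():
        state = json.loads(STATE_FILE.read_text())

    for target in REPLICAS:
        if sync_replica(target):
            state[target] = datetime.datetime.now(datetime.timezone.utc).isoformat()
        else:
            print(f"WARNING: sync to {target} failed; last good sync: {state.get(target)}")

    STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
    STATE_FILE.write_text(json.dumps(state, indent=2))
```

Recording the last successful sync per replica matters as much as the sync itself: when the primary site does go down, the first question is how old each replica is.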
Practical Insights for Infrastructure Resilience
We understand that our experience may resonate with other small tech companies aiming for infrastructure resilience in an increasingly complex world. Here are some key lessons we learned:
- Datacenter Limitations: Even Tier-3 facilities can have weaknesses. Redundant power distribution and accessibility during outages are crucial factors that are often overlooked. And even when the systems are in place, there may be no on-site personnel trained and ready to act in a moment of crisis.
- The Importance of Direct Control: Self-hosting offers control, but without the right redundancy strategy, that control can easily become a vulnerability. Balancing in-house infrastructure with distributed backups provides a more resilient solution. You always have control over your own people; you have far less control over the datacenter's staff.
- Rethinking Redundancy: True resilience isn't just about relying on sophisticated infrastructure. Understanding its limits and potential points of failure, especially the limits the human factor brings to the equation, is essential when planning for continuity.
Adapting for the Future
We have adapted our approach, and in doing so, learned that flexibility and a willingness to reassess strategies are just as important as the technology itself. Hurricane Helene showed us the gaps in our plans, and by sharing our experience, we hope to encourage others in the industry to rethink their assumptions and improve their resilience.
Even the best-laid plans can face unexpected challenges, but with the right mindset and a commitment to learning from each experience, infrastructure can become more adaptable to the uncertainties of an increasingly challenging world.