Migrate from Public Cloud: Building Kubernetes Bare-Metal Infrastructure
Introduction
With increasing cloud costs and the growing need for control over infrastructure, many organizations are exploring on-premises solutions. For us, the solution was Kubernetes on bare-metal servers—a challenging but rewarding journey. This article takes you through every step of the process, the challenges faced, and how specific tools played a crucial role in our success.
If you're considering moving workloads off the cloud while maintaining flexibility and scalability, this detailed technical guide is for you.
Why Kubernetes on Bare-Metal?
Kubernetes is often touted as the go-to solution for container orchestration. However, in this case, we used it for more than managing containerized applications. Kubernetes served as a fleet management system, taking responsibility for server-level tasks traditionally handled by separate tools.
This approach allowed us to:
- Reduce reliance on cloud services and cut costs by more than 70%.
- Centralize management of physical hardware and workloads.
- Build an infrastructure optimized for high control and performance.
Despite the clear benefits, deploying Kubernetes on bare-metal is complex. Typical challenges include provisioning servers, managing distributed storage, and ensuring reliable networking. Here's how we tackled these obstacles.
Core Tools and Technologies
To streamline the deployment and operation of Kubernetes on bare-metal, we leveraged several powerful tools. Below are the core technologies that made this possible.
Talos Linux
Talos Linux by Sidero Labs is a specialized Linux distribution designed exclusively for running Kubernetes. Unlike general-purpose operating systems, Talos:
- Is lightweight (less than 50 MB)
- Operates entirely through an API, eliminating the need for a shell
- Reduces the attack surface, significantly improving security
Here’s how Talos works in practice:
- Installation: Talos requires you to define your cluster configuration in a declarative YAML file. Using `talosctl`, you can provision the OS on your bare-metal nodes, turning them into Kubernetes-ready machines (see the sketch after this list).
- Management: All interactions with Talos occur via the API, which you can access through `talosctl`. This eliminates SSH access, further securing the nodes.
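To make this concrete, here is a minimal sketch of that flow. The cluster name, node IPs, interface, and hostname are placeholders rather than our actual values, and the YAML excerpt is an illustrative machine-config patch, not our production configuration.

```yaml
# Hypothetical bootstrap flow (cluster name and IPs are placeholders):
#   talosctl gen config demo-cluster https://10.0.0.10:6443
#   talosctl apply-config --insecure --nodes 10.0.0.11 --file controlplane.yaml
#   talosctl bootstrap --nodes 10.0.0.11 --endpoints 10.0.0.11
#   talosctl kubeconfig --nodes 10.0.0.11 --endpoints 10.0.0.11
# Excerpt of a Talos machine-config patch for one control-plane node:
machine:
  network:
    hostname: cp-01
    interfaces:
      - interface: eth0
        dhcp: false
        addresses:
          - 10.0.0.11/24
        routes:
          - network: 0.0.0.0/0
            gateway: 10.0.0.1
  install:
    disk: /dev/sda   # single boot disk per node, as in the setup described below
```

Everything about the node, from its network interfaces down to the install disk, lives in this declarative config, which is what lets the whole provisioning step be driven over the API instead of a shell session.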
We assigned a junior DevOps engineer to handle the setup. They were able to install a fully functional Kubernetes cluster on our colocated rack servers with minimal issues, demonstrating how user-friendly Talos is despite its cutting-edge design.
MetalLB for Load Balancing
Load balancing in bare-metal Kubernetes clusters is notoriously tricky because you don’t have the managed load balancers provided by cloud providers. We used MetalLB to address this.
How MetalLB Works:
- Layer 2 Configuration: MetalLB operates in Layer 2 mode, answering ARP requests for service IPs with the MAC address of one of the cluster nodes, so traffic for those IPs reaches the right machine. This approach allowed us to utilize the public IPs mapped to our servers effectively.
- Integration with Kubernetes: By installing MetalLB as a Kubernetes operator, we could define IP address pools in a ConfigMap. MetalLB then assigned IPs to Kubernetes services of type `LoadBalancer` (an example pool configuration follows below).
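As an illustration, here is a minimal Layer 2 pool in the ConfigMap style described above (used by MetalLB releases prior to v0.13; newer releases express the same pools with `IPAddressPool` and `L2Advertisement` custom resources). The address range is a placeholder, not our actual allocation.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    - name: public-pool              # placeholder pool name
      protocol: layer2
      addresses:
      - 203.0.113.10-203.0.113.50    # placeholder public IP range
```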
Implementation Steps:
- Defined the range of public IP addresses available in our environment.
- Installed MetalLB via `kubectl` using the official YAML manifests.
- Configured services (e.g., Nginx and HAProxy) with the `LoadBalancer` type, enabling MetalLB to automatically assign public IPs (see the Service manifest below).
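For the last step, a Service of type `LoadBalancer` is all MetalLB needs in order to hand out an address from the pool. The sketch below assumes a hypothetical Nginx deployment labelled `app: nginx`.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  type: LoadBalancer        # MetalLB watches for this type and assigns a pool IP
  selector:
    app: nginx              # placeholder label matching the backing pods
  ports:
    - port: 80
      targetPort: 80
```

Once applied, the assigned address appears under EXTERNAL-IP in `kubectl get svc`, and MetalLB answers ARP for it from one of the cluster nodes.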
The result was a seamless, scalable, and cost-efficient load balancing setup for our cluster.
Rancher Longhorn for Distributed Storage
Storage presented one of the toughest challenges. Many distributed storage solutions like Ceph or OpenEBS require each node in the storage cluster to have at least two attached disks: one for the OS and another for storage. Our setup only had a single boot disk per node.
Enter Rancher Longhorn, a lightweight yet robust distributed storage solution:
- Longhorn allowed us to utilize single-disk nodes effectively, since it stores replicas as files on each node's existing filesystem rather than requiring a dedicated raw block device.
- It automatically handles replication across nodes, ensuring data redundancy and high availability.
- The configuration process was straightforward, even for our hardware-limited environment.
Implementation Steps:
- Deployed Longhorn via Helm charts into the Kubernetes cluster.
- Configured the storage classes to use Longhorn’s dynamic provisioning.
- Tested replication and failover scenarios to ensure data integrity.
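A rough sketch of the first two steps, assuming the standard Longhorn Helm chart and a three-replica StorageClass; the names and replica count are illustrative rather than a record of our exact values.

```yaml
# Step 1 - install Longhorn from its official chart:
#   helm repo add longhorn https://charts.longhorn.io
#   helm repo update
#   helm install longhorn longhorn/longhorn --namespace longhorn-system --create-namespace
# Step 2 - a StorageClass using Longhorn's CSI provisioner with replication across nodes:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-replicated          # placeholder name
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "3"              # each volume is replicated to three nodes
  staleReplicaTimeout: "30"          # minutes before a failed replica is rebuilt elsewhere
```

A PersistentVolumeClaim referencing this class gets a dynamically provisioned, replicated volume; draining or failing a node then exercises exactly the failover scenario in step 3.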
Longhorn’s ability to adapt to single-disk nodes was a game-changer, enabling us to meet our storage requirements without upgrading hardware.
The Final Architecture
The resulting architecture looked like this:
- Bare-Metal Nodes: Provisioned with Talos Linux, these nodes formed the backbone of our Kubernetes cluster.
- Networking: MetalLB managed the public IP address pool and provided seamless load balancing across services.
- Storage: Rancher Longhorn delivered a distributed storage solution optimized for our hardware constraints.
To manage the cluster, we relied on Kubernetes' native tools and Talos Linux's API-driven approach. These tools allowed us to handle provisioning, configuration, and maintenance without introducing additional complexity or external dependencies.
Cost Savings and Resilience
One of the biggest wins of this project was the cost savings. By migrating nearly all workloads from the cloud to our on-premises Kubernetes cluster, we reduced infrastructure costs by over 70%. However, the cloud wasn’t entirely eliminated—we retained cloud storage for:
- Archival backups: To protect against catastrophic failure of the entire cluster.
- Disaster Recovery: Serving as a secondary layer of data protection.
With this hybrid approach, we achieved a balance between cost-efficiency and resilience. In the event of total failure, IaC tools combined with our cloud backups allow us to rebuild the entire infrastructure in a matter of hours.
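The backup tooling itself isn't the focus here, but as one concrete possibility, Longhorn's built-in backup target can point at cloud object storage, which fits the archival and DR role described above; the bucket, region, and secret below are placeholders.

```yaml
# (a) Hypothetical values.yaml snippet for the Longhorn Helm chart:
#     defaultSettings:
#       backupTarget: "s3://backup-bucket@us-east-1/"          # placeholder bucket/region
#       backupTargetCredentialSecret: "longhorn-backup-creds"
# (b) The referenced credentials secret (placeholder keys):
apiVersion: v1
kind: Secret
metadata:
  name: longhorn-backup-creds
  namespace: longhorn-system
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "REPLACE_ME"
  AWS_SECRET_ACCESS_KEY: "REPLACE_ME"
```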
Future Plans
While the current setup is robust, we’re planning additional improvements:
- Disaster Recovery Center (DRC): A dedicated facility for high availability and data protection.
- Hardware Security Modules (HSM): To secure sensitive cryptographic keys.
- Centralized NAS Storage: For enhanced data sharing and accessibility across teams.
These enhancements will further strengthen the infrastructure, ensuring data protection, scalability, and long-term sustainability.
Key Takeaways
Building an on-premises Kubernetes cluster on bare-metal is not for the faint-hearted, but with the right tools and strategies, it’s achievable. Here are some lessons learned:
- Leverage Specialized Tools: Solutions like Talos Linux, MetalLB, and Rancher Longhorn simplify complex challenges.
- Plan for Failures: Hybrid setups combining on-premises clusters with cloud backups provide resilience.
- Automate Everything: Infrastructure as Code is indispensable for managing complex systems.
For organizations looking to reduce costs and gain control over their infrastructure, bare-metal Kubernetes is a viable option. While the journey is challenging, the rewards—both financial and technical—are well worth the effort.
This experience has not only optimized our infrastructure but also laid the foundation for innovative solutions in the future. We hope to share our findings in technical conferences, inspiring others to explore similar paths. Stay tuned for updates as we continue to refine this system and push the boundaries of what’s possible with Kubernetes on bare-metal.