How to Build Your Own Resilient & Affordable Cloud / Data Center
This is a recipe for building a repeatable, ultra-stable infrastructure designed for maximum uptime: an affordable data center from the ground up.
Before we begin, four key elements need to be considered…
- Location, Location, Location. Pick a spot in the building that can be secured behind lock and key. When assessing the new home for your hypervisor server (covered later), ensure the system stays cool, dry, and powered, and that the room is resilient in the face of environmental factors. Remember that bigger is better when it comes to battery backups, and build redundant power supplies into your physical server. Get a UPS rated at 1500 VA; I would not recommend going smaller. The APC product line is a great resource for all the peripheral hardware needed. A unit of that strength will easily run a single blade server for over an hour. You can always get longer-lasting battery backups; plan on replacing them every 3 to 5 years. With two power supplies in your server for failover, you can simply connect one UPS to each side, and voila! You now have a system that will stay up for at least two hours during a power outage, giving personnel time to react and avoid the worst and most expensive outcome.
- Environment. Have the fire suppression systems in the server room tested, and ensure there is proper cooling and ventilation. Ideally the room will sit between 68°F and 75°F; anything 85°F or higher is not recommended and will lead to hardware degradation. Humidity should be kept between 40% and 60%: too low risks electrostatic discharge, while too high allows condensation that can cause a short. Install temperature and humidity monitors for alerts, and configure your server to send notifications on hardware problems.
- Lockdown. Physical access to the server must be limited. It is far too easy for server closets to become a nightmare of random office supplies and knick-knacks, and there is always the risk of a bitter or malicious employee walking in and unplugging something; easy physical access gives them exactly that opportunity. The server room is a home, and only one person and one thing belong in there: you and the server. Auto-locking doors and motion-activated cameras are recommended here. Everything on the server is logged, so access to the server room should be logged too. Implementing these simple measures removes another variable from the continuity scheme.
- The True Cost. Time is money, and in production environments that can mean 24 hours a day, 365 days a year. Take a moment to gather some data on your revenue and apply it to the length of a downtime. The math is not complicated, regardless of organization size.
If you make $100 a minute, every 10 minutes of downtime costs $1,000. Do not skimp during the building phase to save money on underrated equipment; you could end up purchasing that equipment twice, and will probably buy the extra gear you tried to avoid in the first place.
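The arithmetic above can be sketched as a quick helper; the $100-a-minute figure is just the example rate, so substitute your own:

```python
# Rough downtime cost: revenue lost per minute of outage times minutes down.
def downtime_cost(revenue_per_minute, minutes_down):
    return revenue_per_minute * minutes_down

print(downtime_cost(100, 10))        # 1000 -- ten minutes at $100/minute
print(downtime_cost(100, 24 * 60))   # 144000 -- a full day down at the same rate
```

A full day offline at that modest rate dwarfs the hardware budget discussed later, which is the whole argument for not cutting corners.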
The first ingredient for our build will be Dell blade servers. These off-the-shelf servers are fantastically reliable when provided with healthy voltage regulation. The hardware must be redundant throughout: the server should have failover memory, CPUs, drives, and power supplies. Simply design the build with separate root and data drives, Redundant Array of Independent Disks (RAID) 1 and 5 respectively, and you are off to the races! These multicore processors are quite powerful; calculate your needs based on what each virtual server you plan to run requires, then pad by doubling the result. You will be grateful years down the road when you need to add two more virtual machines to your existing hardware and have the capacity to do it.
This server should reasonably cost between $4,000 and $6,000 per unit. Depending on your resource requirements, plan a minimum of 8GB of memory for each system running on the server, and remember to allocate memory for the host platform itself. Plan for minimums of 8GB for the host, 8GB for the backup core, and 8GB per virtual machine (VM). For example, a host running Quest AppAssure and two VMs (one of them a Windows Server 2019 system) needs 32GB as a baseline. I recommend going high on memory, as it allows for future scalability.
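The sizing rule above reduces to a one-line calculation; the 8GB figures are the baselines from this build, not hard limits:

```python
# Baseline memory: 8GB for the host, 8GB for the backup core, 8GB per VM.
def baseline_memory_gb(num_vms, host_gb=8, core_gb=8, per_vm_gb=8):
    return host_gb + core_gb + num_vms * per_vm_gb

print(baseline_memory_gb(2))      # 32 -- the two-VM example above
print(baseline_memory_gb(2) * 2)  # 64 -- doubled, per the scalability advice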
You will be purchasing two servers per location: one for production and one as the “hot spare” or standby system. Even with the best-laid plans, surprises happen. This design allows the first hypervisor to replicate to the other in real time.
You can go the route of one server per location, but then the considerations become the speed of replication and the turnaround time on recovery. The backup software used in this model replicates changed data incrementally, so all of that data has to travel over the Wide Area Network (WAN) back to the hub. This may be impractical if the servers are used for file services, because you could easily outpace your throughput. Assuming we went with the twin-server option above, we will want to get new virtual machines into the hypervisor and test production as soon as possible. Once you’re satisfied the server is ready to go live, export the system to an external drive and keep it offsite; in a total building catastrophe you will have a last-resort fallback in place. With that complete, we can dig into the application-driven backups and real-time hot spare setup.
In my experience, the Quest AppAssure product is top shelf for SMB disaster recovery: it is not lightweight on resources, but it is incredibly versatile. The system can be tuned granularly to accommodate variations in bandwidth; in other words, it can be tuned to tolerate timeouts, which is critical for offsite replication. Our goal is to use AppAssure to do three things: maintain backups, replicate to a local server for redundancy, and, most importantly, replicate offsite to guard against total catastrophe. I find this amounts to an insurance plan, covering both continuity and catastrophe.
It is time to start building your backup schedules. The best part about scheduling in this software is that you can back up as frequently as your disks can keep up with; if you have upgraded to solid-state drives (SSDs), that could be a fully restorable image every 15 minutes. During setup, the system asks you to set a size limit for the storage of your virtual machine backup data, and the repository cycles out the oldest image for the newest. This is very admin-friendly and relieves the need to manually maintain storage and manage the full backup libraries.
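As a back-of-envelope sketch of that cycling behavior (this is not AppAssure's actual retention logic, and the sizes are illustrative), you can estimate how many restore points a given repository limit will hold:

```python
# Estimate restore points retained before the oldest is cycled out:
# one full base image plus as many incrementals as fit in the remaining space.
def restore_points(repo_gb, base_image_gb, incremental_gb):
    if repo_gb < base_image_gb:
        return 0  # repository cannot even hold the base image
    return 1 + int((repo_gb - base_image_gb) // incremental_gb)

# A 500GB repository, an 80GB base image, ~2GB per 15-minute incremental:
print(restore_points(500, 80, 2))  # 211 -- roughly two days of 15-minute points
```

Deduplication (covered later) stretches these numbers considerably, so treat this as a conservative floor when picking a size limit.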
After setting the backup schedule, rack your second server and begin setting up replication. Launch the hypervisor manager on the second system for reference, load up the AppAssure web console, and simply follow the steps for replication. This takes that living backup and creates a living hot spare on your second system. I will not go into huge detail on setup, as they have made it quite simple. Once this is complete, you have a system that is locally resilient: if the first server fails, you can start up the exact same system in seconds, and if an upgrade to the platform hits severe problems, you can restore from a backup taken prior to the incident. Your downtime is limited only by the transfer rate over the network and from disk to disk.
Now for the final step, offsite replication.
This is probably the longest part of the setup, not because of the software, but because of the throughput of your internet access. For this build, it is best to utilize two existing connections from two independent internet service providers. It is difficult, but important, to get both for fault tolerance, especially if your business has point-of-sale equipment or functions that rely on access. Everything today is web-based, so regard this as mission critical for the design. I suggest a blend of fiber-optic access for primary and coaxial (cable) for backup: the uptime of both types is exceptional, and they rely on independent systems, so a failure of one does not affect the other. If fiber is not an option, use a coaxial connection as primary and a T1 as backup. I do not recommend DSL here (just an opinion, but that access rides on antique phone systems). Next, we need hardware designed for failure: systems that detect outages and fail over to the next available access seamlessly. Fortinet makes an excellent product for this and provides routers with dual or even triple failover, and you can also insert a 4G stick as a tertiary link.
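The failover behavior itself is conceptually simple. Here is an illustrative sketch of the selection logic only; real multi-WAN routers such as Fortinet's do this at the network layer with health checks:

```python
# Pick the first healthy uplink in priority order: fiber, then coax, then 4G.
def pick_uplink(links):
    """links: list of (name, is_up) tuples in priority order."""
    for name, is_up in links:
        if is_up:
            return name
    return None  # total outage -- every link is down

# Fiber is down, so traffic fails over to the coax backup:
print(pick_uplink([("fiber", False), ("coax", True), ("4g", True)]))  # coax
```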
Connect your hardware after provisioning and establish your VPN back to the offsite location; this can be your corporate office or even a cloud-hosted server. With an AppAssure core installed on the offsite system, you can begin the long haul… the offsite initial backup image (the seed). If your production VM is 80GB, you either push 80GB over your WAN to the remote location or copy it locally and physically carry the seed there. As part of the exercise, try seeding over the WAN, as it will provide insight into how your bandwidth will impact your remote backup schedule.
I have seeded multiple 100GB systems over 50Mb/s internet connections, and each took about a week. When you start this process, expect timeout issues: by default, AppAssure is tuned for LAN speeds, but the granular controls let you adjust both the size of the packets transmitted and the timeouts. Each of my seeds was comparable to a one-week cell phone call that never dropped. Once your tuning is complete and timeouts have been addressed, you can begin the schedule for the remote system. Since the data is travelling over significantly slower and more complex networks, your remote backup schedule will inherently be less frequent than the local one; daily backups can easily push 4GB or more in incremental changes, which could mean two hours per remote backup. AppAssure’s method for incremental VM backups is as lightweight as you can get, and the built-in deduplication features really scale down the total storage needed for system backups. Check out these great articles on replication methods and deduplication.
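These timings can be sanity-checked with a back-of-envelope transfer estimate. The effective-throughput figures below are assumptions chosen to match the experience described above; tuned WAN replication runs well under line rate once retries, encryption, and timeout tolerance are factored in:

```python
# Transfer time for a backup over the WAN, given effective (not line-rate)
# throughput in megabits per second. Uses decimal units: 1GB = 8000 megabits.
def transfer_hours(data_gb, effective_mbps):
    return data_gb * 8000 / effective_mbps / 3600

print(round(transfer_hours(4, 4.5), 1))         # ~2 hours for a 4GB incremental
print(round(transfer_hours(100, 1.4) / 24, 1))  # days for a 100GB seed -- about a week
```

Plugging in your own seed size and a measured effective rate tells you quickly whether to seed over the WAN or walk a drive to the remote site.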
Congratulations! You now have a system built with local redundancy plus local and offsite backups, and a three-tier internet access solution that keeps the business connected under every preventable scenario (the 4G wireless fallback even negates physical line cuts, so long as the cell towers stay up).
Your twin systems are protected from hardware failures by their redundant components, and one system backs up the other as a real-time hot spare. You have built a private data center, with the VPN and offsite backups as your cloud, and covered every preventable contingency that could strike the heart of the business.
The cost is estimated at under $20,000 per location to source the hardware and internet access, and the whole setup is a repeatable, supportable template. This consistency across sites is peace of mind for systems administrators and the business alike.
If you are currently in the middle of a build and have some questions, please feel free to DM me on LinkedIn. I know how stressful these things can be and am happy to assist!