Atlas, a new hosting platform, will be launching soon. It builds on technology introduced by Luna, a v6.5 platform with a stronger focus on hardening, specifically reducing unauthorized activity. The server will be activated in a couple of weeks and go live in mid-to-late December for early migrations. As technology becomes more widespread, so too do opportunities to misuse and abuse it. We must take proactive steps to limit not only the risk our clients face but also the external consequences left behind in the event of a hack. These situations have shifted from “what-if” to “when” scenarios, and nothing is as rewarding as relaxing on a beach sipping a Mai Tai in summer. Until we achieve that goal ourselves, we’ll continue to refine the platform, making it iteratively stronger, so that you can have good company on a beachfront on a warm summer day.
Any major platform release demands great care in downselecting hardware. Nothing fills me with more disgrace than signing off on a failed platform; Apollo was the first and last of that class (RAID5 on early-gen 2.5″ 10k SAS). That leads into the question of SSD suitability for high-volume, high-reliability scenarios.
SSD has matured significantly since its introduction in the 2000s, so much so that it is the de facto standard for consumer-grade storage, but consumer grade pales against the needs of 24/7/365 enterprise applications. Hardware fails. In 14 years of running Apis, I’ve seen everything fail, from power supplies to planar boards to capacitors to SCSI ribbon cables. Regardless of statistics, we’re all governed by physics, and in the physical world, shit breaks.
We’re left in a predicament: speed versus reliability, and it generally takes years for an equilibrium to follow an advancement in either field. At the microscopic level, SSD breaks down into a variety of cell formats: single-level (SLC), multi-level (MLC), triple-level (TLC), and enterprise multi-level (eMLC), which determine how many bits a single cell stores. More bits per cell confer more storage but also more volatility and higher wear rates. Wear is bad; it’s what separates a recoverable CRC soft failure from a catastrophic, 100% data loss hard failure. I believe that in no situation that can be avoided should data safety be sacrificed for data performance. Overall performance can be tweaked elsewhere, and storage should be long-term storage, not short-term data retrieval. Short-term data retrieval has always been, and should always be, the job of RAM, not disk storage. Mechanical, so long as you’re not in an earthquake-prone zone, will remain the best option in the near future in terms of cost and benefit.
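To make those cell formats concrete, here’s a minimal sketch; the endurance figures are commonly cited industry ballparks I’ve pulled in for illustration, not vendor specifications.

```python
# Bits stored per cell determine how many voltage states a cell must
# distinguish (2^bits) and, roughly, how quickly it wears out.
# P/E (program/erase) cycle counts are ballpark industry figures.
CELL_TYPES = {
    # name: (bits per cell, approx. P/E cycle endurance)
    "SLC":  (1, 100_000),
    "eMLC": (2,  30_000),
    "MLC":  (2,   3_000),
    "TLC":  (3,   1_000),
}

for name, (bits, endurance) in CELL_TYPES.items():
    print(f"{name}: {bits} bit(s)/cell = {2 ** bits} states, "
          f"~{endurance:,} P/E cycles")
```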
For our situation, reads and writes split 55%/45%, and the majority of that is log files. Every situation is unique and should be evaluated carefully. Our worst-case server pushes around 250 IOPS, split between 2 VMs each hosting ~750 domains; even with each VM churning out websites, handling email, and delivering data over MySQL + PostgreSQL, that fails to make a mark on a single SSD’s theoretical minimum (5,000 IOPS). Even more interesting, the theoretical maximum of an 8x RAID10 15k SAS arrangement is about 1,000.
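As a back-of-the-envelope check on that 1,000 figure, the standard RAID IOPS formula gets you there; the ~180 IOPS per 15k drive below is my assumed ballpark, not a measured number.

```python
# Effective IOPS for RAID10: effective = raw / (r + p * w), where r/w is
# the read/write split and p is the write penalty (2 for RAID10, since
# every write lands on two mirrored drives).
drives = 8
per_drive_iops = 180                 # assumed ballpark for one 15k SAS drive
read_frac, write_frac = 0.55, 0.45   # workload split from this post
write_penalty = 2                    # RAID10

raw = drives * per_drive_iops
effective = raw / (read_frac + write_penalty * write_frac)
print(f"raw: {raw} IOPS, effective: {effective:.0f} IOPS")  # ~993, i.e. ≈1,000
```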
But there’s a nasty downside to SSD beyond performance overkill. Take a comparable 512 GB Samsung 850 Pro, an SSD, at approximately $225. Compare that with a 600 GB Dell-certified 15k SAS drive at $233 ($990 via Dell, yeowch!). SSD requires overprovisioning to avoid overusing cells, which would otherwise cause premature drive failure; in enterprise applications, the rule of thumb is 20%, leaving 410 GB usable. That puts SSD at $0.5487/GB versus $0.375/GB for an entry-level nearline SAS drive. This excludes consumer-grade MLC drives with far lower endurance and exponentially higher failure rates at 24/7/365 duty cycles. SSD has improved since its introduction, but the laws of probability still dictate that increasing the number of drives greatly amplifies the odds of at least one failure. The Birthday Paradox illustrates this risk of multiplicity so well that it’s often taught in introductory algebra classes.
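The math behind that multiplicity risk is short; here’s a sketch, with a hypothetical 2% annual per-drive failure rate standing in for whatever the real figure is:

```python
# With n independent drives, each failing in a year with probability p,
# the chance of at least one failure is 1 - (1 - p)**n. Small p plus
# enough drives and a failure somewhere becomes a near certainty.
p = 0.02  # hypothetical annual failure rate for a single drive

for n in (1, 8, 50, 200):
    print(f"n={n:>3} drives: P(at least one failure) = {1 - (1 - p) ** n:.1%}")
```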
Samsung officially rates its 850 Pro good for 300 TBW. Augend, the most recently retired server, sported two 300 GB 3.5″ 15k SAS drives since its upgrade in 2010, and in 6 years it processed over 723 TB of data on each drive (RAID1). Compared to other servers, it’s rather quiet. Helios, the largest monolithic server, has processed at least 1.3 PB on each drive since 2012. Based upon Samsung’s numbers, at a life expectancy of 300 terabytes written, Augend would have gone through 4 drives over those 6 years, and Helios 18 drives. I could improve IOPS at the expense of reliability, but I would be sacrificing long-term reliability to improve something that isn’t a limiting factor in our architecture.
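A quick sanity check of that arithmetic, taking the per-drive data volumes above as write volume; Helios’s drive count isn’t stated here, so its figure is per drive slot:

```python
# SSD lifetimes consumed per drive slot = TB written / rated TBW.
TBW = 300  # Samsung's rating for the 512 GB 850 Pro

for server, tb_per_drive in (("Augend", 723), ("Helios", 1300)):
    print(f"{server}: {tb_per_drive / TBW:.1f} SSD lifetimes per drive slot")

# Augend: ~2.4 lifetimes across its 2 RAID1 slots lines up with the
# 4-drive figure; Helios at ~4.3 lifetimes per slot would reach 18
# drives with a ~4-drive array (drive count assumed, not stated).
```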
Data loss is scary. It’s scarier than a DoS, because at the end of it all, you have to tell your clients, “I failed. I failed to uphold the faith you placed in my company.” It goes beyond that: it erodes trust. Yes, there are backups, just as there is insurance, but if your engine failed and resulted in the catastrophic loss of your vehicle, you’d lose faith in the manufacturer. Data loss is no different. Your data drives your business. If I were in your shoes and suffered a catastrophic loss, I too would switch providers.
I love SSD. I use it in my desktop. But I would never put the needs of marketing or the opportunity to improve client density ahead of the safety of client data. We all work hard, and we deserve an environment in which to thrive. Based upon my research, I am not yet 100% comfortable with SSD handling rigorous workloads. I’m not OK with the reduction in usable storage, nor with the workarounds needed to preserve performance over the long haul. I am, however, OK with grabbing a Mai Tai with you on the beach without a worry in the world.
Atlas will not be an SSD platform, but I hope that SSD’s improvements afford its successor a chance when the time comes in 1-2 years.
PS: Atlas will be open for migrations in December. Got a request? Let me know on the forums!
Matt Saladna
Owner, Lead Platform Architect