Goodbye Blinky Lights: The Art of Large Scale Management

Benjamin Franklin wrote, “…in this world nothing can be said to be certain, except death and taxes.” For the system administrator, we can add one extra certainty—server failures.

Today, with resilient applications, server failures rarely lead to outages. However, an accumulation of these failures results in idle capital and a higher likelihood of application failure, both of which are costly to a business.

At first, each server in a small company can be given individual attention: failures are corrected by combing through logs or simply looking at the front bezel for the single blinky Health LED (HP) or Error LED (Dell). These failures result in frequent support requests, and the process for repairing servers and stocking spare parts is mostly ad hoc.

As a company grows and its server count exceeds 500, diagnosing and repairing servers starts consuming much more of the system administrator's time, and administrators would much rather manage critical applications than hardware.

Based on past experience, here is an estimate of the typical Annualized Failure Rate (AFR) for a simple web server and the number of hours per year an administrator dedicates to hardware failures across a 500-server fleet. The estimate assumes that every failure is a maintenance event and that, on average, each failure requires three hours to troubleshoot, repair, and manage logistics.

Component             AFR      Quantity/server   Failures/year (500 servers)
SSD                   0.50%           1                    2.5
Power supply          1.50%           1                    7.5
DRAM                  1.00%           4                   20
Fan                   1.00%           4                   20
Motherboard           1.00%           1                    5
CPU                   0.30%           2                    3
Total AFR (serial)   11.60%                               58
Annual administration hours (3 hours per failure)        174
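As a rough sketch of the arithmetic behind this table (assuming the 500-server fleet mentioned above and three hours of effort per failure), a few lines of Python make the model easy to adjust for your own fleet size and component mix:

    # Rough failure-budget model: component AFR x quantity x fleet size.
    # Figures mirror the table above; adjust them for your own hardware.
    FLEET_SIZE = 500
    HOURS_PER_FAILURE = 3

    components = {              # name: (AFR per component, quantity per server)
        "SSD":          (0.005, 1),
        "Power supply": (0.015, 1),
        "DRAM":         (0.010, 4),
        "Fan":          (0.010, 4),
        "Motherboard":  (0.010, 1),
        "CPU":          (0.003, 2),
    }

    server_afr = sum(afr * qty for afr, qty in components.values())
    failures_per_year = server_afr * FLEET_SIZE
    admin_hours = failures_per_year * HOURS_PER_FAILURE

    print(f"Per-server AFR (serial): {server_afr:.1%}")        # ~11.6%
    print(f"Fleet failures per year: {failures_per_year:.0f}")  # ~58
    print(f"Admin hours per year:    {admin_hours:.0f}")        # ~174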


Using administration hours to troubleshoot and repair hardware is not glamorous work. The overhead of dealing with your server vendor's customer service department can drive a system administrator to demand a migration to the cloud. However, moving to the cloud isn't always an option for many reasons, including cost, performance, and business requirements. So, administrators begin to do what they do best: automate.

For companies using in-house infrastructure, an automated management system can vary in complexity, but it is often the best way to manage the increasing costs of server failures.

The evolution of hardware workflow automation takes time, and there is a range of sophistication. Even companies with fewer than 500 servers have systems that can detect failures, remove failed servers from production, alert vendors to the failure, and schedule a third-party repair technician without administrator intervention.

Yet many large companies operating more than 50,000 servers still rely on technicians patrolling the data center aisles looking for those infamous blinky lights. These servers are tagged, and then the administrator begins the laborious process of poring over spreadsheets, making phone calls, and holding weekly repair meetings with vendors.

Diagnostics

I know of no commercially available tools or open source software that can diagnose hardware issues across multiple server vendors. There are, however, excellent tools for processing large volumes of log files, and servers are prolific at generating hardware-specific log lines every day.

Some of the most basic tools collect hardware logs and filter them through a library of essential error messages. These libraries grow over time as administrators improve their skills at identifying failure messages. The volume of error logs also grows due to newly installed hardware and administrators becoming savvier at polling specific components.
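A minimal sketch of this kind of filter, assuming a hand-curated library of known error signatures (the patterns below are illustrative, not taken from any particular vendor's logs):

    import re

    # Hypothetical library of known hardware-error signatures, grown over time
    # as administrators learn which log lines precede real failures.
    ERROR_SIGNATURES = {
        "DRAM":         re.compile(r"(uncorrectable|correctable) ECC error", re.I),
        "SSD":          re.compile(r"media error|reallocated sector", re.I),
        "Power supply": re.compile(r"power supply .*(failure|lost)", re.I),
        "Fan":          re.compile(r"fan \d+ (failure|below threshold)", re.I),
    }

    def classify_line(line):
        """Return the suspected component for a log line, or None."""
        for component, pattern in ERROR_SIGNATURES.items():
            if pattern.search(line):
                return component
        return None

    def scan_log(path):
        """Yield (component, line) pairs for every line matching a known signature."""
        with open(path) as log:
            for line in log:
                component = classify_line(line)
                if component:
                    yield component, line.rstrip()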

In larger-scale deployments, tools using machine learning are written to increase diagnostic accuracy, and the outcome of each repair attempt refines the models. These tools help manage failure recovery, identify failure patterns to resolve, and improve data center utilization.
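How such a model is built varies widely; as one hedged illustration (using scikit-learn, with toy log snippets standing in for a real labeled repair history), a simple text classifier can map log lines to the component most likely at fault:

    # Illustrative only: a real deployment would train on its own fleet's logs
    # and confirmed repair outcomes, not on the placeholder examples below.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # (log line, component confirmed faulty after repair) -- placeholder data
    history = [
        ("uncorrectable ECC error on DIMM A2", "DRAM"),
        ("fan 3 speed below threshold", "Fan"),
        ("nvme0: media error during read", "SSD"),
        ("PSU 1 input voltage lost", "Power supply"),
    ]
    lines, labels = zip(*history)

    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(lines, labels)

    print(model.predict(["correctable ECC error on DIMM B1"]))  # likely "DRAM"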

Repair

The physical server repair is a small portion of the overall process.

A thorough repair process includes tracking the time and materials used. Serial numbers of parts added to and removed from servers are logged to keep parts depots properly stocked, to generate certificates of destruction for storage media, and to track component recidivism.
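A sketch of the kind of record such a process might keep per repair (the field names here are assumptions, not any particular system's schema):

    from dataclasses import dataclass, field
    from datetime import datetime

    @dataclass
    class RepairRecord:
        """One repair event, logged for depot stocking and recidivism tracking."""
        server_serial: str
        component: str                    # e.g. "DRAM", "SSD"
        removed_part_serial: str          # feeds certificates of destruction
        installed_part_serial: str        # feeds depot stock counts
        technician: str
        hours_spent: float
        timestamp: datetime = field(default_factory=datetime.utcnow)

    # Counting how often the same part serial reappears in these records exposes
    # "recidivist" components that keep getting swapped back into the fleet.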

With a comprehensive toolset, operators can improve the ratio of servers per technician. For example, Facebook claims to operate 20,000 servers for every technician.

Verification and Return to Production

Once a technician marks a server as repaired, the automated tools verify that the server is functioning. Verification usually begins with a fresh installation of the operating system or system image. Low-level burn-in tests then stress the hardware and can include reading from and writing to memory and disk, CPU load, network tests, and verification of operating parameters.

Servers that fail burn-in tests return to the repair queue. Servers that pass all tests are returned to a free pool and made available for allocation.
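A minimal orchestration sketch, assuming common open source stress tools (stress-ng, memtester, fio) are installed on the machine under test; the exact commands, durations, and pass/fail criteria would be tuned per platform:

    import subprocess

    # Illustrative burn-in suite; adjust durations and sizes for real hardware.
    BURN_IN_TESTS = [
        ("cpu",    ["stress-ng", "--cpu", "0", "--timeout", "300s"]),
        ("memory", ["memtester", "1024M", "1"]),
        ("disk",   ["fio", "--name=burnin", "--rw=readwrite",
                    "--size=1G", "--runtime=300", "--time_based"]),
    ]

    def run_burn_in():
        """Run each test; return True only if the whole suite passes."""
        for name, cmd in BURN_IN_TESTS:
            result = subprocess.run(cmd, capture_output=True)
            if result.returncode != 0:
                print(f"burn-in FAILED at {name} test")
                return False        # back to the repair queue
        return True                 # return the server to the free pool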

Data collection is a crucial part of building management automation, and it requires a large amount of development time. Unfortunately, the large variety of hardware components makes the task complex.

The Intelligent Platform Management Interface (IPMI) specification was designed to create a uniform management interface across hardware from different vendors. Over time, vendors modified the IPMI interface and created custom flavors such as the Dell Remote Access Controller (DRAC) and HP Integrated Lights-Out (iLO). These customized interfaces added functionality but broke away from the IPMI standard, making the job of gathering data across vendors even more challenging.

Recently, a new standard called Redfish has been gaining popularity. It promises to bring standardization to data center infrastructure management and help simplify the task of collecting data from multiple types of hardware.
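As a sketch, pulling thermal readings over Redfish is a couple of HTTP calls; the BMC address, credentials, and chassis ID below are placeholders, and resource paths and property names can vary slightly between vendors and Redfish versions:

    import requests

    BMC = "https://bmc.example.com"     # placeholder BMC address
    AUTH = ("admin", "password")        # placeholder credentials

    # The Redfish service root is /redfish/v1/; thermal data typically hangs
    # off a chassis resource. verify=False only because many BMCs ship with
    # self-signed certificates -- use proper CA verification in production.
    thermal = requests.get(f"{BMC}/redfish/v1/Chassis/1/Thermal",
                           auth=AUTH, verify=False).json()

    for fan in thermal.get("Fans", []):
        print(fan.get("Name"), fan.get("Reading"),
              fan.get("Status", {}).get("Health"))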

Lightbits System Management

At Lightbits, we realize that automation is a critical part of building scalable infrastructure, and we provide multiple interfaces for managing the LightOS™ open storage platform. REST and gRPC APIs are available for general systems management, and a Prometheus exporter is available for monitoring system metrics over time.

Each interface's goal is to allow easy plug-and-play integration with management and monitoring systems. For example, LightOS flags bad SSDs and alerts on them via the REST API. These API calls can be piped into a repair ticketing system and queued directly into a technician's workflow.
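A hedged sketch of that glue code: the endpoint path and response fields below are hypothetical stand-ins (consult the LightOS API reference for the real resource names), and create_ticket stands in for whatever ticketing system is in use:

    import requests

    LIGHTOS_API = "https://lightos.example.com/api"    # hypothetical base URL
    TICKETS = "https://tickets.example.com/api"        # hypothetical ticketing system

    def create_ticket(summary):
        """Stand-in for the ticketing system's own client library."""
        requests.post(f"{TICKETS}/tickets", json={"summary": summary})

    # Hypothetical endpoint and fields -- see the LightOS API documentation
    # for the actual resources and health states.
    for ssd in requests.get(f"{LIGHTOS_API}/ssds").json():
        if ssd.get("health") == "failed":
            create_ticket(f"Replace SSD {ssd['serial']} in node {ssd['node']}")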

Similarly, LightOS can alert system administrators about capacity limitations. These alerts can be fed into capacity planning tools to enable accurate predictions of future capacity requirements.

So when the inevitable SSD or server failure occurs, the Lightbits storage solution safeguards against data loss by supporting in-system and multi-box data protection. Data is protected from SSD failures using a highly optimized Erasure Coding (EC) algorithm managed in concert with the high-performance LightOS Global FTL, which streamlines data management across the pool of SSDs, increasing performance and keeping data safe.

With this feature, performance and endurance are maintained while the administrator is given buffer time to carry out the required repairs.
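As a toy illustration of the idea behind erasure coding (single-parity XOR here, not the actual LightOS algorithm), losing any one drive's block still leaves enough information to rebuild it:

    # Toy single-parity example: the XOR of the data blocks lets any one missing
    # block be reconstructed from the survivors. Production erasure codes use
    # more sophisticated math and wider protection than this sketch.
    def xor_blocks(blocks):
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                out[i] ^= byte
        return bytes(out)

    data = [b"AAAA", b"BBBB", b"CCCC"]      # blocks striped across three SSDs
    parity = xor_blocks(data)               # parity stored on a fourth SSD

    survivors = [data[0], data[2], parity]  # the SSD holding data[1] fails
    rebuilt = xor_blocks(survivors)
    assert rebuilt == data[1]               # the lost block is recovered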

About the Writer:

Chief Technology Evangelist, Lightbits Labs