Downtime, failures and breakdowns – Understand the true costs
This content is brought to you by Evolven. Evolven Change Analytics is a unique AIOps solution that tracks and analyzes all actual changes made in the enterprise cloud environment. Evolven helps leading companies reduce the number of incidents, improve troubleshooting time and eliminate unauthorized changes.
When it comes to mission-critical applications or data center performance quality, companies are willing to make huge investments. Unfortunately, these investments do not always deliver full performance.
Coping with system downtime
Despite the efforts that have been put into infrastructure resilience, many IT organizations continue to deal with database, hardware and software downtime lasting from just a few minutes to several days, completely incapacitating the business and causing huge losses.
The world of IT outages can sometimes seem uncomfortable.
Despite the variety of advanced solutions and the growing amount of data being collected by major enterprise software vendors and IT departments (from ERP to CRM and more), outages are still a valid and daunting threat to the industry.
On the other hand, IT outages have somehow become an inherently accepted, even expected, part of corporate life.
This is counterintuitive...
IT downtime revisited
While IT professionals experience downtime from time to time and then focus fully on managing it, the business organization as a whole suffers "financial pain" from effects that are typically very significant.
In the past we've delved deeply into the many ways in which IT downtime can impact the bottom line of organizations (you can read more about it here: Cost and scope of unplanned outages). We looked at various aspects, from direct sales losses and damage to reputation to indirect effects such as loss of productivity.
Now, I want to revisit the topic and examine how organizations should address and assess threats to their IT operations, including systems, applications and data, by analyzing robust (and established) benchmarks that represent the potential costs of downtime and outages.
Measuring the failures of big brands
When should the industry start measuring the financial impact of large brand failures like those that have occurred recently: the Facebook outage, the one that hit hundreds of thousands of Lloyds Bank customers, or the Jetstar failure that caused hundreds of flight delays?
In other words, when is an outage "significant enough" that a cost analysis becomes valuable for the industry to learn from and predict the impact of future outage events?
Well, apparently at some point the outage has an impact that cannot be ignored PR-wise. That's the point of no return, followed by estimates of the financial impact.
The cost of downtime varies significantly between industries. The size of the company involved is obviously a critical factor, but not the only big one. The role of the IT systems in the company is also crucial.
Putting a numeric value behind an IT outage means pre-defining its impact on multiple business and organizational aspects so the entire industry can learn and optimize accordingly.
A failure of a critical application can result in two different types of losses:
- Application service outage – the impact of downtime varies by application and organization;
- Data Loss – The potential loss of data due to a system failure can have significant legal and financial implications.
I'm sure you would agree that today's data centers should never go down; applications need to be available 24/7, and internal (let alone external) end users worldwide need to be able to rely on data center availability (for critical data and application availability) at all times.
Well, reality bites. This is not the case in the back office (i.e. within the data center). No organization enjoys 100% uptime. Should you want to achieve 100%? Sure. But you should also develop a deep understanding of the impact of downtime and ways to minimize it.
The worst nightmare of all time? Probably what happened to you...
Some previous outages have turned into PR disasters, like the legendary Virgin Blue debacle of 2010 or the most recent one that affected Facebook.
Why? The mass impact probably had something to do with it.
As a reminder, Virgin Blue's outage prevented passengers from boarding flights for 11 days (!!), resulting in negative press, damaged reputations and millions in losses.
More specifically, Virgin Blue's reservations management company, Navitaire, eventually paid Virgin Blue more than $20 million in compensation (Navitaire booking error brings Virgin $20 million in compo).
There are many other incidents that still attract media attention. Here's just one recent example: a USA Today article about a Wells Fargo outage that prevented customers from accessing their accounts for many hours.
I can safely say that anyone in IT would agree that failures or downtime are VERY bad for business. They are undesirable, very damaging financially and must be combated with all available means.
Misconfigurations are key
The IT Process Institute's Visible Ops Handbook has reported in the past that "80% of unplanned outages are due to poorly planned changes made by administrators ('ops') or developers" (Visible operations).
Enterprise Management Associates reported that 60% of availability and performance failures are due to misconfigurations.
How much does it cost?
Web application downtime can cost organizations $5,600 per minute, which works out to over $300,000 per hour (per a 2014 Gartner analysis).
[Figure: average hourly cost of enterprise server downtime worldwide, 2017-2018]
Application maintenance costs are increasing at 20% annually. But that can't solve all your problems. A previous industry survey found that at least a quarter of the downtime surveyed was caused by configuration errors. (How much will you spend on application downtime this year?).
How common are downtime and breakdowns?
Ok, downtime can be a financial nightmare. That part is clear. But if you want to properly assess the potential risk of failure for your business, the immediate question should be, "How likely is it to happen?"
Source: Data Center Knowledge
Ok, so failures are far too common to be ignored by thinking "I probably won't see a major failure". Now the question arises as to how you can calculate the specific risk for your company.
Production and application downtime costs made transparent
Unplanned outages must be resolved by IT. Nonetheless, and as already mentioned, at the end of the day these failures affect the entire organization.
An important part of a thorough downtime risk assessment is estimating how much money you will lose per hour (or minute, or whatever time interval you choose) in the event of downtime.
For businesses that depend solely on the data centers' ability to provide IT and network services to customers -- such as telecom service providers or e-commerce companies -- downtime can be particularly costly, with the highest cost of a single event exceeding $1 million (more than $11,000 per minute) according to expert estimates.
In a USA Today survey of 200 data center managers, over 80% said their downtime costs exceeded $50,000 per hour. Over 25% reported downtime costs in excess of $500,000 per hour (!!).
According to another survey, while companies cannot achieve zero downtime, one in ten companies states that their availability needs to be greater than 99.999%.
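To put such availability targets in perspective, here is a minimal sketch that converts an availability percentage into the downtime it allows per year. The function name and the 365-day year are my own assumptions for illustration, not something taken from the survey.

```python
# Assumption: a non-leap year of 365 days.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Return the yearly downtime (in minutes) that an availability target permits."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


# Compare a few common targets, from "two nines" to "five nines".
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability allows about {minutes:.2f} minutes of downtime per year")
```

At 99.999% the budget is a little over five minutes for the entire year, which is why so few companies can realistically commit to it.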
To get a thorough understanding of the impact of production and release downtime, let's take a look at how the consequences of downtime manifest themselves.
Downtime costs - per year or per incident?
A 2017 study found that 46% of 400 IT decision makers experienced more than four hours of IT-related downtime over a 12-month period; 23% said they incur costs between $12,000 and more than $1 million per hour.
Over 35% admitted they are unsure of the cost of an outage to their business.
If you ask Delta Air Lines, which had to cancel 280 flights due to one failure in 2017, the losses from a single failure can reach over 150 million dollars.
A few years ago, Dun & Bradstreet reported that 59% of Fortune 500 companies experience at least 1.6 hours of downtime per week.
If you take an average Fortune 500 company (or a company with at least 10,000 employees) and assume that it pays an IT team member an average of $56 an hour, then (assuming all IT does is fix the downtime) just the labor component of downtime for a company this size would reach $896,000 per week (10,000 x $56 x 1.6 hours), which is more than $46 million per year (Assessing the financial impact of downtime).
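The labor figure above is simple multiplication, and a short sketch makes it easy to plug in your own numbers. The function name and the 52-week year are assumptions of mine; the inputs (10,000 people, $56 an hour, 1.6 hours a week) are the ones quoted in the text.

```python
def weekly_labor_cost(headcount: int, hourly_rate: float, downtime_hours_per_week: float) -> float:
    """Labor cost of downtime per week: people affected x hourly rate x hours lost."""
    return headcount * hourly_rate * downtime_hours_per_week


# Figures quoted in the article.
weekly = weekly_labor_cost(headcount=10_000, hourly_rate=56.0, downtime_hours_per_week=1.6)

# Assumption: 52 working weeks per year.
annual = weekly * 52

print(f"Weekly labor cost of downtime: ${weekly:,.0f}")
print(f"Annual labor cost of downtime: ${annual:,.0f}")
```

This only covers labor. Lost sales, penalties and reputational damage come on top of it.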
Of course, the reality is more complicated since you have to take into account many parameters such as the time of the event (mid-week or weekend? day or night?) and more. However, understanding the cost of downtime goes a long way in estimating your exposure to risk and the ROI of tools that can help minimize the impact of downtime.
Has the industry been able to learn from the past and minimize collateral damage in the event of an outage?
How have things changed from the past?
So we already know that there is still downtime and outages that the industry has yet to successfully eliminate. But how have their costs changed over time? Are these incidents less harmful today?
In 2010, an investigation by Coleman Parkes found that IT downtime costs companies a total of more than 127 million hours per year (an average of 545 hours per company) in employee productivity.
In 2009 it was reported that the average cost of downtime varies significantly by industry, from about $90,000 per hour in the media sector to about $6.48 million per hour for large online brokers (How to quantify downtime).
According to a survey of IT managers conducted during these years, companies are becoming more aware of the direct financial cost of computer failures. The survey found that one in five companies loses $12,000 an hour from system downtime (How to quantify downtime).
As mentioned above, later analysis conducted by Gartner in 2014 found average costs of $5,600 per minute and over $300,000 per hour.
As early as 2004, a conservative estimate by Gartner put the hourly cost of computer network downtime at $42,000. Accordingly, a company that suffers an above-average 175 hours of downtime per year can lose more than $7 million annually. However, the cost of each outage affects every business differently, so it's important to know how to calculate the exact financial impact (How to quantify downtime).
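The Gartner figures quoted here can be checked with a few lines of arithmetic. This is only a sketch with function names of my own choosing; the inputs ($5,600 per minute, $42,000 per hour, 175 hours per year) are the ones cited in the text.

```python
def hourly_cost_from_minute(cost_per_minute: int) -> int:
    """Convert a per-minute downtime cost into a per-hour cost."""
    return cost_per_minute * 60


def annual_downtime_cost(cost_per_hour: int, downtime_hours_per_year: int) -> int:
    """Yearly exposure: hourly cost multiplied by hours of downtime per year."""
    return cost_per_hour * downtime_hours_per_year


# 2014 Gartner figure: $5,600 per minute.
hourly_2014 = hourly_cost_from_minute(5_600)

# 2004 Gartner estimate: $42,000 per hour, applied to 175 hours of downtime per year.
annual_2004 = annual_downtime_cost(42_000, 175)

print(f"2014 estimate per hour: ${hourly_2014:,}")
print(f"2004 estimate per year: ${annual_2004:,}")
```

Both results line up with the rounded figures in the text: over $300,000 per hour and more than $7 million per year.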
It makes sense to think that the cost of downtime will only increase over time (since we all rely more on data systems today). You can therefore understand why past figures can be multiplied by a significant number to reflect today's reality...
Every minute counts
More than a decade ago, the average cost of data center downtime across industries was estimated at approximately $5,600 per minute (Unplanned IT outages cost more than $5,000 a minute), a number that, according to Gartner, remained the same until 2014. The previous Ponemon Institute study referenced above calculated the minimum, median, mean, and maximum cost per minute of unplanned outages, based on inputs from 41 data centers. The largest cost of an unplanned outage was found to exceed $11,000 per minute.
On average, the cost of an unplanned outage is likely to be over $5,000 per minute.
It only gets more significant
A 2013 study saw an increase of over 41% over the previous averages described above, to an average of more than $7,900 per minute.
A 2015 ITIC survey clearly showed that hourly costs had increased by 25% to 30% (compared to 2008 data).
Impact of downtime per year
A previous analysis by Gartner calculated that downtime can reach an average of 87 hours per year. Obviously this is the sum of many failures - from a few minutes to several hours (An average large enterprise experiences 87 hours of network downtime per year).
How have things changed?
Later research from 2011 revealed that while the industry has been successful in combating the downtime epidemic and reducing its frequency, we are still seeing significant downtime and huge revenue losses. One outage reportedly resulted in over 3 million users (apparently from WhatsApp) migrating to Telegram.
The impact on reputation and loyalty
How much is your business reputation worth? This can be extremely difficult to assess, as can the long-term impact of a damaged reputation and its impact on sales and profitability.
In this case, the cost of downtime includes lost customers (both short- and long-term) and other tangible items that reflect the cost of reputational damage, such as stock price declines, marketing time (crisis and brand recovery management), and the media budget required to restart and revitalize the profile of an organization.
Which parameters should influence your calculation?
When trying to estimate the cost of downtime, there are the obvious direct costs (e.g., lost business during the downtime). However, many indirect costs such as employee overheads or reputational issues mentioned above should also be factored in.
Personnel costs come from the cost of firefighting "war room" tasks focused on getting IT systems up and running again, the cost of delays in all other scheduled tasks, the cost of staff overtime (if applicable) and more. Add to this the value of data loss, emergency maintenance fees (especially if the outage occurs outside of business hours), and additional repair costs that can persist long after service is restored.
Of course, you need to calculate these costs when estimating the impact of downtime, as they are usually very significant. But even a rough estimate can prove extremely helpful in understanding the risks and deciding what level of technology to lean on to combat them.
There's also the impact of lost sales. To get an accurate estimate of total lost sales, the impact percentage needs to be increased to reflect the actual lifetime value of customers who permanently switch to a competitor. Take, for example, the Facebook (and WhatsApp) outage I mentioned earlier (Cost Conscious: Denying the true cost of network downtime). What loss of revenue results from these users being served fewer billable ad impressions?
Stock down 25%
Although it is difficult to quantify so many parameters, they are real and significant. For example, when Amazon.com went offline for several hours in its early days, its stock fell 25% in a single day (Cost Conscious: Denying the true cost of network downtime)!
In this Amazon cloud outage, for example, the company continued to struggle to bring its cloud services back online. As a result, many customers questioned the reliability of its cloud and Amazon's communications related to the outage. Other customers felt they should be compensated for the downtime as part of their SLA.
I know you're curious: As for the SLA, despite the nearly four-day outage, Amazon's EC2 SLA was not breached (Seven lessons learned from Amazon's failure).
Downtime costs: Calculate yourself
How much do you stand to lose from an unexpected server or business application failure?
According to multiple sources, the easiest way to calculate potential lost revenue during an outage is to use this equation:
LOST REVENUE = (GR / TH) x I x H
GR = gross annual revenue
TH = total annual working hours
I = impact percentage (the share of revenue affected by the outage)
H = number of hours lost
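The equation is easy to turn into a small calculator. The sketch below follows the formula term by term, treating I as the fraction of revenue affected by the outage (my reading of the "impact percentage" mentioned earlier). The function name and all example values are hypothetical, chosen only to show the mechanics.

```python
def lost_revenue(gross_annual_revenue: float, annual_working_hours: float,
                 impact_fraction: float, hours_lost: float) -> float:
    """Lost revenue = (GR / TH) x I x H."""
    if annual_working_hours <= 0:
        raise ValueError("annual_working_hours must be positive")
    if not 0 <= impact_fraction <= 1:
        raise ValueError("impact_fraction must be between 0 and 1")
    revenue_per_hour = gross_annual_revenue / annual_working_hours  # GR / TH
    return revenue_per_hour * impact_fraction * hours_lost


# Hypothetical company: $50M annual revenue, 2,000 working hours per year,
# an outage that affects half of the revenue stream and lasts 4 hours.
estimate = lost_revenue(
    gross_annual_revenue=50_000_000,
    annual_working_hours=2_000,
    impact_fraction=0.5,
    hours_lost=4,
)
print(f"Estimated lost revenue: ${estimate:,.0f}")
```

The result is only as good as the impact fraction you feed in, so it is worth running the calculation for a best case and a worst case rather than a single number.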
How can the risk of breakdowns and downtime be minimized?
Downtime and failures are catastrophic, but they don't have to be that severe. By using solutions that focus on getting to the root of the problem, failures can be prevented before they even happen.
Evolven Change Analytics is a unique AIOps solution that focuses on change, the true cause of performance incidents. Evolven helps enterprise IT and cloud ops teams prevent and resolve incidents before problems occur.
Contact us to see how we are helping leading companies reduce the number of incidents and MTTR.
This new report suggests that unplanned downtime now costs Fortune Global 500 companies 11% of their yearly turnover, almost $1.5tn. This is up from $864bn (8% of turnover) two years ago.
What is downtime? What are the costs associated with downtime?
Downtime cost is defined as any profit that a company loses when its equipment or network stops functioning. The cost of downtime implies not only direct financial loss but can have an impact on your company in at least four other ways.
What is downtime failure?
In industrial environments, downtime may refer to failures in production equipment. This type of downtime is often measured as downtime per work shift or downtime per 12- or 24-hour period. Downtime duration is the period of time when a system fails to perform its primary function.
What are the two major considerations when calculating the cost of downtime?
Calculating Downtime Cost
The duration of the downtime and the cost incurred per minute you're offline are the two variables that most affect the financial impact of an outage.
The first way to measure your equipment downtime is in actual time. For a given asset (or set of assets), record the amount of time during each month that the asset is broken down. Keeping a running tally and comparing it to past months will help you know when an asset is having more issues than normal.
What are the three types of downtime?
Common categories of downtime include excessive tool changeover, excessive job changeover, lack of operator, and unplanned machine maintenance.
How do you explain downtime?
A time during a regular working period when an employee is not actively productive; an interval during which a machine is not productive, as during repair, malfunction, or maintenance.
What are the main causes of downtime?
Unplanned downtime can be due to several reasons, including hardware or software failure, human error, malicious attacks or natural disasters. Since unplanned downtime is unexpected and occurs without a warning, preventing it can be a challenge.
What are the two types of downtime?
Downtime falls into two categories: planned and unplanned. Planned downtime is notable because it offers advance warning and gives users a chance to prepare. Planned downtime is usually done for upgrades or maintenance to the network infrastructure.
What is the difference between downtime and breakdown?
Downtime can be planned or unplanned activity but the breakdown is entirely an unplanned activity. A planned event such as scheduled downtime is cost-effective compared to an unplanned event such as a sudden breakdown. Planned downtime does not delay production whereas breakdown time can cause delays in production.
Breakdown is the result of failure and the effect that failure has over the failure developing period. For example, if the temperature of your electric motor remains too high, it can cause the shaft to snap, creating a breakdown.
How do you calculate downtime cost?
To get a quick estimate of your company's probable downtime costs, use the following formula, based on the size of your business and the number of minutes your most recent incident lasted: Downtime cost = minutes of downtime x cost-per-minute.
What is the average cost of downtime?
91% of enterprises report downtime costs exceeding $300,000 per hour. For 44% of enterprises, hourly costs exceed $1 million. And for 18% of enterprises, downtime costs exceed $5 million per hour.
What are the three hard costs that must be included in cost structure analysis?
Regardless of the exact nature of your business, your cost structure model will likely include fixed and variable costs, along with sunk costs and opportunity costs.
What are the two costs within the operations function?
A business's operating costs are comprised of two components, fixed costs and variable costs, which differ in important ways.
What are downtime metrics?
The most well-known downtime metric is Mean Time to Repair (MTTR). The MTTR metric reflects the average time it takes to troubleshoot and repair a failed piece of equipment.
How do you mitigate downtime?
- Plan for Recovery. The best way to ensure a fast recovery is to plan ahead. ...
- Keep Everything Up to Date. ...
- Educate Your Workforce. ...
- Install a Backup Power System. ...
- Test Your Infrastructure. ...
- Consider Disaster Recovery as a Service.
To measure the KPI, you can track server downtime either as a comprehensive figure (including both planned and unplanned outages), or measure each individually. In the former case, simply add up all the times your servers were offline for the desired measurement period (daily, weekly, monthly, or yearly).
Is downtime a KPI?
Revenue is directly impacted by downtime because the less equipment is running, the fewer products are made and sold. Therefore, one of your maintenance KPIs is downtime. All sorts of quantifiable actions can influence downtime, such as the mean time to repair (MTTR) or planned maintenance percentage.
What is the industry standard for downtime?
World Class Standards For Downtime
Aim for unscheduled downtime to be 10% or less.
- Not-Utilizing Talent.
- Motion Waste.
- Excess Processing.
Consequences of unplanned downtime
Lost productivity and revenue: Every minute of downtime can result in lost productivity and revenue, affecting a business's bottom line. Decreased customer satisfaction: Unplanned downtime can lead to delayed deliveries, canceled orders, and frustrated customers.
Downtime restores attention and motivation, fosters creativity, improves work efficiency and is essential for peak performance. Think about the word recreation for a second and break it apart.
What is downtime in maintenance?
In manufacturing, "downtime" occurs when an unplanned event halts production for a period of time. This event can be a malfunction, repair, or changeover of tools or equipment. Maintenance downtime in particular is when a machine is not operating or being productive due to required maintenance work.
What is downtime in time management?
Downtime management enables you to exclude periods of time from being calculated for events, alerts, or views that can skew CI data. To access: Administration > Service Health > Downtime Management. Alternatively, click Downtime Management.
What are examples of breakdown?
The factory has had frequent equipment breakdowns. Both sides are to blame for the breakdown in communication. The irretrievable breakdown of a marriage can be grounds for divorce.
What is an example of breakdown maintenance?
Examples of breakdown maintenance
An example of planned breakdown maintenance is run-to-failure maintenance, where an organization has decided that letting a piece of equipment break down before servicing is the most cost-effective and least disruptive option.
Breakdown maintenance, sometimes called run-to-failure maintenance, occurs when an asset completely breaks down and needs repair to resume operation.
What are the 4 stages of failure?
Stage 1: Shock and Surprise. Stage 2: Denial. Stage 3: Anger and Blame. Stage 4: Depression.
What are the four levels of failure?
- Unsafe acts of operators (e.g., aircrew),
- Preconditions for unsafe acts,
- Unsafe supervision, and.
- Organisational influences.
The direct costs of downtime are the expenses you can easily quantify and attribute to a specific downtime event. They include the cost of lost employee productivity: this expense captures how much money was lost because employees could not work during the downtime event.
How do you calculate downtime cost per hour?
The cost per hour of downtime is calculated by adding labor costs per hour to the revenue lost per hour.
How do you calculate availability and downtime?
Availability = Uptime ÷ (Uptime + downtime)
That asset also had two hours of unplanned downtime because of a breakdown, and eight hours of downtime for weekly PMs. That equals 10 hours of total downtime.
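As a rough illustration of the availability formula, the sketch below uses the 10 hours of downtime from the example (2 unplanned plus 8 planned). The uptime figure is not given in the text, so the 190 hours used here is purely a hypothetical value, as is the function name.

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Availability = uptime / (uptime + downtime), returned as a fraction."""
    total_hours = uptime_hours + downtime_hours
    if total_hours <= 0:
        raise ValueError("total of uptime and downtime must be positive")
    return uptime_hours / total_hours


unplanned_downtime = 2.0   # hours lost to the breakdown (from the example)
planned_downtime = 8.0     # hours spent on weekly PMs (from the example)
total_downtime = unplanned_downtime + planned_downtime

uptime = 190.0             # hypothetical: hours the asset actually ran

ratio = availability(uptime, total_downtime)
print(f"Total downtime: {total_downtime:.0f} hours")
print(f"Availability: {ratio:.1%}")
```

Note that this counts planned maintenance against availability. If you only want to track unplanned outages, pass the unplanned hours alone as the downtime.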
All manufacturing downtime reduces overall output by stopping production. Unplanned downtime can cost 15 times more than planned downtime. The loss of revenue during any type of asset maintenance can be as high as $3 million per incident.
What are the 4 types of cost analysis?
There are four main types of cost analysis: cost-feasibility, cost-effectiveness, cost-benefit (also referred to as benefit-cost), and cost-utility. Each type of analysis uses the same initial approach to assess resource costs but answers different questions.
What are the 4 types of costs?
Costs are broadly classified into four types: fixed cost, variable cost, direct cost, and indirect cost.
What are the 3 pillars of costing?
Strategic cost management has three important pillars, viz., strategic positioning, cost driver analysis and value chain analysis.
What are the three types of operations costs?
The operating cost includes the cost of goods sold (COGS). Aside from COGS, operating costs also include the other operating expenses that are often called selling, general, and administrative (SG&A). Those three cover rent, payroll, overhead cost, as well as raw materials and other maintenance costs.
What are two basic costs?
The two basic types of costs incurred by businesses are fixed and variable. Fixed costs do not vary with output, while variable costs do.
What are the two components of cost?
Elements of Cost: Direct and Indirect Labour, Administration Overhead.
The average cost of downtime is significant. Each minute costs an average of $9,000, according to the Ponemon Institute, bringing the downtime cost per hour to over $500,000.
What is actual cost reporting?
Cost reporting is a process used to inform a client (or other party) about the magnitude of a construction project's predicted, or actual cost. This can be expressed either in absolute terms or as a variance compared to the project budget.
How much does downtime cost in a data center?
According to Gartner, downtime costs $5,600 per minute on average. This results in average costs between $140,000 and $540,000 per hour depending on the organization. Some factors that contribute to the costs associated with downtime include: Lost sales.
How much does downtime cost the auto industry?
For example, in the auto industry, downtime can cost up to $50,000 per minute. That's $3 million per hour. The true downtime cost includes a variety of wasted business support costs and lost business opportunity costs because resources were needed to resolve a downtime incident that probably didn't need to happen.
How does downtime affect a business?
Repeated downtime events can result in unhappy customers, which can quickly translate into bad customer reviews and tarnished brand image. Data Loss: Downtime affects not only your business but your clients as well. Downtime due to cyberattacks, server or network outage can result in corrupt, damaged or stolen data.
What does downtime mean in SLA?
"Downtime" means the time during which the Service Offering is unavailable (as measured from Dynamic Signal's production data center internet connection points), excluding Force Majeure Events, Scheduled Maintenance, and Scheduled Updates.
How do you calculate true cost?
Divide your total annual cost by your billable hours to find out how much you need to charge per hour to cover expenses. Now multiply the above number by your profit margin to get your hourly service rate. This rate will allow you to cover all expenses while still making your desired profit.
How do you track actual cost in a project?
- Establish Cost Tracking Systems. The first rule is to have a system in place to deal with your expenses and have the capacity to track them. ...
- Provide Online Access. ...
- Identify Budget Items. ...
- Create a Project Budget. ...
- Assign Someone to Track Expenses. ...
- Track and Control Expenses In Real-time.
For example, an auto repair shop may estimate that vehicle repairs will cost $1100, but the actual cost may be $1200. A customer might not be aware of the actual cost until the expenses are incurred during the repairs.
What is 5 nines availability downtime?
Availability is normally expressed in 9's. For example, “5 nines uptime” means that a system is fully operational 99.999% of the time — an average of less than 6 minutes downtime per year. The chart shows what impact various availability levels have on your server downtime.
How do you measure data downtime?
- Labor Cost: ([Number of Engineers] X [Annual Salary of Engineer]) X 30%
- Compliance Risk: [4% of Your Revenue in 2019]
- Opportunity Cost: [Revenue you could have generated if you moved faster, releasing X new products, and acquired Y new customers]
- = $ Annual Cost of Data Downtime.