Thursday 27 March 2014

If tech doesn't mirror customer value, YOUR customers should leave!! (A view on Netcetera's recent epic-fail)

I am sat here on another dreary morning, having waited over 24 hours for my company website to come back up. It is now a race between my hosts getting the site back up and the DNS propagating my website away from them and their now truly awful service.

I am going to take an unprecedented step and name-and-shame a hosting company for appalling service throughout this year. The company is Netcetera, and despite winning awards in the past and being an MS Gold Certified Partner, their shared hosting performance this year has been catastrophically bad. I have yet to experience a hosting provider suffer failures of this size and frequency in such a short period of time.

It's when things go wrong that you get an idea of what goes on in a system or organisation of any kind. Indeed, in software, that's one way a TDD-style test suite tells you how a piece of software works: change the code to the opposite and see what breaks. If nothing breaks, then either there is no coverage or the code is unimportant. Indeed, companies such as Netflix use a 'chaos monkey', or error-seeding practice, to test their organisation's resilience.
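
As a toy illustration of that 'flip it and see what breaks' idea, here is a minimal sketch in Python (the rule and all names are invented purely for the example):

    # Toy 'mutation testing' sketch: flip a rule and check the test notices.
    # The rule and names are invented purely for illustration.
    def is_eligible(age):
        return age >= 18          # original rule

    def is_eligible_mutated(age):
        return age < 18           # deliberately flipped ('mutant') rule

    def test_adult_is_eligible(fn):
        return fn(21) is True     # the test we hope catches the mutation

    for name, fn in [("original", is_eligible), ("mutant", is_eligible_mutated)]:
        print(name, "passes the test:", test_adult_is_eligible(fn))
    # If the mutant still passed, the behaviour wouldn't really be covered.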

The problem is dealing with risks doesn't give you kudos. You don't usually see the effect of a risky event not happening, but you definitely get the stick if it does. So your success means nothing happens. For a heck of a lot of people, it's just not sexy. That's the project manager's dilemma.

Since mid-December 2013, there have been issues affecting me and my company in some form roughly every two weeks on average. A screenshot of my e-mail list of tickets is given below:

Some of the email tickets from the last 3 months, showing 6 major issues, some of which required a lot of my time to help resolve


Yesterday morning at around 7:30am GMT, Netcetera had an outage that took out a SAN due to a failed RAID controller (OMFG!!! A SAN!). This took out their issue ticketing/tracking system, the email system for all sites, the control panel console for all shared hosting packages and their own website client area. So customers couldn't get on to raise tickets, couldn't manage their accounts, couldn't send or receive emails, and couldn't get on to fire off a DNS change (which is all I needed to do). This rang huge alarm bells for me, as the last issue was only 3 weeks ago and I started the process of migrating off Netcetera then. I got an AWS reserved instance and have been running it in parallel as a dark release to smoke test it since then, just waiting for a case such as yesterday's to happen. My only regret is not doing it sooner.

What does this failure tell you?

The failure of the RAID controller immediately tells you that the company obviously had a single point of failure. In a previous post, written after a month containing two high profile catastrophic failures in 2012, I explained the need to remove single points of failure. This is especially important with companies in the cloud as resilience, availability and DR should be their entire selling point as PaaS and IaaS providers. This is an area Netcetera are touting now. If you can't manage shared hosting, you are in no position to manage cloud hosting. Potential customers would do well to bear that in mind.

If their reports are to be believed, in this case everything went through a single RAID controller on to a SAN which covered those Netcetera functions and customer sites. That is a single point of failure and, don't forget, it is also your customers' dependency. RAID was developed specifically to address availability and resilience concerns and comes in many flavours/levels, including mirroring and striping, but it appears that doubling up the RAID controllers, or the 'bigger' entities around them (such as another SAN or a DR site, both of which would include another RAID controller), never happened.

SANs, or Storage Area Networks, distribute that resilience burden across the many disks in the unit. Indeed, you can chain SANs to allow data storage in different geographical locations, such as a main site and a Disaster Recovery (DR) site, thereby halving the risk of catastrophic failure and, just as importantly, reducing your exposure to that risk should it happen. Disks within a SAN are deliberately assigned to utilise no more than about 60% of the available storage space. That way, when alarms ring, you can replace the faulty disk units whilst maintaining your resilience on the remaining space and keeping everything online. The data then propagates to the new platters, and you can adjust the 60% default threshold to factor in your data growth and disk-utilisation profiles. The fact that a single RAID controller took out BAU operations for what is now over 24 hours and counting shows that they didn't have DR in place to reduce the risk for this client group, nor indeed for their entire client base, given the ticketing system, email and client area failed so catastrophically for everyone.
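
To make that ~60% headroom rule concrete, here is a back-of-envelope check (only the 60% figure comes from above; every other number is invented):

    # Back-of-envelope SAN headroom check. Only the ~60% ceiling is from the
    # text above; the capacity and data figures are invented for illustration.
    raw_capacity_tb   = 50.0   # hypothetical total space across the SAN's disks
    utilisation_limit = 0.60   # keep data at or below ~60% of raw capacity
    current_data_tb   = 27.5   # hypothetical data currently stored
    failed_group_tb   = 8.0    # hypothetical data sitting on a failed disk group

    within_limit = current_data_tb <= raw_capacity_tb * utilisation_limit
    free_tb      = raw_capacity_tb - current_data_tb

    print(f"Within the 60% ceiling: {within_limit}")
    print(f"Free space available for a rebuild: {free_tb:.1f} TB")
    # The spare capacity is what lets the array rebuild a failed group's data
    # onto the remaining disks while everything stays online.
    print(f"Rebuild can stay online: {failed_group_tb <= free_tb}")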

To add insult to injury, at all points, Netcetera referred to this as a 'minor incident'. I asked them what this meant, as the client area and ticket failures affected every single subscriber. I was sent the reply that it only affected a small number of servers.

Granted, I am currently peeved and yesterday, that was fuel to the fire! As an enterprise solution architect, pretty much the central focus of my role is to manifest the business vision, driven by the business value, into a working, resilient, technical operation that returns an investment to the business. When you are a vendor, your business value has to be aligned to your customers' value. When you make a sale or manage the account, you are aiming to deliver services in alignment with their needs. The more you satisfy, the more referrals you get, the more money you make. The closer you are to the customer's vision of value, the easier it is to do, and the more money you make.

Failure to do that means the customer doesn't get what they want or need and should rightly seek alternative arrangements from competitors who do. They can't wait around for you to get your act together. It costs them money! In my mind, for any organisation worth its salt, a catastrophic failure for everyone is a P1 issue, even if it doesn't affect the entire IT estate, given customers pay your bills. Their business value is your business' value.

Status page. Can't go 48 hours without a problem


Often technical staff do not see the effect on customers. To them it's a box or a series of servers, storage units, racks and cooling. Netcetera calling this a Minor Incident, even if it looks like that from their perspective inside the box, shows that their value and the customer's value are grossly misaligned, indicating a disconnect in the value chain. This is a solution and enterprise architecture problem.

As some commentators have rightly suggested, monitors used to display service status should also aim to attach a monetary value to that status. That way it is very visible how much a company stands to lose when that status goes down. Developing this utility function is a statistical exercise, but it is usually a combination of the distribution of errors (potentially Poisson in nature), the cost of fixing the issue (staff, parts, overheads, utilities etc.) and the loss of business during that time (to their customers as well), never mind the opportunity costs of reputational damage.
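
A minimal sketch of such a cost-of-downtime calculation might look like this (all rates and figures are invented for illustration):

    # Minimal sketch of a 'monetary value of service status' calculation.
    # All rates and figures below are invented for illustration.
    incidents_per_month   = 2.0     # e.g. a Poisson-distributed failure rate
    mean_outage_hours     = 6.0     # average time to restore service
    fix_cost_per_hour     = 150.0   # staff, parts, overheads, utilities
    lost_revenue_per_hour = 400.0   # direct loss of business during the outage

    expected_outage_hours = incidents_per_month * mean_outage_hours
    expected_monthly_cost = expected_outage_hours * (fix_cost_per_hour +
                                                     lost_revenue_per_hour)
    print(f"Expected outage per month: {expected_outage_hours:.1f} hours")
    print(f"Expected monthly cost of downtime: £{expected_monthly_cost:,.0f}")
    # Reputational and opportunity costs would sit on top of this figure.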


Netcetera SLA


As you can see from the image above, Netcetera have breached their own SLA. Compensation is limited to 1 host credit per hour, up to 100% of the monthly hosting fee, which isn't a lot in monetary terms. Those hosting on behalf of others may lose business customers, or suffer conversion hits and retail cash-flow problems, so the outage has a much bigger effect on them than it does on Netcetera itself. On top of that, the credits can only be spent with Netcetera, which for reseller customers who host client sites is next to useless: they may have lost far more than the credits are worth, and the cap sits well below their potential loss, leaving them financially out of pocket.
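
To see how little that cap is worth, here is the arithmetic with some invented figures (only the 'capped at 100% of monthly hosting' rule comes from the SLA above):

    # Ceiling on compensation vs. a hypothetical reseller's actual loss.
    # The cap (100% of a month's hosting) is from the SLA; the figures are invented.
    monthly_hosting_fee = 15.00     # hypothetical shared-hosting fee, GBP
    max_compensation    = monthly_hosting_fee        # SLA cap on credits
    hypothetical_loss   = 2500.00   # e.g. lost sales across a reseller's client sites

    print(f"Maximum credits under the SLA: £{max_compensation:.2f} (spendable only with the host)")
    print(f"Hypothetical reseller loss:    £{hypothetical_loss:,.2f}")
    print(f"Shortfall borne by the customer: £{hypothetical_loss - max_compensation:,.2f}")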

This is not to say that clients of theirs can't claim for loss of business if they can prove it. Most companies have to hold Professional Indemnity cover for negligence or malice that causes a detrimental loss to an organisation or individual through no fault of the affected party. For those who run e-commerce sites, are resellers for other clients, or use their site as a showpiece or development/UAT box, this is such a scenario and you need to explore your options given your particular context. Basically, if you can prove detriment, then you can in principle put in a claim and take it further if/when they say no.

Why I used them

At the time I joined Netcetera in 2011, I was looking for a cheap .NET hosting package. I just needed a place to put one brochureware site and Netcetera fit the bill. On the whole, they were actually quite good at the time. Tickets were (and are) still answered promptly, they have a 24/7 support service and most issues I have sent them have been resolved within a day. So big tick there.

As time went on, my hosting needs grew a little and I needed subdomains, so given I had used them and had only raised 2 or 3 tickets since 2011, I upgraded my service in December last year. That's when all the problems seemed to start!

If Netcetera became victims of their own success, that seems like a nice problem to have. However, it also means they grew too quickly, didn't manage that transition well and introduced too much distance between executive/senior management and the people on the floor, without capable people in between. This is a change programme failure.

How should they move forward?

Netcetera are in a pretty poor position. Despite their MS Gold Partner status, it is obvious, certainly from what I can see, that they don't attribute business value in the same way their customers do. The list of people dissatisfied with their service updates grew across all their Twitter accounts.

main company twitter feed/sales account

The Netcetera update timeline shows events as they were reported. As you can see, they grossly underestimated how long the email would take to come back online after a restore. If it was a SAN restore, that requires propagation time as data is copied from backup on to the appropriate SAN disk units and then, usually, disk-to-disk. This means their claim that it would be available within an hour was in no way accurate (a back-of-envelope estimate below shows why), again suggesting that the skills to manage SANs after a DR event were not there.

Netcetera Status twitter timeline - Highlighted email announcements. Note time difference.
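
For a sense of why 'an hour' was never realistic, here is that back-of-envelope restore estimate (both figures are invented; only the order of magnitude matters):

    # Back-of-envelope restore time: data size / effective restore throughput.
    # Both figures are invented; the point is the order of magnitude.
    data_to_restore_tb       = 4.0     # hypothetical data on the failed volumes
    effective_throughput_mbs = 120.0   # hypothetical restore rate in MB/s

    seconds = (data_to_restore_tb * 1024 * 1024) / effective_throughput_mbs
    print(f"Estimated restore time: {seconds / 3600:.1f} hours")
    # Roughly 10 hours at these figures, before any disk-to-disk propagation
    # on the SAN itself - an order of magnitude more than 'one hour'.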


The other thing that was particularly concerning is that Netcetera do not run an active-active DR strategy for their shared hosting platform. Active-active uses two platforms running concurrently in two different locations. The data is replicated to the other site, including incremental syncing through tools such as Veeam, and if one site goes down, the DR site auto-switches. Active-active gives almost instantaneous failover; even active-passive with solid syncing can fail over in as little as 3 minutes, and some SSD-based hosts can switch to DR storage, even in a remote site, in around 20 seconds. This is the first alarm bell if you plan to host commercial, critical services with them. So don't!
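
As a crude sketch of the kind of health-check-and-switch loop an active-passive setup relies on (the URL and the switch_dns_to_dr() hook are placeholders, not any provider's real API):

    # Crude sketch of an active-passive failover watchdog. The URL and the
    # switch_dns_to_dr() hook are placeholders, not any provider's real API.
    import time
    import urllib.request

    PRIMARY_URL   = "https://www.example.com/health"   # hypothetical health endpoint
    CHECK_EVERY_S = 30
    MAX_FAILURES  = 3

    def primary_is_healthy():
        try:
            with urllib.request.urlopen(PRIMARY_URL, timeout=5) as resp:
                return resp.status == 200
        except Exception:
            return False

    def switch_dns_to_dr():
        # Placeholder: in reality this would call your DNS provider's API
        # to repoint the record at the DR site.
        print("Primary down - failing over to the DR site")

    failures = 0
    while True:
        failures = 0 if primary_is_healthy() else failures + 1
        if failures >= MAX_FAILURES:
            switch_dns_to_dr()
            break
        time.sleep(CHECK_EVERY_S)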

Conclusion

Netcetera are obviously in crisis, for whatever reason. Even if you are in the lucky position of being able to give Netcetera the benefit of the doubt, I wouldn't risk your or your customers' business on it. Their incident classification shows they do not share the same values as their customers, and for a vendor that relies on volumes of clients, including bigger client offerings, that is not only unforgivable from a client perspective but really bad business!

From a technical standpoint, I'll keep reiterating one of my favourite phrases:

"Understand the concepts, because the tech won't save you!!"


******  UPDATE *******

It's now the 28th of March 2014. Netcetera's system came back online at around 4am. Having tried it this morning, the daily backups they claimed to take for DR purposes obviously don't exist (or at least, yet again, they have a different definition to the rest of us). Going to the File Manager only shows data from July 2013, and the SQL Server instance is actively refusing connections.

Left hand doesn't know what the right hand is doing

Exceptions this morning



It is now 47 hours since the service failed, and their ability to recover from such catastrophic problems is obviously not to be trusted. I certainly won't want Netcetera anywhere near any mission critical applications I run for myself or my clients. There were gross underestimates of how long recovery would take. Their SLA now adds no value, as we are well past the point of accruing 100% of a month's hosting in credits (note that under their current SLA you don't get compensated or refunded in cash, so the PI route is the one you would have to take).

What worries me is that backups are stored on the same storage as the website data. So unless you downloaded the backup, it would have failed with the rest of the Netcetera setup and you would have lost that too. This bit is your responsibility though, so I do wonder what folk would do if they took backups but didn't pull them off the server. Again, this is something you expect from cloud providers anyway and is what you pay for in PaaS and IaaS. It's not just about the hardware, it's about the service that goes with it (that's really the business definition of '...as a Service'). Whilst this is, of course, not what we pay for on shared hosting, it was an opportunity for NC to test their processes out and it is obvious they missed that altogether. So cloud on Netcetera is a definite NO!
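
If you do take your own backups on a host, here is a minimal sketch of pulling a copy off the box so a storage failure there can't take your only copy with it (the FTP host, credentials and paths are all placeholders):

    # Minimal sketch: copy a hosting backup off the server to local storage.
    # The host, credentials and paths are placeholders.
    from ftplib import FTP
    from datetime import date

    FTP_HOST    = "ftp.example-host.com"
    REMOTE_FILE = "/backups/site-backup.zip"
    LOCAL_FILE  = f"site-backup-{date.today():%Y%m%d}.zip"

    with FTP(FTP_HOST) as ftp:
        ftp.login(user="username", passwd="password")   # placeholder credentials
        with open(LOCAL_FILE, "wb") as fh:
            ftp.retrbinary(f"RETR {REMOTE_FILE}", fh.write)

    print(f"Backup copied off the host to {LOCAL_FILE}")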

2 comments:

  1. Well said! Question is where to go and how to evaluate an alternative.

    Replies
    1. There are a number of alternative providers. Indeed, if you're not averse to cloud, evaluating alternatives can be done through the use of free accounts with AWS or MS Azure. They give you 12 months free on tiny usage tiers for you to try out the different models. You can use anything on the free tier stack. You can use larger compute models, but the month's worth of usage just gets used up quicker, in the case of AWS anyway. Azure Websites has a different model.

      To evaluate it, the key is to use your own website but don't move your domain until you've finally settled on someone. After all, that is what you want to ultimately do with it, so there is no sense starting from scratch. If you are running an e-commerce shop, you'll want to consider running a few anonymised transactions through it at the volume you're seeing in live (your IIS logs will give you that). If you need to process credit card info, attach to your credit card provider's sandbox environment (such as PayPal, HSBC, WorldPay etc.).

      If you did go to the cloud, whichever provider you use, select the elements/functions you want, launch your platform and roll. The key is to see how those platforms work at all the appropriate levels. For example, if you like to take control of the VMs in AWS, then create your VM, create an Amazon Machine Image (AMI) from it, store it in S3, or back up your live site to Glacier (the latter will cost you 1 cent per GB per month, even off the free tier) and if your site ever goes down, you can just rehydrate the snapshot/backup and you're up again in 20 minutes or less.
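
      As a sketch of that AMI step using boto3 (the instance ID, image name and region are placeholders, and AWS credentials must already be configured):

          # Sketch of snapshotting a running EC2 instance as an AMI with boto3.
          # The instance ID, image name and region are placeholders.
          import boto3

          ec2 = boto3.client("ec2", region_name="eu-west-1")

          response = ec2.create_image(
              InstanceId="i-0123456789abcdef0",     # placeholder instance ID
              Name="my-site-baseline-image",        # placeholder image name
              Description="Baseline image to rehydrate the site after a failure",
          )
          print("AMI created:", response["ImageId"])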

      For me, I am moving to AWS, simply as a personal preference. AWS allows you to manage the low-level VMs, even if you deploy through a mechanism such as Elastic Beanstalk. You can RDP to those boxes if you want to and, in the case of receiving email, it is something you'll have to do to install an inbound mail server if you want to cost that in the cloud. I am, ironically, using MailEnable, which is one of the email platforms Netcetera use. It's free and easy to install and configure. There are a few Windows Server steps you'll have to do, but nothing you can't do on Windows 7/8 if you are concerned about skill-set. Be aware that it's useful to score the services with a card that contains criteria that are important to you (with or without different weights if you want different levels of importance).

      At all points, be aware of your own website's structure and the overall system's potential failure points (again including the websites/apps, the hardware platform and any of your own supporting processes, for example TTRs etc.). Sometimes, if your app currently uses several servers, you might want to look at whether cloud hosting could save you money or create potential pain points. It's not been a problem in my case, but different apps have different contextual requirements.

