10 August Dallas outage report


We posted a report on the Dallas power outage on 10 August CST.

We are happy to update that report or contribute details here if you have any comments or requests.

[Updated the notice on Sun 14 at 10:08 UTC to include the data center outage report]


28 responses to “10 August Dallas outage report”

  1. We’ve had a pretty active site hosted by RimuHosting for more than 5 years. This is the first time that I was aware of any unscheduled downtime. Yeah, we had some complaints. But everyone understood when they got an explanation. No one is unhappy or complaining on day two.

    I think you guys do such a fantastic job. No one wants to see downtime but you were way overdue. I’m really happy that you started posting on Twitter when the problems happened – that’s the first place I looked and it helped to see messages. It allowed me to post our own Twitter messages to inform our users about what was going on.

    All in all, no big deal. I don’t expect or want any credit, refund, or even any additional explanation. I think you guys obviously took it all seriously and worked as hard as possible to get it all working again.

    Any company can be great when everything goes well. Companies show their character when things go wrong. You guys are the best in my book.

  2. Wow Jeffrey. Thanks for your response. We have such great customers! I know that the whole team here has been working so hard to try and do their best in the situation we’ve found ourselves in.

    We’re open to your advice, criticism and feedback.

  3. For me it was around 4 years without an outage. I believe EC2 and Gmail have both had 2 outages in that time.

    My experience was that I needed to kick MySQL, no biggie – I will try to review that sometime.

    Keep up the good work.

    Cheers, Rob.

    • I think we worked out it was the first power outage, or even major outage of any kind, at that data center in the 7 years we have been with them. I do not recall any other outages off the top of my head. That leaves us technically not too far off a 99.999% uptime even including the outage we had :)
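As a rough check on that figure, the arithmetic can be sketched in Python. The 17-hour outage length is an assumption taken from a later comment in this thread, and the function names are illustrative only.

```python
# Rough availability arithmetic. Assumptions: 7 years of service and a
# single outage of about 17 hours (the duration quoted by a commenter).
HOURS_PER_YEAR = 365.25 * 24


def availability_percent(years, outage_hours):
    """Percentage of the period during which the service was up."""
    total_hours = years * HOURS_PER_YEAR
    return 100.0 * (1.0 - outage_hours / total_hours)


def allowed_downtime_hours(years, target_percent):
    """Total downtime permitted over the period at a given uptime target."""
    return years * HOURS_PER_YEAR * (1.0 - target_percent / 100.0)


if __name__ == "__main__":
    print(f"uptime over 7 years: {availability_percent(7, 17):.3f}%")
    minutes = allowed_downtime_hours(7, 99.999) * 60
    print(f"downtime allowed at 99.999%: {minutes:.0f} minutes")
```

Under these assumptions the uptime works out to about 99.97%, while 99.999% over 7 years would allow roughly 37 minutes of downtime in total.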

  4. Hey, just want to comment. I agree with the first post.

    I’m the sys admin of the server we have hosted with Rimu. This is my opinion, and not that of my employers.

    We have been with Rimu for years. We had many problems with previous hosts.

    This is the first outage I can remember having since being with Rimu – which is impressive in my book. No service will be up 100% of the time – things go wrong. It is my responsibility to mitigate that risk.

    The posts on Facebook alerted me to the problems. I appreciate the transparency. I’m on holiday, but I was able to notify my users of the outage and keep them up to date due to your updates via Twitter and Google+, etc.

    I think Rimu’s support is outstanding – they go well above and beyond what is expected. That is why I recommend them to others, and I still will.

    I popped into the Live Chat area last night, and I could see Rimu still working hard – kids at work in the evening and pizza while they were still working. Rimu were going the extra mile when the chips were down on them.

    Thanks, Rimu, for your effort – I appreciate it.

    Cheers,

    Glen

    • Thank you very much Glen. We do try our best to keep things going, and if all else fails (and the power) we let as many people as possible know exactly what is going on so they can pass that on.

  5. How many other hosts besides Rimu were affected by this outage?

    Any idea how many websites – the ones you handle plus the ones these other hosts handle – were affected?

  6. I agree with the first posting: Rimu support and response to the outage was superb. I am a new customer and I respect the comments from those who have been here longer. Personally, I find the response from Colo4 to be substandard, and it’s all the more aggravating because if you read their whitepaper on power, you definitely come away with the understanding that what happened is more or less impossible.

    Getting truly uninterruptible power is expensive, and it’s a hidden expense. There is a great temptation to cut corners in areas like that. Not Rimu’s fault, and I know there will be new resources put in place by the excellent Rimu team to deal with such an incident in the future. Down for 2 hours? Understandable, but difficult. Down for 17 or so? Not acceptable, and Colo4 should take appropriate responsibility. Thanks for reading, and kudos to Rimu for the herculean efforts!

  7. I agree with the customers above. Actually, reading how you responded to the outage, what kind of issues you dealt with made me like RimuHosting even more. Thanks for the great service.

  8. Generally agreeing with the above posters. I went through about one hosting company a year for many years, before settling with Rimu a few years ago.

    BUT, yeah, some of the single points of failure seem obvious in retrospect, like “One of the things that was unavailable was our DNS management UI/API. This prevented customers from changing their DNS records from unresponsive servers to running servers.”, and “We need better fault tolerance and failover on our RimuHosting core infrastructure servers. So we are better able to communicate with customers.”

    I think that my post on Facebook about something like StandingCloud for the “communication” portions, at least, of rimuhosting.com was too casually dismissed. Rimuhosting.com should never go down, and I believe that StandingCloud is the ultimate solution. The social networks are nice for those who use them, but I would much prefer to be able to get updates and chat at rimuhosting.com.

    Thanks for doing all you do.

  9. Biodegradable material hit the fan thoroughly this time. Given your preconditions I think you handled it well & professionally. You’re good people to have in the boat when it storms, though I am at this stage a bit less impressed with Colo4.

    One of our three servers had unaffected uptime (although disconnected from the network). Was this due to it being on correctly connected A/B (74.50.62.120)? Can we pay extra to have a server connected to A/B? Will Colo4 test so A/B actually goes to unique sources? I never saw A/B listed as an option when ordering a server; is it only for routers etc?

    Also, there were rumors on a webhostingtalk forum that there would be a 2nd downtime when they did some repairs to the damaged powerthingie. Is that true?

    Thanks,
    Marcus

  10. The fact that the RimuHosting web site was down is troubling, since there was no way to get news.

    Colo4’s web site was also down.

    MY DNS at RIMU was INACCESSIBLE, so I could not switch over to backup server in Ezzi Data Center in NY, even if I wanted to.

    Having ALL THE EGGS in one basket is very problematic.

    RimuHosting needs TWO reliable data centers in the US, not just one.

    Ezzi is not reliable – that is why Rimu abandoned it. Now, Colo4 is also in question.

    Why was the new Colo4 ATS unit not ALREADY INSTALLED, waiting to be turned on, when needed?

    Hugh

  11. I have been a loyal, happy customer of Rimuhosting for a really long time — maybe even from the first year (2001?), in NY and Dallas. I still am :)

    I think you handled everything so well — especially communicating with your customers over so many channels with real information instead of some standard corporate talk — I appreciate that. I have always appreciated the amazing level of personal customer care you provide. The power outage event really shows what your company/our hosting company is made of, and I am still a happy loyal customer. I don’t know how I would deal with (my) customers (eek!) but actually I think your responses are a good model, and I hope that my future customers (for one site newly re-launched and another one soon) will be understanding too!

    Some things moving forward that would be helpful:
    1. How do I know or check that some services are running ok, even if they are back up? For example, if mysql and postgres just restart, is it ok or do I have to check further?

    2. I run/help with 4 servers on Rimuhosting. How do we make things more robust without spending a lot of money on duplicating service around the world? For example your temporary page: it would be good for any site to have a fallback in case the main server is not up. A howto on throwing a couple of switches and creating a temporary host site would be good (like change your DNS setting and post the page on the temp site).

    3. I am not so confident of the Dallas facility as I am of Rimuhosting…maybe have a second basket for all your eggs?
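On the first question, one minimal first check after a restart is whether each service is accepting TCP connections again. The sketch below assumes the default ports (3306 for MySQL, 5432 for PostgreSQL) and uses hypothetical helper names. A listening port shows that the daemon started, not that the data survived intact, so treat it as a starting point rather than a full check.

```python
import socket

# Assumed default ports; adjust if your services listen elsewhere.
DEFAULT_PORTS = {"mysql": 3306, "postgres": 5432}


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_services(host, ports=None):
    """Map each service name to whether its port accepts connections."""
    if ports is None:
        ports = DEFAULT_PORTS
    return {name: port_open(host, port) for name, port in ports.items()}


if __name__ == "__main__":
    for name, ok in check_services("127.0.0.1").items():
        print(f"{name}: {'listening' if ok else 'NOT reachable'}")
```

If a service is listening, the next step would be to read its own error log and run a few representative queries, since an unclean shutdown can leave damaged tables behind a daemon that otherwise starts normally.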

    Thank you Peter, Carl, Glenn, Liz, Elton, Felicisimo and everyone!

    All the Best,
    Daniel

  12. I appreciate that this was a horrible event to wake up to and that the Rimuhosting staff scrambled to correct the problems. I recognize also that Rimuhosting has been at the mercy of third parties with whom it contracts to provide service to its clients/customers. There are only so many variables one can reasonably manage and failures will happen from time to time — just the nature of the beast.

    However, one thing I don’t understand is how — short of a direct lightning strike — a power event could corrupt some servers. It’s as if this Dallas data center has no provision for UPS battery backup so that servers could be brought down (relatively) cleanly.

    I’m getting the impression that many servers were simply deprived of power on the spot — no clean shutdown process. As a result, the file systems (and some VPS’s) may have been corrupted.

    I don’t expect uninterrupted power, but I do think it’s reasonable that secondary power be available at least long enough for on-site technicians to manually shut down the servers to ensure a better start-up process once power has been restored.

    If I’m mistaken and if battery backup was in place, then what led to the inability of a number of servers to come back up? (This question excludes config-related issues such as SSL certs and the like.) If I’m not mistaken, what measures are being taken to correct this so that future power events can be handled more gracefully by this data center?

    Thanks for your work and for providing direct and straightforward status reports to your customers.

  13. Overall a great effort in recovering from a difficult outage. Clearly bullet proof redundancy for power supplies is easier to talk about than implement.

    Once Colo4 have completed their review and root causes / improvements are identified, it would be good if you could share a summary.

  14. Hi. I also agree with the first post. My only comment would be: it would be nice if you had your blog/DNS on a different network than your main website, because when the datacenter went down everything went down. I could not find any notice anywhere as all your websites were unavailable. (Also, I didn’t find anything on Twitter; that would be a good place to post update notices.)

    Regards, MV.

  15. 1. I don’t buy the “it wasn’t the heat” argument. I was in Dallas 4 weeks ago and the heat was extreme and it has not gotten any better in the meantime. Not enough to kill the facility but enough to pick off the odd weakling in the oldest part of the facility.
    2. This incident followed a 3hr outage earlier in the week when a PDU died.
    3. Mine is one of the servers that stayed up and my logfiles indicate the trouble started at least an hour before COLO4 admitted to it on their status page. (mental pictures of chickens with heads missing) The fact that Rimu lost multiple power supplies suggests there must have been a pretty good bang/power spike.
    4. Given there was a big power spike I would be concerned about the reliability going forward of many of the servers, routers and other stuff. I have a collection of dead PCs from the time a few years ago when there was a machine shop with large lathes in the building for about 24 months. Both before and after there have been no problems but while they were here even top quality UPSes couldn’t stop the fatalities.

  16. I totally agree with the kudos to the Rimu Team. It was obvious from the traffic that you were in the middle of it. I have been more than pleased with the reliability & support from Rimuhosting. Obviously, your team knows what did and did not work on a “black” startup and will work to fix those things.

    I would suggest giving some consideration to planning client communications in a disaster situation such as this. As is always the case when watching disasters from the sidelines, the information never comes often enough or accurately enough. I became a fan of Twitter for these types of episodes. It seems that campfire was good but got overloaded? I guess my biggest suggestion is that some planning should be given to how & what to communicate so that it is a little more structured. A lot of people seemed to be taking your information as absolute (based on Rimu’s great support track record), only to find out that the “7PM ETA” was for power restoration not server restoration which came several hours later. Getting the right information out in the “fog of battle” can make the difference between a minor disaster and a major fiasco.

    One of the key issues that drew me to Rimu and has kept me here is that the support is up front and does not pull any punches. I am not suggesting you try to script a disaster. I am suggesting that you plan how information is gathered, verified and distributed. If you can’t verify a piece of information but feel it needs to be distributed, be clear that the information is tentative, etc.

    According to the information distributed during the event, Colo4 has installed/rigged a temporary ATS and will need to replace that ATS which means another power outage. Can we get some updates on when and how long that will be?

    Is someone doing a root cause analysis to determine what happened with the ATS and what needs to be done to prevent a recurrence?

  17. > Only to find out that the “7PM ETA” was for power restoration not server restoration which came several hours later

    Perhaps I’m (together with Peter Melling) in the minority here, but my server (VPS) also stayed alive during the outage. My first apache log entry is at 23:18 UTC (that would be 6:18PM CDT, I think, since CDT is UTC-5).

  18. Been a happy customer since 2007. These things are unfortunate. Rimuhosting’s support in the past when I needed it was top notch. It’s one of the reasons why I haven’t moved my site elsewhere.

  19. I just got up after actually getting some decent sleep, and have been reading through the comments here.

    The power outage was something we had no control over, nor the fixing of it, so it was particularly hard for us since we were unable to do much other than try and find out information.

    Once the power came on we realized our core router (which has both A/B redundant power feeds) was down, and this should never have gone down. Once that was back online, and we rebooted the switches, a majority of customers were back online.
    At last we were able to actually do something; the worst part is being down and being unable to do anything about it.

    Most people were up fairly quickly at this point, but some had dead HDDs or similar, and took up to another 24 hours or more to get up and running. One of the host servers had no network connectivity due to lost configs, which took some time to bring online.

    Whilst the outage was bad, and we will be looking into what happened more, I do not ever recall any other outages at Dallas in the entire time I have worked at Rimuhosting (many years now).
    We are going to be looking into what we can do to prevent future outages; we have already talked a few ideas over in the brief breaks we had.

    You probably also need to ask what you can do to set up your own failover. There are several methods available, from VPS duplication to cheap/free hosting and DNS changes. I will put up a blog post in the next couple of days and try to cover a few methods for you.

  20. Also, well done to Rimu on the handling of it. Unfortunately I was in a vulnerable spot with a client last week and this outage provided the ammo one of their employees needed to push them into the arms of another vendor, and I’ve lost some good passive income as a result. :-(

    Nothing to be done about it now, but I am interested in Rimu’s mitigating actions for the future. Regarding A/B power, one of my VPSes lost power and the other didn’t. I’d like to see a way to ensure that my VPSes are all on servers with A/B hookups. Any chance we could see that status in the move tool?

  21. I’m a little late posting to this blog, but I hope that “better late than never” applies here.

    My biggest frustration was the lack of communication when things started happening. I love RimuHosting support, *BUT* it was initially impossible to get information from RimuHosting due to the nature of this outage. When Rimu’s own servers are down, that means no live chat, no email support, and of course you don’t offer any phone support. I don’t recall the exact sequence, but I actually went looking for Peter’s personal Facebook account with the hope of finding personal contact information for him. After sending Peter a Facebook message, I stumbled upon individual RimuHosting support people using their personal accounts to communicate.

    I suspect RimuHosting will be quicker to use Facebook & Twitter in the future, but I think it’s reasonable to ask for PROACTIVE communication. When something major happens, or you even think it’s happening, you should be able to send emails, SMS messages, & perhaps even recorded telephone messages to anyone who wants it. I’d also rather get a “false positive” notification that might not really impact me.

    None of us want our customers telling us about a problem. We’d rather hear it directly from RimuHosting as quickly as possible, so we might enact our own fail-over plans. Facebook or Twitter are nice for backup communication in general, but they still require me to go monitor that. Please provide a proactive notification. Email is probably fine, as long as I can give multiple email addresses (to include one that will turn into an SMS message).

    • Heya Tom,
      It was hard for us, because when it first happened there were few staff on duty, and their main concern was to find out what was going on, where and how. That information was not available to us at all at the time. We were in the same situation as you, trying to find out what was going on. As soon as the alert went up, everyone available came on duty to help deal with it, and we worked 18 hours straight trying to deal with it.

      We are sorry it took some time to get information from our upstream, and pass that on. We will try and get that faster next time.

  22. I know this post is long after the power outage event, but my comments will be pertinent.

    I have seen the Dallas Data Center redundant power supplies fail before (several years ago there was a worse event). Rimuhosting always does a great job of responding to any outage, but they can’t prevent this from happening.

    In my opinion, if you really want 100% uptime, just get a second “backup” VPS or dedicated server in a data center far away (mine is in the UK Rimuhosting data center). It won’t be used much for failover, but you can rsync your own backups to it.

    Set up DNS failover such that your UK backup server is your *primary* DNS server, and your Dallas server (primary http server) runs a slave DNS server (I use bind9). Then make the UK server monitor your Dallas server, and let it trigger DNS failover when/if Dallas fails.

    When the UK server detects failure of your Dallas server, it changes the DNS A record to itself. That way your website won’t go offline more than a couple of minutes if Dallas fails. (Dallas could *disappear* from the map and this would work.)

    If you set it up right, your backup web server can even take orders from customers. At the very least, you will not lose your web presence.
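The monitoring step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the commenter's actual setup: the class and function names are hypothetical, the IP addresses are documentation placeholders, and the DNS update itself is left out (with bind9 it might be done through a dynamic update, for example with `nsupdate`, if that is enabled on the zone).

```python
import urllib.request


def site_is_up(url, timeout=5.0):
    """One health check: True if the URL answers with a 2xx/3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except OSError:
        # URLError and HTTPError are OSError subclasses, as are timeouts.
        return False


class FailoverMonitor:
    """Decide which IP the A record should point to.

    Hypothetical helper: switches to the backup only after `threshold`
    consecutive failed checks, and switches back once the primary answers.
    """

    def __init__(self, primary_ip, backup_ip, threshold=3):
        self.primary_ip = primary_ip
        self.backup_ip = backup_ip
        self.threshold = threshold
        self.failures = 0
        self.active_ip = primary_ip

    def record(self, primary_up):
        """Feed in one check result and return the IP that should be live."""
        if primary_up:
            self.failures = 0
            self.active_ip = self.primary_ip
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.active_ip = self.backup_ip
        return self.active_ip


if __name__ == "__main__":
    # Placeholder addresses from the documentation ranges.
    monitor = FailoverMonitor("192.0.2.10", "198.51.100.20")
    for result in (True, False, False, False, True):
        print(result, "->", monitor.record(result))
```

In practice the backup server would call `site_is_up` against the primary on a timer and push a DNS change whenever `record` returns a different IP than before. Requiring several consecutive failures avoids flipping DNS on a single dropped request, and the "couple of minutes" figure depends on the A record having a short TTL, since resolvers keep serving the old address until it expires.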

    If someone needs help with this, contact me through .