Today I’m launching a new section of my website called ‘Chat with Chimpie’. My hope is to ask the questions that many of us have but don’t know how to get answered.
My first interview is with Josh Banks, Support Manager, and Josh Loe, Customer Service Manager, of HostGator. The interview is in response to the routing incident at ThePlanet on Monday (May 3, 2010). Thousands of websites were affected. Sites appeared to be inaccessible to some end users, while others were able to view them just fine. Within moments of the router going down, Twitter and other forms of social media lit up with people screaming about their servers failing. There was a lot of misinformation being spread around, so I wanted to give HostGator, one of the hosting companies affected by the outage, a chance to explain what happened.
Early Monday morning there was a routing issue between ThePlanet’s Houston and Dallas datacenters. Can you explain what happened in a little more detail?
ThePlanet has determined that there was an issue with a faulty line card. This line card is suspected to have contributed to both outages, at 2am and 8am. This is really the only information we have from ThePlanet at this time.
Was this a software or hardware issue?
This issue was caused by a hardware failure.
In the beginning, it was Tweeted, “Currently all servers at ThePlanet are down. This includes our chat and ticket servers. We will keep you posted.” Was this an accurate tweet? Were the servers really ‘down’?
Yes and no. I tweeted that because I was informed that all our servers were down because our office could not connect to them. Certain ISPs around the world could still connect; however, our office could not. This was corrected in a few minutes on Twitter as we learned more about the situation. During the entire networking issue the servers never actually lost power. When you are trying to stay on top of a situation and keep everyone updated, sometimes things are posted prematurely. I certainly learned a lesson from that.
Why were some websites working while some were not?
The majority of our shared servers are located in one of the five Dallas area datacenters. Our dedicated servers are spread out from Houston to Dallas, TX. The reason some websites were working and others were not was simply because of the routing issue. Some ISPs were able to see the servers, and some were not.
As the routing issue was resolved, you tweeted, “Routes may still be reconverging.” What does this mean?
Routing convergence is the time it takes for BGP (Border Gateway Protocol) updates to filter throughout the world. This can be compared to DNS propagation; however, it is much quicker, usually 3-5 minutes unless something is still broken.
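To picture why different networks recovered at different moments, here is a toy sketch in Python. It is not real BGP: the network names, the links between them, and the idea that an update moves exactly one hop per round are all made up for illustration. It only shows that a corrected route reaches nearby networks first and more distant ones later.

```python
from collections import deque

# Hypothetical topology: which networks pass routing updates to each other.
# None of these names come from the actual incident.
LINKS = {
    "datacenter": ["transit_a", "transit_b"],
    "transit_a": ["datacenter", "isp_east"],
    "transit_b": ["datacenter", "isp_west"],
    "isp_east": ["transit_a", "office_isp"],
    "isp_west": ["transit_b"],
    "office_isp": ["isp_east"],
}


def rounds_until_update(links, origin):
    """Return, for each network, how many rounds pass before it hears
    an update that starts at `origin` (one hop per round, breadth-first)."""
    heard = {origin: 0}
    queue = deque([origin])
    while queue:
        current = queue.popleft()
        for neighbor in links[current]:
            if neighbor not in heard:
                heard[neighbor] = heard[current] + 1
                queue.append(neighbor)
    return heard


if __name__ == "__main__":
    result = rounds_until_update(LINKS, "datacenter")
    # List networks in the order they learn about the corrected route.
    for network, rounds in sorted(result.items(), key=lambda item: item[1]):
        print(f"{network}: hears the corrected route after {rounds} round(s)")
```

In this made-up layout the network labeled `office_isp` sits three hops from the datacenter, so it is the last to see the fix, even though the servers themselves were reachable from closer networks the whole time.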
Why did HostGator’s phone, chat and ticket systems go down?
Our phone, chat and ticket systems are all hosted on servers at ThePlanet. While we were still receiving phone calls, chats and tickets from people who were able to see the servers, our Houston headquarters was not able to reach the systems. You would log into live chat and see 265 people in front of you with zero agents taking chats, when actually there was a full staff here at our office.
Why did HostGator’s website and forum stay up?
The website and forums stayed up for certain people, just as the phone, chat and ticket systems stayed up. It all depended on your ISP.
How beneficial has Twitter been to you over the past year?
Twitter is very beneficial to us here at HostGator. It allows us to communicate with our customers in a more personal manner. It also allows us to notice common issues more quickly, because as we all know, if there is a widespread issue with anything the Twitter community will be all over it. In the past month I have taken over as Customer Service Manager here at HostGator, and part of that responsibility is our Internet presence. This consists of Twitter, Facebook and all forums that are related to hosting.
What other media do you use to notify customers of large outages?
What should your customers have taken away from this?
What our customers should have taken from this is that we are dedicated to keeping them updated in situations like this. One of the biggest things that we stress is that we are a very honest company. We have never tried to hide any sort of outage or compromised server, or to deceive anyone, when it comes to our network and services. One of the biggest things that I can suggest to all customers is to join Twitter and follow us. It is an amazing network of people, and even if there is an outage here, we will be able to keep posting status updates.
Chimpie’s final thoughts
While we as customers hate to see our sites inaccessible, I know the folks at HostGator really had a lot on their plate when the line card failed. Knowing that your customers are affected and knowing that there is little you can do must be frustrating. And knowing that your customers want to talk to you, and you want to talk to them, but you can’t, must be even worse.
When outages occur, we as customers have to be patient. Most issues can be resolved very quickly. Some cannot. We are at the mercy of our hosts. There are some things we can do to help work around these issues, and I’ll cover that in another post.
I would like to take a moment and thank Josh & Josh for taking the time to answer some questions about this outage. I know I learned a little more about hosting and I hope you all did as well.