We had about a dozen reports about database connectivity issues this morning from Cerb4 On-Demand customers that all had a particular database server in common. The issue resolved itself last night and didn’t trigger any logs or network monitoring alerts (e.g. uptime, load). There wasn’t any data loss.

Our best theory at the moment is that by freak coincidence a bunch of customers followed the instructions we sent out last night to purge their spam histories at exactly the same time (we’re happy people found the instructions useful!).  The timing makes sense.  And even though we leave plenty of extra capacity on all our machines, it’s possible that this database server (MySQL) was “blocking” while waiting on the disk, which would have queued up connections until it hit a limit that stopped accepting new requests.  That’s entirely consistent with what people reported.
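
To make that theory a bit more concrete, here's a toy Python sketch of the failure mode. It is not our actual setup: the `simulate` function, the tick-based model, and every number in it are made up for illustration. It only shows how a server that finishes queries more slowly than new clients arrive will fill its connection limit and then start refusing requests, even though nothing has crashed.

```python
def simulate(arrivals_per_tick, finished_per_tick, max_connections, ticks):
    """Toy model: return a list of (open_connections, rejected_so_far) per tick.

    All parameters are hypothetical; real servers are far less regular.
    """
    open_conns = 0
    rejected = 0
    history = []
    for _ in range(ticks):
        # Queries that completed during this tick release their connections.
        open_conns = max(0, open_conns - finished_per_tick)
        # New clients connect; anything beyond the limit is refused.
        accepted = min(arrivals_per_tick, max_connections - open_conns)
        rejected += arrivals_per_tick - accepted
        open_conns += accepted
        history.append((open_conns, rejected))
    return history


# Healthy disk: queries finish faster than they arrive, so nothing piles up.
healthy = simulate(arrivals_per_tick=5, finished_per_tick=10,
                   max_connections=100, ticks=50)

# Disk-bound: queries finish slowly, connections pile up to the limit.
slow = simulate(arrivals_per_tick=5, finished_per_tick=1,
                max_connections=100, ticks=50)

print("healthy:", healthy[-1])
print("slow:   ", slow[-1])
```

In the slow case the open connections climb until they sit at the limit and the rejected count keeps growing, which is roughly what a "can't connect to the database" report looks like from the outside.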

If that’s what happened then this shouldn’t be an issue that pops up very often.  The database has to do a bit of work to remove 100,000s of spam tickets at the same time when people aren’t in the habit of cleaning up more often.  That’s the exact reason we felt it was important to send out the best practices guide for Cerb4 anti-spam.  The unintended consequence was kind of a digital “bank run” to clean up databases.

While overloaded, mail just wasn’t being checked — it wasn’t lost.  Once the maintenance finished mail was caught up.

We’ll keep an extra eye on that particular hardware just to be on the safe side.  I could throw around some stats about our 99.99% uptime, but those of you who were impacted by this know what our service and performance have been like up to this point. :)

If you have any questions about this, or anything, feel free to comment or open up a ticket from our website.

Thanks!
-Jeff@WGM



12 Comments to “A brief *.cerb4.com database hiccup last night on one On-Demand server”

  1. Thomas | January 14th, 2009 at 9:49 pm

    Thanks for the explanation and for handling it so quickly. We’ve only been an On-Demand customer for a few months, but we’re really happy with the response time and how quickly resolution was achieved.

  2. Jeff Standen | January 15th, 2009 at 12:45 am

    Hey Thomas! I’m glad it didn’t ruin your day. I haven’t seen anything strange going on all day through our peak use times, so our theory was probably right. :)

    We’re happy to have you!
    -Jeff@WGM

  3. Mark | January 16th, 2009 at 5:08 am

    It appears that the perfect storm conditions are back this morning. We’re getting the same connection errors that we had two days ago.

  4. Steven Dobson | January 16th, 2009 at 8:41 am

    Hi Jeff, we are a UK customer using your On-Demand service and only went live this Wednesday, January 14th. So far we have only been able to use Cerberus on 1 of the 3 working days. Not great, and now I am under real pressure from my peers. I really need contacting to find out what I can do before I am told to find a more reliable solution. Hopefully this is just the worst possible timing, but I’m sure you can see my company’s concerns? Regards, Steven

  5. James Morris | January 16th, 2009 at 8:49 am

    Granted, this may not be related, but our helpdesk system was down for three hours on the 14th, and despite leaving four voicemail messages and logging a call via your webpage I received no call back. The problem has happened again today and we have lost another two hours. It has just righted itself, but again we are losing valuable time. I receive no reply when ringing Sales or Support, and every other mailbox is full. This is unacceptable and I am wondering what we are paying a fee for. Someone may be ‘happy with the response time and resolution’ but I certainly am not.

  6. Jeff Standen | January 16th, 2009 at 2:37 pm

    Hey guys. We just found the real cause this morning. Even though the “perfect storm” scenario the other day didn’t help, the issue had to do with a specific helpdesk monopolizing the available database connections without a limit. I’ll do a new blog update on it. And I really apologize.

  7. Jeff Standen | January 16th, 2009 at 2:38 pm

    @Mark I do wish the “perfect storm” issue was a one-time thing, but I’m at least happy to know exactly what the cause was now. Not knowing yesterday, even though it was resolved for the time-being, was still unnerving.

  8. Jeff Standen | January 16th, 2009 at 2:44 pm

    @Steven I assure you it’s the worst possible timing. We haven’t had any server or network issues for a long stretch on these Cerb4 machines (feel free to ask around on the forums), but issues like this do tend to pop up in tribes. This issue is also isolated to a specific population of hosted customers (due to the abuse by a particular helpdesk), and we’d be more than happy to move you to a different server group on our network. That shouldn’t be necessary with the new cap in place on the offending account, but your team may find it reassuring. I’ll be doing a comprehensive blog post in a few minutes about it.

  9. Jeff Standen | January 16th, 2009 at 2:54 pm

    @James Our office is located in the Pacific timezone (Southern California). Do you happen to recall what times you were calling? We have people on standby 24/7/365, including myself, but this particular issue didn’t trigger any monitoring alerts. My blog post in a couple minutes will include a wrap-up of why that was. (Basically, connections were throttled while well under capacity due to abuse.)

    We’ll come up with a reliable emergency contact solution for our international clients. I guess on one hand it’s a good thing it hasn’t been necessary up to this point, but I’ll be the first to admit there’s no excuse for taking 1-2 hours to apply a 30 second fix.

  10. Cerberus Helpdesk Blog » Blog Archive » A post-mortem on the brief downtime experienced by a few On-Demand users this week | January 16th, 2009 at 5:34 pm

    [...] nights ago when the issue first happened, we believed it may have been an isolated coincidence of several users purging massive amounts of spam at the [...]

  11. James Morris | January 19th, 2009 at 1:32 am

    Hi Jeff,

    We were calling around 3-4pm, which should be between 7-8am your time. I still haven’t received any phone calls or replies to the calls logged via your website, but if the problem is now fixed then that is ok.

    Thanks

    James

  12. Jeff Standen | January 19th, 2009 at 1:44 am

    @James
    Yeah, everything should be fixed up (there’s a subsequent blog post about it). We capped the vector that was abused. The client had actually maxed out their 2TB of transfer on their host (elsewhere), so there was definitely some denial-of-service action going on. They had us shut down their helpdesk, but at least we’re aware of that potential on the other machines, and the new machines we set up. Thanks for being patient, and I apologize we never called you back. I imagine once the main team got in that morning they had their hands full on tickets. Voicemail is a good way to let us know what’s going on, but there are times we’re aware of an issue and just don’t have the time to call every single person back. Be sure to leave an e-mail address with your messages if you want to be added to the notifications list about something. :)
