Cerb4 On-Demand Planned Datacenter Maintenance (1 hour; September 7th at 1AM Pacific)

On-Demand September 3rd, 2009

posted by Jeff Standen

We just received this notice from SoftLayer (Seattle datacenter):

Maintenance ID: 4443
Date: September 07, 2009 (09/07/2009)
Start time (PDT): 01:00:00
End time (PDT): 02:00:00
Services affected: Public/Private Network
Device: FCR01.SEA01, BCR01.SEA01
Reason: IOS Upgrade
Location: SEA01 (Seattle, WA)
Duration: 1 hour

=================================================================
SoftLayer Engineers have identified a potentially service impacting bug in the IOS code currently
running on these routers. Due to continued debug messages and traceback errors appearing on
BCR01.SEA01, Engineers have scheduled a reboot of FCR01.SEA01 and BCR01.SEA01 to upgrade to the
latest IOS code. Due to the severity of this bug, the timeframe between this notice and the
maintenance window has been shortened.

During this maintenance, customers will notice a complete loss of connectivity to their servers on
both the frontend and the backend. It is recommended to fully disconnect iSCSI and NAS drives
during this maintenance window. While the upgrade duration is scheduled for 1 hour, we only expect
between 15-20 minutes of downtime as the routers reload and fully boot.

A notice will also be posted in the portal prior to the maintenance, during the maintenance
informing you of the progress, and after with completion details.

=================================================================

If you have any problems after this time frame with regard to connectivity, or if you have any questions regarding the
maintenance at any point, please open a ticket in the customer portal.

We appreciate your patience during this work and welcome any feedback.

Thank you,

Network Engineering

These maintenance windows have been dependable over the past several events.

-Jeff@WGM

Cerb4 On-Demand Planned Maintenance for 10-60 mins on July 31 2009 at 1AM PDT

Community, On-Demand July 28th, 2009

posted by Jeff Standen

We’ve received a network maintenance notice from one of our data centers (SoftLayer) for 1AM on Friday, July 31st.  The window is between 10 and 60 minutes and it will affect connectivity for a subset of Cerb4 On-Demand users.  This is part of ongoing network renovation at SoftLayer, and the previous maintenance so far has been quick and painless.

Here’s a copy of the notice:

Date: 07/31/2009 (Friday)
Start time (PDT): 01:00:00 (1:00 AM)
End time (PDT): 02:00:00 (2:00 AM)
Services affected: Public network
Device: FAS02.SR01.SEA01
Location: Seattle, WA
Duration: 1 hour

===================================================
SoftLayer Engineers will be replacing the upstream front end
aggregate switch that provides connectivity to the rack level
switch to which your server is connected.

Customer Impact: During this maintenance, customer servers
will not be reachable on the public network. While the
maintenance window is set for an hour, we expect no longer
than 10 – 15 minutes of downtime.
===================================================

If you have any problems after this time frame with regard to connectivity, or if you have any questions regarding the maintenance at any point, please open a ticket in the customer portal.

We appreciate your patience during this work and welcome any feedback.

Thank you,

Network Engineering
Softlayer Technologies, Inc.

-Jeff@WGM

On-Demand Maintenance for 1 hour on July 18 (Saturday) at 1AM PDT

Community, On-Demand July 15th, 2009

posted by Jeff Standen

We received notice from SoftLayer that there will be some private network maintenance this Saturday for a 1-hour window starting at 1AM Pacific Time.  The notice only affects a couple of our On-Demand resources.  In the past, private network maintenance has also affected public network connectivity about half the time.

Here’s a copy of the notice we received:

Date: 07/18/2009
Start time (PDT): 01:00:00
End time (PDT): 02:00:00
Services affected: Back End / Private network
Location: Seattle, WA
Duration: 1 hour
=====================================
SoftLayer Engineers will be replacing the upstream
back end / private aggregate switch that provides
connectivity to the rack level switch to which
your server is connected.

During this maintenance, customers will not have
access to back end / private services (DNS,
updates, update servers, NAS, iSCSI, etc). While
the maintenance window is set for 1 hour, we
expect no longer than 10 – 15 minutes of downtime.
=====================================

Update:

Employee Response – 2009-Jul-15 18:47 (GMT-0800) [Update 2]
Due to unforeseen issues, Softlayer Network Engineers replaced fas01.sr01.sea01 starting at 5:59pm PDT after the device suffered multiple crashes after opening this ticket to scheduling the maintenance. At 6:06pm PDT the replacement switch finished booting and all connectivity was restored to the backend network. We apologize for any inconvenience this has caused.

- Network Engineering

-Jeff@WGM

Cerb4 On-Demand: Partial Network Unplanned Maintenance for 20 minutes

Community, On-Demand July 1st, 2009

posted by Jeff Standen

Hey guys,

We’ve been given about 45 minutes of warning that the datacenter (SoftLayer) needs to perform maintenance on their power system.  This is going to affect one of our Cerb4 On-Demand machines tonight at midnight (Pacific Time) for about 20 minutes.  It will affect another Cerb2/Cerb3 On-Demand machine on Tuesday, July 7th at midnight (Pacific) for another 20-minute window.  This is an infrastructure issue and doesn’t involve hardware on our particular machines.

We apologize for the inconvenience.  We’ll be standing by to make sure everything comes back up properly for the affected clients A.S.A.P.

Thanks!
-Jeff@WGM

Update 1:22AM Pacific: The data center maintenance for the affected machine is complete.  The machine is doing a routine filesystem check (scheduled for the first reboot after every six months) and things should be back to normal shortly.

Update 1:26AM Pacific: Everything is back to normal.

Cerb4 On-Demand: Upstream Provider Maintenance (Sat, Apr 18 2009)

On-Demand April 18th, 2009

posted by Jeff Standen

SoftLayer has posted an announcement about some failing network hardware in their Seattle datacenter.  We have several boxes over there for our Cerb4 On-Demand service.  So far we’ve only noticed one of our machines having packet loss issues.

They were originally planning to swap the failing equipment at midnight (CDT), but the deteriorating situation has forced them to swap it immediately.  Their latest announcement says the maintenance started at 6:30PM CDT (4:30PM PDT) and should last about 15 minutes.

This is an upstream network issue and doesn’t involve the hardware on our machines; there is no impact to any data.

Keep an eye on our Twitter for real-time updates:
http://twitter.com/cerb4

-Jeff@WGM

Partial Cerb4 On-Demand Unplanned Maintenance (~20 helpdesks)

On-Demand March 29th, 2009

posted by Jeff Standen

This morning (Sunday, March 29th 2009, Pacific) we had a RAID inconsistency issue on one of the newer On-Demand servers.  The system unmounted the drive, presumably to prevent corruption.  This caused unplanned downtime for about 20 helpdesks on our network.  There was no real risk of data loss (we don’t depend exclusively on the RAID).

It was a quick 5-minute fix to restore RAID integrity, but it took us longer than necessary to be notified about the issue.  Our monitoring on this particular machine wasn’t checking the integrity of the mounted partitions, just metrics like disk space (and a missing partition wasn’t considered at risk of becoming full).  The machine was still online and pingable, so it didn’t trigger other alerts.  That’s something we’ll improve.
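Here is a minimal sketch of that kind of check in Python. It is an illustration only, not the monitoring actually running on these servers: the function name `check_mount` and the 90% threshold are made up for the example. The idea is to confirm that a path really is a mount point before trusting a disk space reading for it.

```python
import os
import shutil


def check_mount(path, max_used_fraction=0.9):
    """Return a list of problems found for a path that should be a mount point.

    Both the function name and the 0.9 threshold are hypothetical.
    """
    if not os.path.ismount(path):
        # An unmounted partition leaves an ordinary directory behind, so a
        # space check here would silently measure the parent filesystem.
        return [f"{path} is not a mount point"]

    usage = shutil.disk_usage(path)
    if usage.total and usage.used / usage.total > max_used_fraction:
        return [f"{path} is {usage.used / usage.total:.0%} full"]
    return []


if __name__ == "__main__":
    for problem in check_mount("/"):
        print("ALERT:", problem)
```

Checking the mount first means a missing partition raises its own alert instead of hiding behind a healthy-looking disk space number.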

I apologize to the affected customers, and I’ll remind our team that we need to remain diligent about emergency weekend support even when there are zero issues for long stretches.  Even though we shouldn’t need to be told about issues, this is a reminder that we can’t count exclusively on automated monitoring.

Everything should be fine on that server now, no data was lost, and queued up mail has been delivered.  If you still see anything strange going on, let us know.

Thanks!
-Jeff@WGM

UPDATE 2009-03-29 2:30PM Pacific: The datacenter just sent the following note:
“Hello, the raid appears to be in perfect health however it appears that the raid battery needs replacement as I see logs of it charging and discharging. This can affect the performance of the raid array and should be replaced as soon as possible.”

We’re going to let them swap the RAID battery (BBU) this evening (Sunday) in a window between 5pm and 8pm Pacific, which is fairly off-peak.  This will only affect the same ~20 customers whose helpdesk resolves to an IP in the 208.43.xxx.xxx range.

UPDATE 2009-03-29 5:11PM Pacific: The machine has been brought down for a few minutes, as planned, to replace the failing BBU on the RAID controller.

UPDATE 2009-03-29 5:25PM Pacific: The machine is back online.

Cerb4 (Build 890) is the Latest Stable Release: 4.1 Maintenance, On-Demand scalability

Community, Debate, Documentation, Mailbag, On-Demand, Tips & Tricks March 2nd, 2009

posted by Jeff Standen

4.1 (Build 890) is a quick maintenance release to address some minor issues reported by people who recently upgraded to the 4.1 release.  This update also includes some project-wide improvements to support hosting multiple instances of Cerb4 from a single copy of the project files.  I’ll write more about that later.

Here’s a summary of the main fixes:

  • Fixed an issue with the new ACL (permissions) system where users without access to particular plugins (e.g. Watchers/Audit Log) could prevent those plugins from triggering for everyone.
  • Fixed an issue with Internet Explorer 6/7 related to non-English languages being defined as the browser locale.  Zend Framework 1.5 changed its locale auto-detection behavior and we inherited the bug.
  • Fixed a few more places where the permissions weren’t being restricted properly (mostly: close ticket, assign ticket).
  • Fixed the issue with the Support Center not allowing certain browsers to cache resources (namely images), which showed up most often in SSL mode.  Things should be speedier again.
  • A handful of other things.

In addition to the bug fixes, there were also a few improvements.  The bigger improvements here don’t directly apply to day-to-day use of the app, but make room for more scalability (cloud computing, distributed installations):

  • Usability: You can now remove attachments that you had added to a ticket reply without having to discard and start over.  It’s silly browsers don’t provide this functionality since they’re the ones rendering the file upload control.
  • Platform: The /libs/devblocks/tmp directory has been moved to /storage/tmp to simplify permissions and scaling.  The ‘storage’ directory should be the only place on the filesystem needing write access.
  • Platform: The Cerb4 code now properly supports symlink installations.  We used realpath() in PHP previously, which dereferenced virtual paths and led to problems.  The only things you need at the instance level now are a framework.config.php file and the /storage/ directory; the rest can be symlinks to shared files.  This means you can cheaply share an APC cache between dozens of helpdesks per server.  We’ll write up some more instructions about this on the wiki.
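
To show why resolving symlinks causes trouble for shared installations, here is a small Python sketch (Python rather than PHP, purely for illustration). The directory names `shared` and `helpdesk1`, the `index.php` entry script, and the helper functions are all hypothetical and are not taken from the Cerb4 source. `os.path.realpath()` follows symlinks much like PHP’s realpath(), while `os.path.abspath()` leaves them alone. The demo assumes a POSIX system where creating symlinks is allowed.

```python
import os
import tempfile


def storage_dir_resolved(script_path):
    # realpath() follows symlinks, so a symlinked script reports the shared
    # copy's directory and every instance would land on the same storage.
    return os.path.join(os.path.dirname(os.path.realpath(script_path)), "storage")


def storage_dir_unresolved(script_path):
    # abspath() only normalizes the path text; the symlink is left alone, so
    # the result stays inside the instance directory.
    return os.path.join(os.path.dirname(os.path.abspath(script_path)), "storage")


def build_demo_tree():
    """Create shared/index.php and helpdesk1/index.php -> shared/index.php."""
    root = os.path.realpath(tempfile.mkdtemp())
    shared = os.path.join(root, "shared")
    instance = os.path.join(root, "helpdesk1")
    os.mkdir(shared)
    os.mkdir(instance)
    target = os.path.join(shared, "index.php")
    open(target, "w").close()
    link = os.path.join(instance, "index.php")
    os.symlink(target, link)
    return shared, instance, link


if __name__ == "__main__":
    shared, instance, link = build_demo_tree()
    print(storage_dir_resolved(link))    # ends in shared/storage
    print(storage_dir_unresolved(link))  # ends in helpdesk1/storage
```

With the resolved path, every helpdesk would point at the shared copy’s storage directory; with the unresolved path, each instance keeps its own.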

Thanks for being a part of the Cerb4 community!

-Jeff@WGM

The Return of Free, Instant, Private, Hosted Evaluations for 4.1

Community, On-Demand, Tips & Tricks February 25th, 2009

posted by Jeff Standen

We finally had a chance to rebuild an instant evaluations feature on our website.  It’s something we offered in the past, but the stats we collected at the time showed a majority of evaluations never got to the point where they actually had tickets in the system.  There was too much to set up first.  Our recent versions make it a lot easier to get things running from a fresh installation.

We’re going to give it another shot, because the shared online demo just gets too chaotic with everybody moving things around daily.  It’s a lot harder for new visitors to understand Cerb4 when they’re also trying to understand a lot of random content and arbitrary groups/buckets/etc.

It also didn’t help that the public demo was in lockdown so you couldn’t make major configuration changes to try things out.  That would make it really difficult for us to show off 4.1 features like custom fields, workspaces, and inbox filters, which all require access to configuration to custom-tailor a workflow.

Setting up the system to create installations on-the-fly was also helpful for pushing Cerb4 closer toward the “cloud” mentality for scaling.  We built in several new improvements to allow multiple installations to share a single copy of Cerb4, which should give another really good performance boost to our On-Demand service.  It’s useful to our development effort to have hundreds of concurrent copies of Cerb4 that we can optimize against; and while our existing On-Demand network is a constant source of performance data, it’s not a place we’re going to try out experimental improvements.

So, long story short, now anybody can get instant access to a private copy of Cerb4, hosted free on our network for 2 weeks, by visiting the following page on our website:
http://www.cerberusweb.com/tour/demo

The evaluation helpdesks are slightly different than our On-Demand trials, because the evaluation copies are never meant to handle real e-mail.  They won’t even try to deliver mail.  But with our ‘Simulator’ plugin (from Configuration) that shouldn’t be an issue since you can create an endless supply of incoming, realistic-enough mail.  And you can still reply to everything without worrying about test messages landing in real inboxes.

Enjoy!

-Jeff@WGM

On-Demand Unplanned Maintenance on 1 Server (~30 min)

Community, On-Demand February 17th, 2009

posted by Jeff Standen

About 15 minutes ago we picked up a monitoring alert on one of the Cerb4 On-Demand servers (Rearden.WGM).  We connected through the IPMI interface (remote console) and the machine is rebooting normally.  We’ll look at the cause of the reboot once it’s back online.  It’s currently running a filesystem check since it’s been a few months and we’re going to let it go ahead and finish that.  It should only take another 5 minutes, and I’ll post a status update.

Update 4:56PM PST: Everything came up normally.  We’re looking at the logs.  Performance might be impacted for 15 minutes while we cycle fresh backups.

Update 5:23PM PST: There’s nothing unusual in the logs, so if it happens again we’ll suspect hardware.  Last night’s backups have been shipped off-site a day early (as a precaution against backing up corruption) and we have a fresh copy of everything on the local redundant storage.  Any inbound mail during the 15-minute reboot should be synced up over the next couple of minutes as it’s re-delivered.

A post-mortem on the brief downtime experienced by a few On-Demand users this week

On-Demand January 16th, 2009

posted by Jeff Standen

Twice this week (Jan 14th + Jan 16th), about 5% of our On-Demand users likely experienced a brief outage during our night shift (Los Angeles, GMT-8).

Specifically, these users were able to pull up their helpdesk by URL but they couldn’t log in and were given an error message about connecting to their database.

Two nights ago when the issue first happened, we believed it may have been an isolated coincidence of several users purging massive amounts of spam at the same time, synchronized by some instructions we sent out a few hours before.  It seemed plausible, considering the issue came out of nowhere after months of smooth sailing.  That situation probably still played a part, but it wasn’t the whole story.

This morning we had several more reports of the exact same issue, but this time we managed to catch it in action.  The affected database server had no limit on connections by an individual user, and a single helpdesk was hoarding all the available slots by creating thousands of connections per second through their public Support Center.  A classic “denial of service”.  However, there’s no evidence it was malicious.

From time to time, a user will benchmark our servers during an evaluation.  With reasonable settings everything will be served up just fine.  With excessive settings, simulating thousands of simultaneous connections for a helpdesk that’s going to have 5 users, we’re supposed to throttle that account so it can’t affect the entire server.  For this particular set of machines, any client was able to use all available connections.  For the past year or so we’ve apparently just had some very polite customers on that hardware.

The trick with throttling is to set it to where it will never be hit by normal peak usage, and only by some runaway process.  We’ve re-established a per-user connection limit that’s much higher than our average usage statistics show across our network, but still low enough to not allow an individual helpdesk to monopolize the entire database server.  The potential still exists for a concentrated attack to distribute across multiple sites, but we realistically have to build in failure somewhere.  We could raise the maximum connection limit so high that nobody is ever refused, but servers would still become so overloaded they’d cease responding anyway.  By failing at a reasonable point, the machines are still fully usable to find and correct the source of an attack.
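
The effect of a per-user limit can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the database server’s actual configuration: the `ConnectionLimiter` class, the user names, and the limits of 10 and 3 are all invented for the example.

```python
import threading
from collections import defaultdict


class ConnectionLimiter:
    """Toy model of a server-wide connection cap plus a per-user cap."""

    def __init__(self, max_total, max_per_user):
        self.max_total = max_total
        self.max_per_user = max_per_user
        self._lock = threading.Lock()
        self._per_user = defaultdict(int)
        self._total = 0

    def acquire(self, user):
        """Return True if the user gets a connection slot, False if refused."""
        with self._lock:
            if self._total >= self.max_total:
                return False  # the whole server is full
            if self._per_user[user] >= self.max_per_user:
                return False  # this user has hit their own ceiling
            self._per_user[user] += 1
            self._total += 1
            return True

    def release(self, user):
        """Give a slot back when a connection closes."""
        with self._lock:
            if self._per_user[user] > 0:
                self._per_user[user] -= 1
                self._total -= 1


if __name__ == "__main__":
    limiter = ConnectionLimiter(max_total=10, max_per_user=3)
    granted = sum(limiter.acquire("runaway") for _ in range(100))
    print("runaway helpdesk got", granted, "slots")
    print("another helpdesk can connect:", limiter.acquire("neighbor"))
```

Setting `max_per_user` equal to `max_total` reproduces the misconfiguration described above: one runaway helpdesk takes every slot and everyone else is refused.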

Luckily, in this situation it was just a simple misconfiguration issue.  We aren’t having chronic performance problems, and once an issue like this occurs it’s not unexpected for it to rear its head a few times in quick succession before it’s caught.

I could do an entire blog post on the economics of high-availability, and I may go ahead and do that today while it’s on my mind.  The basic idea is that it costs at least 200-300% more to eliminate the last ~4.5 hours of yearly downtime in 99.95% availability than it costs to be up the other ~8,762 hours.  It requires duplicate hardware waiting for its 22.5 minutes of glory, on average, per month.  We’ve gone back and forth about cost vs. excessive reliability many times on our end, but we still feel you guys would rather pay a cheaper rate for 99.95% availability than pay several times as much to approach 99.999%.
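
The arithmetic behind those figures can be checked with a short Python sketch.  It assumes a 365.25-day year; the `HOURS_PER_YEAR` constant and the `downtime_budget` function name are choices made for this example.  Under that assumption, 99.95% availability leaves about 4.4 hours of downtime per year, or roughly 22 minutes per month, in line with the rounded numbers above.

```python
HOURS_PER_YEAR = 365.25 * 24  # 8,766 hours; an assumption for this sketch


def downtime_budget(availability_percent):
    """Return (hours of downtime per year, minutes of downtime per month)."""
    down_fraction = 1 - availability_percent / 100
    hours_per_year = HOURS_PER_YEAR * down_fraction
    minutes_per_month = hours_per_year * 60 / 12
    return hours_per_year, minutes_per_month


if __name__ == "__main__":
    for target in (99.95, 99.999):
        hours, minutes = downtime_budget(target)
        print(f"{target}% -> {hours:.2f} h/year, {minutes:.2f} min/month")
```

At 99.999% the yearly budget shrinks to a little over five minutes, which is why closing that gap takes standby hardware rather than tuning.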

All that said, our minimal downtime by design should be planned and at off-peak times. In this case it wasn’t planned or off-peak for a portion of our users, and was due to a misconfiguration on our end. We apologize for that.

We’ll do what it takes to prove we stand behind our service.  Feel free to leave comments with questions or thoughts.  You can also contact me directly: jeff (AT) webgroupmedia.com

Hopefully in the next few posts I’ll be writing under better circumstances! :)

-Jeff@WGM
