Sunday, August 26, 2007

The Six Million Dollar Shopping Cart

Corporate websites can be a minefield of managers and thinkers who honestly believe they know what they want, but very rarely seem to have actually used the Internet much at all.

Usually big changes to corporate websites follow a specific pattern:

1. PHB reads in a magazine that the hip new thing for companies to do is leverage A and B, in a mashed-up 2.5 Internet thingamajig. They've no idea what this all means, but hey, they don't want to be left behind in the rush of companies getting on the next big thing.
2. PHB tasks one of their minions to go forth and investigate this. Options are considered, and a recommendation made.
3. PHB ignores recommendation, and chooses $vendor because they have a nice website and, anyway, we have a special relationship with them.
4. Various things happen, eventually resulting in poor sysadmins deploying an ill-defined system that never really works.
5. Go back to 1.

Occasionally, however, someone new starts and the cycle is disrupted. Even more unusually, sometimes the ideas the new person has cross my desk. This is usually a bad thing for them.

Glancing over the new person's proposal for redeveloping the corporate website, it strikes me how flimsy the whole paper is. It starts off by explaining the product they've chosen (without any comparison with other products), how they intend to structure the site, and how they plan to palm off content management to the business units. In the end, the whole thing is going to cost the company $6m in licences alone, not including my time to wrestle the product into something useful.

What they were trying to solve was that the existing process was... inelegant. I'm not going to defend it much; it wasn't a great system. It involved business units sending the content team a Word document with the changes they wanted, the content team polishing it up a bit, and then sending written-up HTML fragments to an external company, who turned it all into the various bits of WebSphere rubbish required to serve it. The whole process was ugly, but it more or less worked. I have no love for the backend either, but its quirks were well understood (i.e., we restarted the thing regularly and it behaved okay then).

The problem was that spending money with the external company wasn't seen in a good light. Hence, content must be manageable by the company itself. We were only spending about $120k a year with that external company, though, so the "savings" from running it ourselves were an interesting hole in the paper. (If you do the math, it would take 50 years or so just to recover the licence costs.)

Thankfully, this wasn't the only problem they'd identified. The process did involve a bit of lag, because of the content being handed from team to team, and making the business units manage their own content was going to help solve this. But no-one had actually talked to the business units about their needs, and I had some idea what those were: this Internet thing is mostly a distraction, and they don't want to hire people with the skills needed to manage it. When confronted, the author of the paper even acknowledged this, and admitted that they would need to hire more people for the shared content team. Another saving in the making.

What they really wanted, above all else, was to allow people to select products from different business units and pay for them in one go: a catalogue that spanned all of the business units. Once the "saving money" spin in the paper had been killed off, they explained that putting all of the various business units' products in one place would allow better cross-selling between units.

Sadly, that's what it boiled down to: it had nothing to do with saving money and everything to do with picking a very expensive framework and building a shopping cart out of it, for a starting cost of $6m. I wasn't very popular after saying that. Nor after asking why they had selected the product, only to find that their whole basis was "it was what we used at the last place". Some quality research there.

Sometimes it's a good thing that papers cross my desk which people wish hadn't.

To log or not to log

Don't you love the corporate policy:
Internet facing systems should retain all log files for a minimum of 60 days.
Sounds great, doesn't it? In theory, you could then see what has been going on.


Even better, move the log files off these devices to a much more 'secure' loghost.
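Simple enough in principle; here's a rough sketch of the classic way to do it, assuming the stock sysklogd of the era and a loghost imaginatively named loghost:

# /etc/syslog.conf on each device: throw everything at the loghost (UDP 514)
*.*     @loghost

# and on the loghost itself, syslogd has to be started with remote reception enabled
syslogd -r

Nothing fancy. The fancy part, it turns out, is having somewhere to put it all.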

Okay, now we're talking!

So now we've got 30+ devices (firewalls, SMTP routers, proxy servers, SOCKS proxies, etc.) all logging to one box.
Just how much space do you think you'd need?

Let's just check:

Filesystem Size Mounted on
/dev/sdb1  68G  /var/log

So apparently it's less than 70GB.

Q: How much do we log daily?
A: ~12GB a day per device. (And yes, they have turned on full debugging!)

Hmm... don't do the maths: 30 x 12GB...
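For anyone who does want to do the maths, assuming those numbers hold:

$ echo '30 * 12 * 60' | bc
21600

That's roughly 21.6TB to meet the 60-day retention policy, on a 68GB filesystem. At 360GB a day, the loghost fills itself up in under five hours.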

Managers now wonder why we get paged out multiple times a night to fix the mess.
Easy answer, you say: add more disk.

You would think. It was raised 6 months ago... and apparently the purchase order was 'being raised'.
We've been given implicit instructions that we are not allowed to delete anything, or even turn off the full debugging.

Even worse, the box wasn't set up with LVM, RAID, or anything else remotely useful.
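For the record, carving /var/log out of LVM in the first place would have made 'add more disk' a non-event. A rough sketch of how that could look, with device names and sizes entirely invented:

pvcreate /dev/sdb1 /dev/sdc1
vgcreate vg_logs /dev/sdb1 /dev/sdc1
lvcreate -L 500G -n varlog vg_logs
mkfs.ext3 /dev/vg_logs/varlog
mount /dev/vg_logs/varlog /var/log

# and when it inevitably fills up, bolt on another disk:
pvcreate /dev/sdd1
vgextend vg_logs /dev/sdd1
lvextend -L +500G /dev/vg_logs/varlog
resize2fs /dev/vg_logs/varlog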

Q: So where are we now?
A: The 'work around' we've been instructed to use: copy the data onto other, non-loghost production machines... so the loghost is now constantly splattering logs across a host of other boxes. And no, we haven't been able to use any network-mounted filesystems... so it's scp'ing the stuff over (stuff that no-one ever actually bothers to read anyway).
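If you're wondering what that looks like in practice, picture something along these lines wedged into cron on the loghost (hostnames and paths invented, naturally):

#!/bin/sh
# shovel anything that's been sitting around for more than an hour
# onto whichever production box still has some space left
for f in `find /var/log/remote -name '*.log*' -mmin +60`; do
    scp "$f" appserver03:/data/log-overflow/ && rm "$f"
done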

Not so sagely.

As Oracle clutch at more and more straws, they come up with some amazingly interesting solutions. Partition table issues? No problem.

dd if=/dev/zero of=/dev/sd? bs=1024 count=1000
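(For those playing along at home: that zeroes out roughly the first megabyte of the disk, partition table and all.)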

I really wish I was kidding. And their suggestion when oracleasm was still rejecting disks whose partition tables it didn't like?

oracleasm force-renamedisk /dev/

Saturday, August 18, 2007

Everything has to have a beginning

It happened to me this week.

I was making a routine scheduled change, and suddenly things started to go very, very badly. A database cluster in smoking ruins at my feet, caused by the conflict between what the sysadmin knows is realistic and what the customer thinks should be possible.

After all, they got it working in their Ubuntu virtual machine and that's practically the same as a production system, right?

Every sysadmin knows the moment. You watch the output scrolling across the screen and something nasty catches your eye. Your heart leaps into your throat and your stomach sinks. This moment is best described as "Oh shit".

This blog is for everyone who's ever had that moment at 3am sitting in a cold, dark server room squinting to see on the ancient CRT attached to the KVM. Let the war stories commence.