Thursday, December 20, 2007

Production changes

It's 3pm. You receive an email demanding a bunch of changes to production systems. There's no plan attached, nor any detail beyond "apply things". Applying those things turned out to be more complex than expected when it was done in development, but none of the plans or lessons learnt there are attached for doing the same thing in production. And it's scheduled for 5pm.

Sometimes I really wonder why $customer doesn't understand that the reason their environment is often broken and misbehaving has something to do with How They Demand It Is Run.

Wednesday, December 5, 2007

How not to run a project.

Ahh big projects! We all love them.

It's where all the managers feel they have something to poke, whilst techies get to play with shiny new toys... or so it goes.

I'm working on one such big project.

$customer has decided the way to manage their big project (in the multi-billion-dollar range) is to flick it all off to a bunch of consulting firms with little or no direction... and let it run.
Even better... they are replacing their entire business systems.

Now, I might not be a big fat CEO or even a CIO, but I think these are things you normally don't do on a project:

  • Define no project milestones, and no criteria for what counts as success.
  • Have no backout strategy.
  • Ensure you can't run the new and old systems in parallel, due to EOL hardware on the old side which won't support most of the firmware updates on the attached gear.
  • Upgrade both your storage systems and your backup software right in the middle of the data import, leaving no chance of restoring.
  • Schedule performance testing to complete three weeks after the production rollout.
  • Ensure none of the existing tools and processes work in the new environment, including monitoring, backup, and agents.
  • When hit with a major risk, respond with 'continue as normal'.

I'm starting to take bets that this ends up in the papers and falls down in a screaming pile of you-know-what.

What do you think?

Thursday, November 29, 2007

Insert tab A into duck 7.

For the most part, documentation is something that you really want to see written and followed, and ideally kept up to date.

But what happens when you have bad documentation? You get hilarity.

$otherTeam was following documentation on how to set up their application. Alas, the documentation seemed to have been written on the assumption that no-one could ever resize disks, and required that the installer unmount /tmp and then symlink it deep into application land, like /var/application/fluff/bits/things/tmp. We were using kerberos logins, which require a writable /tmp. So $otherTeam unmounts /tmp and boom, no-one can log in any more. And then they exit their shell.

Oops.
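For reference, the documented procedure amounted to something like this - the symlink target is straight from the doc, but the exact commands are my reconstruction:

umount /tmp
mkdir -p /var/application/fluff/bits/things/tmp
rmdir /tmp
ln -s /var/application/fluff/bits/things/tmp /tmp
chmod 1777 /var/application/fluff/bits/things/tmp

The moment /tmp stops being a writable directory, kerberos logins (which stash their credential caches there) stop working - and if you've already exited your shell, there's no easy way back in to finish the job.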

and all the cake is gone

The project I'm currently being punished with has gone wrong in far too many ways to count, but today's is extra fun.

Throughout the project we've been plagued with DNS issues - the customer manages their own DNS in this environment, and the server I'm currently setting up has had its IP address recycled from a recently decommissioned development box.

They updated the A record for the server when it was commissioned - but it seems they forgot all about the reverse. For a long time the A record and the PTR didn't match, which caused all sorts of grief with software that expected the reverse DNS to resolve to the name it had in its configuration.
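The mismatch is easy enough to see; a forward and a reverse lookup along these lines (names and addresses purely illustrative) tells the whole story:

dig +short newserver.example.com A    # returns, say, 10.1.2.3
dig +short -x 10.1.2.3                # but the PTR still answers with the old dev box's name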

We finally managed to convince the customer to fix up this issue. Except that it seems they've painted themselves into a corner.

KB: "Windows 2003 AD does not allow you to delete DNS names with uppercase letters"

They made the entry in all uppercase, and it seems that means we're basically stuck with it. Even better, while stuffing around today they managed to make it not resolve to anything at all, so now kerberos is refusing to work. No-one can log in. At all.

I'm going to go and have some coffee since I can't actually get onto this box to do any work. Yay!

Tuesday, October 23, 2007

Bright and shiny.

Unusually, my job has been relatively sane lately, leaving me a dearth of things to write about. Today, though, my main customer made me extremely nervous. There's a big project coming up to implement a certain kind of database cluster that's designed in a way I can only describe as 'special'. For a start, they think iSCSI is reliable and that a single path to the storage backend is acceptable in an otherwise fully redundant system.

Part of the project is going to include upgrading their current SAN. Something I wasn't going to mention in front of the customer is that the choice of SAN product is something I need to know as early as possible. Damningly, the SAN offerings recommended to them by my own company have serious known issues with the kind of systems I'm going to be deploying.

The customer representative looked at me quite seriously.

"We haven't decided yet. I'm going to see what all the vendors are offering and pick the shiniest."

I have a horrid feeling shiniest means "It had a full page spread in $industryMagazine!" and that this product is going to end up being bleeding edge enough to be a real devil to integrate.

Tuesday, September 18, 2007

New career time?

My last few weeks have been a hazy blur of working long hours and not getting enough sleep. My workload is rising to panic-inducing levels as a second customer elbows their way into my schedule. My manager wants to move me from a relatively sane customer to a really horrid, demanding one, and while that would give me a lot to draw on for this blog, really, I'd like a peaceful life. Honestly. I agree with Pratchett that 'May you live in interesting times' is one of the worst curses I can think of.

One incident stands out. There's a project that I'm not officially part of, but I occasionally get phone calls or emails from those who are, asking for a bit of help. One such phone call came in as I was trying to eat my breakfast at my desk. It seemed that one of the gentlemen down on $project, whom I'm going to call Basil, had rendered a system non-booting.

Apparently Basil did this by editing the fstab to add a new mount. Add the entry, reboot the system (I'm not sure why this was necessary) and bang: 'Unable to mount root fs'. While wandering around the office kitchen making a cup of miso and peeling my mandarin, I tried talking him through recovering the system. First there was the drama of appending boot options in grub. Then there was teaching him how to navigate when all he had was the initrd. I thought everyone knew that you could:

echo *

if you don't have a working ls.

Around the office people were smirking at my phone conversation, which went something like this:

"Ok, so you mentioned LVM in the boot options so I guess your root filesystem is in LVM? Right, have a look in /dev/mapper to see what you have there. No, we already established that you don't have ls. Right, either try to tab complete or use 'echo *'. e-c-h-o... got it? Yep. Cool. So now lets try to mount your root filesystem. No, you don't have an fstab so you can't just type 'mount /'. You'll need to type mount, then the full path of the device you've found in /dev/mapper, then a mount point.... right, yep, then a mount point.... where are you up to ? Ok, now you type a mount point.... ok, just type '/mnt' for me? Ok. Good. Now hit enter. What's wrong? What error does it give you? I understand it's not working but can you please tell me what the mount command printed on your screen?"

"... ah yes, so correct spelling is not optional."

We eventually got a root filesystem mounted, he commented out the new mount he'd added to the fstab, and we managed to get the system booted. To this day, though, I still can't figure out how he broke it. He said he'd typoed the name of the mount point, but I just can't see how, unless the typo was / $name, with a space. That would do it.
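For anyone who ends up at the same prompt, the recovery boiled down to something like this (the volume group and device names here are illustrative, not what was actually on the box):

echo /dev/mapper/*                  # poor man's ls: see which LVM devices exist
mount /dev/mapper/vg00-root /mnt    # mount the real root filesystem
# edit /mnt/etc/fstab with whatever editor the initrd gives you,
# comment out the offending entry, then reboot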

Basil isn't as stupid as this post makes him sound - he's actually a pretty smart guy, but in completely the wrong role on $project, which gives him plenty of opportunity to look extremely dumb. I think we'll be seeing more of Basil on here before the project is over.

Outage Window

I have a 2-hour outage window. Another company also needs to make changes at the same time, because I have control of half of the thing and they have control of the other half. It's clear to both of us that we have two hours to make the changes. The outage begins, and we both make our changes.

When I call to roll back the changes, still inside the outage window, they announce they've gone home. And it'll take them half an hour by car to get back in to undo the changes.

Sometimes it would just be easier if you could do it all yourself.

Monday, September 3, 2007

One of those weeks.

It's been one of those days where every time I get up to go to the bathroom, I come back to 3 missed calls from 3 different people, all wondering why their work isn't done yet. (Hint: it's all the time I spend talking to you on the phone! If you left me alone, think of all the extra time I'd have to do your work in.)

It's taken me 2 days so far to get access to an SSH gateway that allows me (eventually) into a certain customer's environment. For various reasons I need to get a 3GB database dump back to my local machine from a system nested behind 3 layers of NAT and only accessible through a certain chain of about 6 systems by SSH.

After constructing one of the most arcane ssh command lines I have ever seen, I discover that one server in the chain won't let me forward a port:

AllowTcpForwarding no
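For the curious, the chained forward I'd been building looks roughly like this, with fewer hops and friendlier names than the real thing (hostnames, port, and path all illustrative):

ssh -t -L 2201:localhost:2201 user@hop1 \
    ssh -t -L 2201:localhost:2201 user@hop2 \
        ssh -L 2201:dbserver:22 user@hop3

# then, in another terminal, pull the dump back through the tunnel:
scp -P 2201 user@localhost:/var/tmp/dump.sql.gz .

Which works beautifully right up until one hop in the middle has forwarding switched off.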

I think I'm going to burn someone. This particular server in the chain is really causing me some grief, given how ridiculously tightly it's locked down.

-bash: /bin/vi: Operation not permitted

Thanks guys. I really appreciate the way you help me do my job.

Sunday, August 26, 2007

The Six Million Dollar Shopping Cart

Corporate websites can be a minefield of managers and thinkers who honestly believe they know what they want, but very rarely seem to have actually used the Internet much at all.

Usually big changes to corporate websites follow a specific pattern:

1. PHB reads in a magazine that the hip new thing for companies to do is leverage A and B, in a mashed-up 2.5 Internet thingamajig. They've no idea what this all means, but hey, they don't want to be left behind in the rush of companies getting on the next big thing.
2. PHB tasks one of their minions to go forth and investigate this. Options are considered, and a recommendation made.
3. PHB ignores the recommendation and chooses $vendor because they have a nice website and, anyway, we have a special relationship with them.
4. Various things happen, eventually resulting in poor sysadmins deploying an ill-defined system that never really works.
5. Go back to 1.

Occasionally, however, someone new starts and the cycle is disrupted. Even more unusually, sometimes the ideas the new person has cross my desk. This is usually a bad thing for them.

Glancing over the new person's proposal for redeveloping the corporate website, it strikes me how flimsy the whole paper is. It starts off by explaining the product they've chosen (without any comparison with other products), how they intend to structure the site, and how they plan to palm off content management to the business units. In the end, the whole thing is going to cost the company $6m in licences alone, not including my time to wrestle this product into something useful.

What they were trying to solve was that the existing process was... inelegant. I'm not going to defend it much; it wasn't a great system. It involved business units sending the content team a Word document with the changes they wanted, the content team polishing it up a bit, and then sending written-up HTML fragments to an external company, who turned it all into the various bits of WebSphere rubbish required to publish it. The whole process was ugly, but it more or less worked. I have no love for the backend either, but its quirks were well understood (i.e., we restarted the thing regularly and it behaved okay then).

The problem was that spending money with the external company wasn't seen in a good light. Hence, content must be manageable by the company itself. We were only spending about $120k a year with that external company, though, so the "savings" from running it ourselves were an interesting hole in the paper. (If you do the maths, it would take 50 years or so to recover just the licence costs.)

Thankfully, this wasn't the only problem they'd identified. The process did involve a bit of lag, because the content was handed from team to team. Making the business units manage their own content was going to help solve this. But no-one actually talked to the business units about their needs, and I had some idea what those were: this Internet thing is mostly a distraction, and they don't want to hire people with the skills needed to manage it. When confronted, the author of the paper even acknowledged this, and admitted that they would need to hire more people for the shared content team. Another saving in the making.

What they really wanted, above all else, was to let people select products from different business units and pay for them in one place. A catalogue that spanned all of the business units. Once the "saving money" spin in the paper had been killed off, they explained that bringing the products of all the various business units together would allow better cross-selling between them.

Sadly, that's what it boiled down to: it had nothing to do with saving money and everything to do with picking a very expensive framework and building a shopping cart out of it for a starting cost of $6m. I wasn't much liked after saying that. Nor for asking why they had selected the product, only to find their whole basis was "it was what we used at the last place". Some quality research there.

Sometimes it's a good thing that papers cross my desk which people wish hadn't.

To log or not to log

Don't you love the corporate policy:
Internet-facing systems should retain all log files for a minimum of 60 days.
Sounds great, doesn't it? You could, in theory, then see what has been going on.

Even better, move the log files off these devices to a much more 'secure' loghost.

Okay, now we're talking!
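In practice that's usually nothing fancier than a forwarding rule in each device's syslog.conf (hostname illustrative):

*.*     @loghost.example.com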

So now we've got 30+ devices (firewalls, SMTP routers, proxy servers, socks proxies, etc) all logging to one box.
Just how much space do you think you'd need?

Let's just check:

Filesystem Size Mounted on
/dev/sdb1  68G  /var/log

So apparently it's less than 70GB.

Q: How much do we log daily?
A: ~12GB a day per device. (And yes, they have turned on full debugging!)

Hmm... best not do the maths on 30 x 12GB a day going onto a 68GB filesystem...

Managers now wonder why we get paged out multiple times a night to fix the mess.
Easy answer, you say: add more disk.

You would think; it was raised 6 months ago... and apparently the purchase order was 'being raised'.
We've been given implicit instructions that we are not allowed to delete anything, or even turn off the full debugging.

Even worse, the box wasn't set up with LVM, RAID, or anything remotely useful.

Q: So where are we now?
A: The 'workaround' we've been instructed to use is to copy the data onto other, non-loghost production machines... so the loghost is now constantly splattering logs across a host of other machines. And no, we haven't been able to use any network-mounted filesystems... so it's scp'ing the stuff over (stuff that no-one actually ever bothers to read anyway).
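If you're wondering what that looks like in practice, it's roughly this sort of thing run from cron on the loghost (hostname and paths illustrative):

for f in /var/log/archive/*.gz; do
    scp -q "$f" appbox03:/data/log-overflow/ && rm -f "$f"
done

Moving the logs off the box doesn't count as deleting them, apparently.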

Not so sagely.

As Oracle clutch at straws more and more, they really do come up with some amazingly interesting solutions. Partition table issues? No problem.

dd if=/dev/zero of=/dev/sd? bs=1024 count=1000

I really wish I was kidding. Their suggestion when oracleasm still rejected disks whose partition tables it didn't like?

oracleasm force-renamedisk /dev/

Saturday, August 18, 2007

Everything has to have a beginning

It happened to me this week.

I was making a routine scheduled change, and suddenly things started to go very, very badly. A database cluster in smoking ruins at my feet, caused by the conflict between what the sysadmin knows is realistic and what the customer thinks should be possible.

After all, they got it working in their Ubuntu virtual machine and that's practically the same as a production system, right?

Every sysadmin knows the moment. You watch the output scrolling across the screen and something nasty catches your eye. Your heart leaps into your throat and your stomach sinks. This moment is best described as "Oh shit".

This blog is for everyone who's ever had that moment at 3am sitting in a cold, dark server room squinting to see on the ancient CRT attached to the KVM. Let the war stories commence.