Tuesday, March 25, 2008

Little piggie, little piggie let me in.

I love how BigCorp(tm) think it's a great idea to use a Windows domain controller (ADS/KRB5) to authenticate their Linux users against.

What a marvelous idea! It means we can all have a single password throughout the organization!

It sounds great in a perfect world, where:

  • Networks/interfaces don't fail.
  • Accounts are not locked out when a user attempts to authenticate more than once every 5 seconds (really nasty when attempting to do something like: for i in `cat hosts.txt`; do ssh $i /bin/something; done )
  • Machines and the DC always agree on the time (particularly across large subnets, regions, or physical locations).
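
For what it's worth, the lockout on rapid logins can be dodged by pacing the loop. What follows is only a rough sketch, not something BigCorp(tm) blessed: the function name run_throttled and the SSH_CMD override are made up for illustration, and the delay you need depends on whatever lockout window your DC actually enforces.

```shell
#!/bin/sh
# Run a command on every host listed in a file, sleeping between hosts so
# that consecutive logins are spaced further apart than the lockout window.
# Usage: run_throttled HOSTS_FILE DELAY_SECONDS COMMAND [ARGS...]
# SSH_CMD is a hypothetical override (defaults to ssh), handy for dry runs.
run_throttled() {
    hosts_file=$1
    delay=$2
    shift 2
    first=1
    while IFS= read -r host; do
        # Skip blank lines in the hosts file.
        [ -n "$host" ] || continue
        # Pause before every host except the first one.
        if [ "$first" -eq 0 ]; then
            sleep "$delay"
        fi
        first=0
        # Redirect stdin so the remote command cannot eat the rest of the
        # hosts file that the while loop is reading.
        ${SSH_CMD:-ssh} "$host" "$@" </dev/null
    done < "$hosts_file"
}
```

Something like `run_throttled hosts.txt 6 /bin/something` would then leave six seconds between logins, and `SSH_CMD=echo run_throttled hosts.txt 0 /bin/something` just prints what would have been run. It does nothing for you once the network to the DC is gone, of course.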

The one that gets me....
  1. Lose connectivity to the subnet that contains the Windows Domain Controllers.
  2. Customer raises issue 'Can't login'.
  3. Customer expects us to 'fix the issue'.
  4. We can't even log in (even on the console as root with a local password), as the PAM config specifies it needs to check the KRB5 realms.
  5. Customer gets narky.
  6. Customer is aware of the issue, but refuses to acknowledge it as a problem.

The solution... sit it out and hope the network comes back. Failing that... a reboot using the boot option of 'single'. That's if the customer allows you to reboot the machine.

The joys of corporate stupidity. *sigh*

Thursday, March 6, 2008

1+1= ?

16GB of swap space required.

15GB SSD as the only onboard disk.

Are you sure you don't see something wrong with this picture?

Wednesday, January 23, 2008

I just don't believe you

Somehow I find it very hard to believe that you did not realise at any point while creating the severity 2 ticket in our trouble ticketing system that this action was going to page out the sysadmin on call. I find it even harder to believe that you would be surprised they would get upset with you about this on finding out that the issue was not an urgent one but rather an on-going issue you'd been experiencing for months that you wanted some data collected on.

Right now if someone would invent me stab-over-ip I'd bake them cookies.

Monday, January 21, 2008

Epic Fail

I discovered today when I picked up the pager for my on-call duties this week that I've been deleted from a certain customer's trouble ticketing system, along with about 100 random users. This is going to make it slightly difficult to respond to tickets.

Only slightly.

Thursday, December 20, 2007

Production changes

It's 3pm. You receive an email demanding a bunch of changes to production systems. There's no plan attached, or any detail other than "apply things". "Apply things" turned out to be more complex when it was done in development, but still there's no plan, and nothing learnt from that exercise, attached to doing the same thing in production. And it's scheduled for 5pm.

Sometimes I really wonder why $customer doesn't understand that the reason their environment is often broken and misbehaving has something to do with How They Demand It Is Run.

Wednesday, December 5, 2007

How not to run a project.

Ahh big projects! We all love them.

It's where all the managers feel they have something to poke, whilst techies get to play with shiny new toys... or so it goes.

I'm working on one such big project.

$customer has decided the way to manage their big project (in the multi-billion-dollar range) is to flick it all off to a bunch of consulting firms, with little or no direction... and let it run.
Even better... they are replacing their entire business systems.

Now I might not be a big fat CEO or even a CIO but I think these are things you normally don't do on a project.

  • Define no project milestones or determine what is a success on the project.
  • Have no backout strategy
  • Ensure you can't run the new and old systems in parallel... because the old hardware is EOL and won't support many of the firmware updates on attached gear.
  • Ensure you upgrade both your storage systems and backup software right in the middle of the data import. No chance of restoring.
  • Performance testing completes 3 weeks after production roll out.
  • All tools and processes will not work in the new environment. This includes monitoring, backup, and agents.
  • When hit with a major risk, respond with 'continue as normal'.

I'm starting to take bets that this ends up in the papers and falls down in a screaming pile of you-know-what.

What do you think?

Thursday, November 29, 2007

Insert tab A into duck 7.

For the most part, documentation is something that you really want to see written and followed, and ideally kept up to date.

But what happens when you have bad documentation? You get hilarity.

$otherTeam was following documentation on how to set up their application. Alas, the documentation seemed to be written to assume no-one could ever resize disks, and required that the installer unmount /tmp and then symlink it deep into application land like /var/application/fluff/bits/things/tmp. We were using Kerberos logins, which require a writable /tmp. So $otherTeam unmounts /tmp and boom, no-one can log in any more. And then they exit their shell.

Oops.