System administration is tough. HPC system administration is tougher. Most of the time you have to juggle computing, storage, networks, software and code to get the required performance (with 100% availability and security, of course). Do you have a screwdriver and some chewing gum? Technology is only one leg of the tripod. You also have to deal with users (who always want more), bosses (who always want to pay less) and consultants (who always want your money). People have to be sheph…sorry, managed properly. And don’t forget that every big change you make to your HPC infrastructure can (and should) be treated as a project.
Although project management can be hell, done well it can also be the key to heaven (or at least to some zen-like inner peace). I’ve been managing a mid-sized HPC cluster for more than 10 years, and I’d like to give back some tips and tricks learned (most of the time by trial and error, or by utter failure) to make this challenging task lighter. The tips will be split 50/50 between technology and management, with black humour all around.