the meat grinder


Having just been on a CAB call with over 60 people, running through a list of over 400 items, I’m reminded why I try really hard not to work in places where these things happen.

When you work in a truly agile workflow you don’t need these, because a good agile workflow can fully replace a traditional CAB if it is cross-disciplined throughout the business.

At the end of several hours, everyone who was speaking sounded dispirited and thoroughly pissed off – including the leader on the call. That sucks. If you’re the leader and the meeting is depressing you – imagine what everyone else feels like.

Seriously – stop having CABs. But if you really have to have one, these points may help you:

  • Circulate the CAB items early. Anything with a LOW or NONE impact rating shouldn’t be discussed – it should be automatically approved unless someone wants to call it out during the CAB.
  • Don’t have CABs that last more than an hour at the very most.
  • Group your changes by impacted areas so you can release people quickly.
  • Don’t speak over someone when they’re speaking – especially if you’re leading the CAB.
  • Don’t get pissed off at people on the call – that’s unprofessional and upsets everyone on the call.
  • Build a cadence to your voice and maintain it. Humans take their cues from a leader of a group – be a good positive leader, not one that sounds like they don’t want to be there.
  • Use a good online communication tool that works for everyone – bad quality voice or video adds an extra cognitive load where enough already exists.
  • Stop having CABs. Seriously.

Upgrading a RHEL7 Azure Image


Why We Do This

The default image provided as the RHEL7 image from Azure Marketplace is woefully out of date. The kernel is 3.x and has some critical CIFS bugs which will cause issues with Azure Files. A number of critical sub-systems are out of date and need to be upgraded.

What Could Go BANG!

The new kernel is NOT a RedHat kernel, so you may get push back from RedHat on support for it if you start hitting edge cases. However, we use the cloud. If you’re tickling kernel bugs and no one else on the Internet is – you’re doing something horribly wrong. Rebuild.

3rd party software support may also push back on a non-standard kernel. Push back on the push back!

Safety First

If in doubt, please back up your VM before you do this. If this fails it’s VERY hard to get the machine back into a production state.

How we do this

Update the base platform from the RedHat Repo

  • Login and sudo su to root.
  • Change to the / directory
  • Update yum with yum update && yum upgrade
  • Accept the defaults and allow the update to happen – this may take some time, so do something else in the meantime.
  • Install yum-utils with yum install yum-utils and then run package-cleanup --oldkernels
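The steps above can be sketched as a single function. This is only a sketch under stated assumptions: DRY_RUN and the run helper are hypothetical names introduced here so the commands can be reviewed before they touch the machine, and the -y flag is an assumption standing in for "accept the defaults".

```shell
#!/usr/bin/env bash
# Sketch of the base-platform update as one function.
# DRY_RUN is a hypothetical switch, not part of yum: when it is 1 (the default
# here) each command is printed instead of executed.

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

update_base_platform() {
  cd / || return 1
  run yum -y update                 # -y answers yes to prompts (assumption; drop it to confirm by hand)
  run yum -y upgrade
  run yum -y install yum-utils
  run package-cleanup --oldkernels  # package-cleanup is provided by yum-utils
}
```

Calling update_base_platform on its own prints the four commands. Running it as root with DRY_RUN=0 executes them.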

Upgrade the kernel from elrepo

  • Import the ELRepo public key with rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
  • Install the RHEL7 repo with rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-3.el7.elrepo.noarch.rpm
  • Install the new kernel with yum --enablerepo=elrepo-kernel install kernel-ml
  • Check which menuentry is the new kernel with awk -F\' '$1=="menuentry " {print i++ " : " $2}' /etc/grub2.cfg and make a note of that number
  • Set that number as the default kernel with grub2-set-default <<NUMBER>>
  • Regenerate grub with grub2-mkconfig -o /boot/grub2/grub.cfg
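Reading the menuentry number off by eye is easy to get wrong, so here is a minimal sketch that looks it up instead. It reuses the same awk field split as the command above. The function name find_menuentry_index is hypothetical, and matching on the string "elrepo" assumes the kernel-ml entry carries that string in its title, which you should confirm against your own grub configuration.

```shell
#!/usr/bin/env bash
# Print the zero-based index of the first menuentry whose title contains
# the given string. Exits non-zero if no entry matches.
find_menuentry_index() {
  cfg="$1"
  pattern="$2"
  awk -F\' -v pat="$pattern" '
    BEGIN { n = 0 }
    $1 == "menuentry " {
      if (index($2, pat) > 0) { print n; found = 1; exit }
      n++
    }
    END { if (!found) exit 1 }
  ' "$cfg"
}
```

Used with the steps above, that might look like grub2-set-default "$(find_menuentry_index /etc/grub2.cfg elrepo)" followed by the grub2-mkconfig call.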

Update any encryption patches and dracut

  • Create a file called yumupdatefix.sh in /root with the following contents
#!/usr/bin/env bash

# get and test path to source of most recent install
unset -v ADE_SOURCE
ADE_SEARCH=/var/lib/waagent/Microsoft.Azure.Security.AzureDiskEncryptionForLinux*
for i in $ADE_SEARCH; do
  if [[ -d "$i" ]]; then
    [[ "$i" -nt "${ADE_SOURCE}" ]] && ADE_SOURCE=$i
  fi
done

if [[ ! -d "${ADE_SOURCE}" ]]; then
  echo "patch failed - no source directory found matching ${ADE_SEARCH}"
  exit 1
fi

# get and test path to patch file
ADE_PATCH=${ADE_SOURCE}/main/oscrypto/rhel_72/encryptpatches/rhel_72_dracut.patch
if [[ ! -f "${ADE_PATCH}" ]]; then
  echo "patch failed - no patch file found matching ${ADE_PATCH}"
  exit 1
fi

# replace the ENCRYPTED_DISK_PARTITION placeholder with the partition number (=2 in rhel azure gallery images)
sed -i.bak s/ENCRYPTED_DISK_PARTITION/2/ "${ADE_PATCH}"

# patch and run
bash -c "set -e; patch -b -d /usr/lib/dracut/modules.d/90crypt -p1 < ${ADE_PATCH}"
/usr/sbin/dracut -I ntfs-3g -f -v --kver "$(grubby --default-kernel | sed 's|/boot/vmlinuz-||g')"

The above script will generate a new initramfs image corresponding to the new kernel version with the patched up version of Dracut modules – getting warnings is FINE. Really.

  • Enable execution of this script with chmod a+x yumupdatefix.sh
  • Execute this script with ./yumupdatefix.sh
  • Ensure that the image it builds is the same version of the kernel-ml that was installed.
  • Reboot with reboot
  • Pray or even Prey
  • When the machine comes back up (it should take a few minutes), check the new kernel with uname -msr
  • Profit.
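The "same version" check before the reboot can be scripted rather than eyeballed. The sketch below assumes dracut wrote its image to the usual /boot/initramfs-<version>.img location. Both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Strip the /boot/vmlinuz- prefix from a kernel path, as the sed call in
# yumupdatefix.sh does.
kver_from_path() {
  echo "${1#/boot/vmlinuz-}"
}

# Succeed only if an initramfs image for the given kernel version exists.
# The second argument (boot directory) defaults to /boot.
initramfs_exists() {
  kver="$1"
  bootdir="${2:-/boot}"
  [ -f "${bootdir}/initramfs-${kver}.img" ]
}
```

Before rebooting, something like initramfs_exists "$(kver_from_path "$(grubby --default-kernel)")" || echo "no initramfs for the default kernel" gives an early warning while the machine is still reachable.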

References

http://elrepo.org/tiki/tiki-index.php
http://elrepo.org/tiki/kernel-ml
https://blogs.msdn.microsoft.com/azuresecurity/2017/07/13/applying-updates-to-a-encrypted-azure-iaas-red-hat-vm-using-yum-update/
https://gist.github.com/mayank88mahajan/38faf934c86b89ad766c4c16dcd5f4aa

 

Catching The Cat


On the subject of encryption that’s getting so much press, one of my friends asked me what the PM was thinking. While we don’t actually call each other up and thus I have no idea what she’s thinking at any given moment, I may have an idea of how she’s thinking:

Simply put, I suspect it’s a call for options. When you come up against an intractable problem you begin with an impossible answer. It’s an old methodology, one that most of us were taught in school because we grew up before we could Google for everything. To get people out of their comfort zone you have to push them in unreasonable directions. I expected various technology groups to come up with options but so far all we have is people screaming and making a lot of noise.

One of the best ways to stop insurgents operating in this or any country is to disrupt communications – that’s hard to do because of encryption.

Encryption has been the mainstay of much of geopolitics, commerce and humanity since the dawn of the common era. It is a boon and a curse. Over the last fifty years we have become extremely dependent on it and its usefulness. However, all these technologies of today were invented in isolation from reality in a past sure of the goodness of all men.

The problem is that those charged with developing these protocols have become used to the constraints of the technology and we need to think beyond them. At one time these technologies were the privilege of the developed world. However, as ubiquitous technology opens the doors to more and more people, our enemies use these techniques against us. The answer so far from the technology community is “it can’t be done”, where “it” refers to back doors in encryption. That’s not an acceptable answer because it’s not addressing the question.

When the government drafts an outrageous bill it’s looking for constructive responses. It’s asking for more effort from the subject matter experts to evaluate the real objectives.

The very idea of encryption exists because we don’t trust anyone. Thus it’s impossible to accept that we should allow those who work against us to use our own technology against us.

While the ultimate decryption key is a sharp knife to a nerve cluster, that kind of behavior applied wholesale leads to a dark and dismal future and isn’t always a viable option. We’re still waiting for our technology experts to come up with an answer but many seem so enamored with their toys they can’t see past them.

Encryption is a tool.

A tool is used to execute an answer to a question.

The question is Security.

Isn’t it?

There’s an interesting blog post by Mythic Beasts on why encryption is vital. They seem to be missing the point. Everyone knows encryption is vital to the continued economic deliverable of the Internet as well as basic technology security. While this blog post is an obvious political statement, we were rather hoping for options. Turning around and telling the wider society that the cat is out of the bag and that’s just tough is a stupid and arrogant thing to do. We’ve unleashed this double-edged sword and we can’t put it back in the sheath but we must have more of an answer if we’re not to look like complete idiots to the rest of society. Like a child that spills their milk but just pouts and won’t clean it up.

When we look for an unreasonable answer, this kind of response isn’t what we’re expecting from people who should know better how to handle intractable problems.

So far few options have been provided, which seems to suggest everyone is happy with the knife and nerves option.

Which is dumb.

So here, in clear and plain terms, is the question:

Given that encryption is easy to acquire and utilize, and given that our enemies have access to the same technologies as us, what are the options available to our society to ensure we are able to disrupt the encrypted command and communication channels our enemies use whilst maintaining our freedom to use it?

We all know we can’t put the cat back in the bag. I refuse to believe that a trillion dollar discipline such as ours can’t come up with some feasible answers that don’t involve the road to perdition.

If this is too tough a question for us, perhaps we’re not really worth the fuss.

Engineering Small


Stay focused on the engineering reality defined by your capabilities and requirements. Line up the engineering tasks with the business value. A shift in the engineering tasks should be a conscious decision as a result of conversations between the technical and business leaders in order to better deliver on the business value. Technical debt can be discarded when the engineering objectives change. Remember all technology is transient and that pride cometh before a fall.

it’s never happened here


[Image: screen capture of the message “its never happened”]

There are some things that I dislike. The image above is a screen capture from a conversation I was part of recently, and it was said by someone whom I respect deeply. On this point, however, we disagree deeply.

Modern DevOps culture demands a great deal of pragmatism. In particular it demands that you don’t solve problems before they become problems. It’s the benefit of working in a non-critical environment. We’re told that ‘just-in-time’ is the best fit for fast iteration and product development and who can argue with success? I wonder, however, whether we’re losing a key component of good Engineering, that of forethought.

The conversation was around GitHub protected branches and whether they were a good idea for some or all of our repos. I think they are a good idea for all repos and selected branches, especially master, and they don’t take long to set up nor need much in the way of maintenance or documentation. I can also understand my colleague’s position that there’s no need to go and enable this everywhere, as nothing bad has happened yet that would suggest this would be a good protective measure.
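To give a sense of how little setup is involved, here is a rough sketch that protects a branch through GitHub’s REST API (the PUT /repos/{owner}/{repo}/branches/{branch}/protection endpoint). The rule set shown (one approving review, enforced for admins too) is only an illustration, not what we run. The function names, the owner and repo arguments and the GITHUB_TOKEN variable are placeholders.

```shell
#!/usr/bin/env bash
# Minimal protection rules: require one approving review, apply the rules to
# admins as well, no required status checks and no push restrictions.
protection_payload() {
  cat <<'EOF'
{
  "required_status_checks": null,
  "enforce_admins": true,
  "required_pull_request_reviews": { "required_approving_review_count": 1 },
  "restrictions": null
}
EOF
}

# Apply the payload to one branch (defaults to master).
# GITHUB_TOKEN is a placeholder for a token with admin rights on the repo.
protect_branch() {
  owner="$1"
  repo="$2"
  branch="${3:-master}"
  protection_payload | curl -sS -X PUT \
    -H "Accept: application/vnd.github+json" \
    -H "Authorization: Bearer ${GITHUB_TOKEN}" \
    "https://api.github.com/repos/${owner}/${repo}/branches/${branch}/protection" \
    -d @-
}
```

Looping protect_branch over a list of repos would cover the “enable this everywhere” case in a few lines.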

To me that’s a bit like putting the safety on after pulling the trigger. The thought’s there but the execution is suspect. It’s too Dev and not enough Ops for me.

In operations/SRE or whatever you call it these days, the paramount responsibility is safety. Safety leads to uptime. Uptime leads to sales. Sales lead to money. Money leads to wages. Wages lead to whiskey. Operations engineers are there to make sure that the business is supported and protected at all times. Operations engineers are there to ensure that engineering is able to deliver but that what’s delivered is supportable and operable on an on-going basis – even if the entire engineering team changes. There is, I think, a fine distinction here between enabling the business and enabling engineering teams. Those two enablements often require subtle differences in approach, tact and execution. It’s the fine line that sometimes trips you up.

On this occasion I left this well alone, no one needs this kind of debate on a Friday and it’s a small thing. But small things have undone great enterprises before so in the back of my head a warning bell will sound all weekend! I like to think I allow teams I manage a great deal of autonomy and the operations team haven’t expressed any concerns. Whether that’s because it’s easier not to is up for debate – but not today.

senior platform engineer


I am an Engineer

I work for TES Global. TES Global’s story is an extraordinary one: its digital community on TES Connect is one of the fastest growing of any profession globally, and it boasts a 100-year heritage at the centre of the teaching and education community, with offices in London, San Francisco and Sydney. Today, TES Global has over 6.9 million teacher members in 197 countries across the globe and connects more than 72 million teachers and students. Up to 6.3 million resources are downloaded from the site a week, 13 a second. Home to more than 800,000 individually crafted teaching resources developed by teachers for teachers, this unparalleled collection helps to guide, inform and inspire educators around the world.

The operations team at TES plays a big part in ensuring our platform is kept alive and healthy as well as working very closely with a range of developers to deliver one of the smoothest delivery pipelines I’ve had the privilege to work with. I’m now looking to expand my team and am looking for an individual who has current permission to work in the UK and will not require visa sponsorship to join us as a Senior Platform Engineer (SPE).

SPEs are responsible not only for the underlying hybrid cloud infrastructure but also for networking, security, performance, automation, integration and education of the wider internal developer community in the best practices of using our infrastructure and services. They are the last but one line of support and assistance and as such are expected to understand their subject matter deeply and thoroughly. It can be an incredibly challenging role but one that we all feel is worth doing and worth doing well. We all work in an Agile manner and believe in effective communication as the single most important tool at our disposal.

The TES platform lives within the AWS public cloud and an internal Xen based stack. SPEs are expected to be deeply familiar with both these stacks and to provide interfaces for developer operations that are seamless between the two.

We make extensive use of Docker, Ansible and Python and our SPEs are expected to be familiar with these technologies and to become extremely proficient with them. You’ll be supported each step of the way on that journey.

We work closely with the wider engineering teams to share our experiences and knowledge to continuously improve the technical operations procedures, tools and approaches throughout the business.

What’s it like to work here?

We release regularly and frequently here at TES and that means our systems and our infrastructure have to be as dynamic and robust as the engineers creating the features. This isn’t a 9 to 5 job and we work the job and not the clock. TES values its people highly so there are various social events regularly scheduled to help maintain a healthy atmosphere and a sense of community. We try to hire clever people to do simple things and so make the complex seem mundane.

We constantly look at ways we can improve how we work to cooperate and tune the process to the people rather than the other way around. We don’t believe in blindly following doctrine but encourage teams to arrive quickly and effectively on ways of working that help deliver the quality our customers have become accustomed to.

We’re no strangers to trying edge things and our CTO, Clifton Cunningham, allows us all the freedom to experiment responsibly in order to drive our innovation. Many of us have worked with him a number of times before. There’s a reason for that.

Responsibilities

  • Providing input into any new solutions and for any enhancements of existing ones.
  • Working with Agile development teams in a web facing environment and mentoring junior developers in the DevOps process.
  • Driving and owning relationships with multiple suppliers and internal teams.
  • Driving continuous improvement across multi-disciplined teams.
  • Helping to establish, renew and run processes for on-going maintenance and monitoring operations within the platform.
  • Providing out of hours on-call services on a rota basis.

Candidates should have experience of working in large scale, highly available enterprise environments and a demonstrable ability to contribute to a large scale Python project. Experience should include some or all of the following technologies:

  • Ansible
  • Python
  • NodeJS
  • Cassandra
  • ElasticSearch
  • RabbitMQ
  • Redis
  • Logstash
  • Kibana
  • Bash
  • Linux Networking
  • Virtualization
  • Hadoop / RDS / MapReduce
  • Git / npm / dpkg / apt
  • Docker / Docker Swarm
  • Rancher / Mesos / Marathon / Kube

Application Process

If you’re looking for somewhere to grow your skills and career and not just another job where you can hack then we’re probably going to get along fine. We’re very interested in any female engineers who might want to consider this role to help bring balance to the force.

There will be a technical test to pass as well as some tests that you won’t think a technical role would require. We’re looking for the whole of you and not just your technical kudos so come with an open mind.

If you’d like to talk about this role then please get in touch with me at khushil.dep@tesglobal.com with a current CV and GitHub account.

NO AGENCIES.