PeteGooJekyll2024-01-30T18:43:31+13:00https://blog.petegoo.com/Peter Goodmanhttps://blog.petegoo.com/blog@petegoo.comhttps://blog.petegoo.com/2024/01/27/incident-response-roles2024-01-27T00:00:00+13:002024-01-27T00:00:00+13:00Peter Goodmanhttps://blog.petegoo.comblog@petegoo.com<p>Other parts of this series on Incident Response:</p>
<ol>
<li><a href="https://blog.petegoo.com/2023/12/06/so-you-need-an-on-call-team/">So you need an on-call team</a></li>
<li><a href="https://blog.petegoo.com/2024/01/17/incident-response-severity-levels/">Severity Levels</a></li>
<li>Incident Response Roles (this article)</li>
</ol>
<p>In this, the third part of the series on Incident Response, we’re going to cover arguably one of the most important aspects of incident response: distinguishing the roles that responders should assume.</p>
<p><img src="/images/2024/duke-of-york.jpg" alt="duke of york, belfast" /></p>
<h1 id="incident-commander">Incident Commander</h1>
<p>If you talk to almost any incident response team you will find that they commonly have identified the need for one person to orchestrate the efforts of the team handling the incident. In high-pressure environments there isn’t time for consensus driven decision making, random interruptions, or waiting for people to volunteer for tasks.</p>
<p>The Incident Commander is the person who is responsible for:</p>
<ul>
<li>Establishing lines of communication</li>
<li>Delegating tasks to other responders</li>
<li>Making decisions on severity, response, approach, lines of enquiry, and the response team</li>
<li>Situation Reports (SitReps will be covered in a separate post)</li>
<li>Concluding an incident</li>
<li>Evaluating the need for a blameless post-mortem and ensuring it is done</li>
</ul>
<p>You will see that I mention “Lines of Communication” and “Lines of Enquiry”. To me this is a great model for guiding the actions of an incident response team. Often we forget to validate our assumptions, explore other options, and communicate with the right people. For more details on this model read my earlier post - <a href="https://blog.petegoo.com/2023/02/22/incident-response-lines-of-communication-enquiry/">Lines of Communication and Lines of Enquiry in Incident Response</a>. I won’t cover those details here.</p>
<h2 id="delegating-tasks-to-other-responders">Delegating tasks to other responders</h2>
<p>One of the key responsibilities of the Incident Commander (IC) is to make sure that tasks are delegated and assigned to responders. It is important that the IC has enough space to perform the other responsibilities outlined here, so they should not overcommit themselves to too many incident response activities.</p>
<p>ICs need to be assertive in delegating tasks. They should try to avoid asking for volunteers. Nobody has time for that awkward silence while we wait for someone to put their hand up; instead, assign the task to the most appropriate person. It’s up to the IC to know each responder’s capabilities, what they are working on, and how to prioritise the incident response tasks.</p>
<h2 id="making-decisions-on-severity-response-approach-lines-of-enquiry-and-the-response-team">Making decisions on severity, response, approach, lines of enquiry, and the response team</h2>
<p>By this stage you should have identified severity levels that are important to your business; if not, read <a href="https://blog.petegoo.com/2024/01/17/incident-response-severity-levels/">part 2 in this series</a>. An Incident Commander should familiarise themselves with the Incident Severity Levels.</p>
<p>The severity level can change during an incident. The severity often dictates the appropriate level of incident response, so make sure you re-evaluate it regularly throughout the duration of the incident.</p>
<h2 id="the-response-team">The Response Team</h2>
<p>The Incident Commander chooses the members of the team.</p>
<ul>
<li>
<p>Do we need to bring new members into the team for more specialised expertise or to add more hands?</p>
</li>
<li>
<p>What specific roles are each of the team performing?</p>
</li>
<li>
<p>Are any members of the team fatigued, or do they have other commitments and need to be replaced?</p>
</li>
</ul>
<h2 id="identifying-the-incident-commander">Identifying the Incident Commander</h2>
<p>Not everyone will want to take on the responsibility of being an Incident Commander. It can be a stressful, tough (but rewarding) role, and some folks may not feel inclined or ready to take it on. Preferably there should be a list of pre-trained and/or experienced staff who have knowledge of the procedures and expectations of being an Incident Commander.</p>
<h1 id="operator">Operator</h1>
<p>The operator is the most obvious role in an incident response team. They are the person who is doing the work to resolve the incident. They are the person who is typing the commands, running the scripts, and making the changes.</p>
<p>At the beginning of an incident this may actually be the only role until we know that we have a significant enough incident to warrant an Incident Commander and Scribe.</p>
<p>This is often where people are most comfortable. Though be careful, without any designated Incident Commander you can only really have two Operators before things get really hairy.</p>
<h1 id="scribe">Scribe</h1>
<p>I always tell this story of how we identified the need for this role. It came from the behaviours of one of our QAs at a previous role. When we had an incident they would calmly slide over beside the folks involved and start taking notes in Sublime Text. At the time they used a plugin that noted the time beside each line. Later they would contribute during incident reviews by referring to the record of events.</p>
<p>Now we can just use Slack for this. On any incident, if there are any actions to be taken, make sure that someone performs the role of “scribe”. This is especially important if the response team is distributed and on a video call. Call out notes for the scribe to add to the record. For example:</p>
<ul>
<li>Observations that have been made</li>
<li>Actions we are taking</li>
<li>Expectations that we have of those actions</li>
<li>Assumptions we have made so far</li>
<li>Impact analysis</li>
</ul>
<p>Treat it like the court reporter in a trial. They are there to record the facts and observations, not to interpret them. The scribe should not be making any decisions or recommendations.</p>
<p>The output that the scribe produces will serve as the vital component of a Blameless Post Mortem - a true record of the timeline that will help us to understand what happened and why.</p>
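<p>The timestamped-notes habit the QA pioneered is easy to automate. A rough sketch in Python (the function name and log file path here are illustrative, not part of any particular tool):</p>

```python
from datetime import datetime, timezone


def log_note(note: str, log_file: str = "incident-log.txt") -> str:
    """Append a note to the incident log, prefixed with a UTC timestamp."""
    line = f"{datetime.now(timezone.utc).isoformat(timespec='seconds')} {note}"
    with open(log_file, "a") as f:
        f.write(line + "\n")
    return line
```

<p>Slack gives you this for free since every message is timestamped, but a plain append-only file like this works just as well when chat is part of the outage.</p>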
<h1 id="optional-roles">Optional Roles</h1>
<p>The following roles are not necessarily something you need but they can be useful in some situations.</p>
<h2 id="impact-analysis">Impact Analysis</h2>
<p>Impact Analysis can be a very detailed task. In special cases I have found it useful to spin out one or more people to gather this data, either for Customer Success to contact affected customers or to guide the response plan.</p>
<p>They will likely have to dig deep into observability tooling, logs, a data lake, production databases, etc.</p>
<h2 id="executive-communications">Executive Communications</h2>
<p>This is a little more common. It is an incredibly good idea to set the expectation with senior stakeholders that they should stay well away from the incident response team. Nobody needs the CEO/CTO rocking up into an incident response call and asking scary questions.</p>
<p>Typically the Incident Commander will handle these requests, but if the incident is large enough, is moving fast enough, and the number of stakeholders is large enough, it can be useful to have someone dedicated to this role.</p>
<h2 id="customer-communications">Customer Communications</h2>
<p>This is a very important role. If you have a Customer Success team then they should be the ones to handle this. If not then you should have someone who is responsible for communicating with customers.</p>
<h1 id="future-posts">Future Posts</h1>
<p>In future posts in this series we will cover:</p>
<ul>
<li>Situation Reports</li>
<li>Incident Response Playbooks</li>
<li>Reporting on Incidents</li>
<li>(Blameless) Postmortems</li>
<li>Paying people for on-call</li>
</ul>
<p><a href="https://blog.petegoo.com/2024/01/27/incident-response-roles/">Incident Response Part 3: Roles</a> was originally published by Peter Goodman at <a href="https://blog.petegoo.com">PeteGoo</a> on January 27, 2024.</p>
<p>Other parts of this series on Incident Response:</p>
<ol>
<li><a href="https://blog.petegoo.com/2023/12/06/so-you-need-an-on-call-team/">So you need an on-call team</a></li>
<li>Severity Levels (this article)</li>
<li><a href="https://blog.petegoo.com/2024/01/27/incident-response-roles/">Incident Response Roles</a></li>
</ol>
<p>In <a href="https://blog.petegoo.com/2023/12/06/so-you-need-an-on-call-team/">Part 1</a> of this series we covered why you might need an on-call team. In this post we will cover how to define severity levels for incidents. This is crucial in order to understand the impact of the incident on your customers and your business. This in turn will help your on-call team, your leadership team, and the rest of the business understand how to respond and how to communicate the incident.</p>
<p><img src="/images/2024/dangerous-for-swimming.jpg" alt="dangerous for swimming sign" /></p>
<h1 id="what-are-severity-levels">What are Severity Levels?</h1>
<p>Severity Levels are a way of describing how important an incident is to your business. This may mean that the impact to customers is high, or the impact to revenue is high, reputational damage is likely, or any other factor that is important in your context.</p>
<p>Typically these are numbers from 1 to 5 or more, with 1 being the most severe and 5 being the least severe. I think it is best to avoid words like “Critical”, “High”, “Medium”, “Low”, and “Informational” as these are all relative and can be interpreted differently depending on your background, experience and language/dialect.</p>
<p>Also, don’t make them zero-based. People will end up using the term P0 or Sev0 in extreme situations or in jest but there’s no need to formalise it, it will just confuse the non-tech people.</p>
<h1 id="why-is-it-important-to-define-severity-levels">Why is it important to define Severity Levels?</h1>
<p>Incident response can be an extremely fast moving and stressful situation. Often when we are in the throes of an incident it can be difficult to contextualise the level of response required in relation to the impact of the incident itself and even the potential impact of our actions. If you do it often enough you can easily fall into the trap of burning out your team or, in the other extreme, being overly complacent through familiarity.</p>
<p>Severity Levels help us to understand a very key aspect of incident response - who we should communicate the incident with and how. I talked about this at length in my post <a href="https://blog.petegoo.com/2023/02/22/incident-response-lines-of-communication-enquiry/">Lines of Communication and Lines of Enquiry in Incident Response</a>. I won’t repeat too much of that here but I encourage you to read that post if you haven’t already. In short, one of the most detrimental failures you can make in incident response is to fail to let the right people know that an incident is happening.</p>
<p>Severity levels can also impact how we respond to an incident. For example - it may be ok to leave an incident for a few hours or overnight if no customers are affected and we have a mitigation in place, or if the feature is seldom used. On the other hand we may decide for some incidents that we need to disable key features, communicate with customers, even sacrifice availability in very rare cases.</p>
<h1 id="how-do-i-define-severity-levels">How do I define Severity Levels?</h1>
<p>Severity levels are very likely unique to your context. The best way to define them is to sit down with representation from across the business - engineering, product, customer support, sales, legal etc and agree on what makes sense. You could use collaborative tools like Miro/Mural or a plain old whiteboard, then add cards for typical outages you have had in the past or can foresee in the future. Assign them to severity levels and then discuss and iterate until you have a set of levels that make sense to everyone.</p>
<p>What you are looking to end up with is a table much like the following:</p>
<table>
<thead>
<tr style="border-bottom:1pt solid black;">
<th>Severity</th>
<th>Sev 6</th>
<th>Sev 5</th>
<th>Sev 4</th>
<th>Sev 3</th>
<th>Sev 2</th>
<th>Sev 1</th>
</tr>
</thead>
<tbody>
<tr style="border-bottom:1pt solid darkgrey;">
<td style="border-right:1pt solid darkgrey;"><strong>Description</strong></td>
<td style="border-right:1pt solid darkgrey;">Internal Impact Only <br /><br /> No customers impacted</td>
<td style="border-right:1pt solid darkgrey;">Problems reported with non-core functions</td>
<td style="border-right:1pt solid darkgrey;">Customer confusion for a small subset of customers <br /><br /> Background jobs failing <br /><br /> Could become Sev 2/Sev 3</td>
<td style="border-right:1pt solid darkgrey;">Issue affecting a small group of customers <br /><br /> Redundancy loss with no impact <br /><br />Security near-miss</td>
<td style="border-right:1pt solid darkgrey;">Affects large number of customers or a Top 10 customer <br /><br />Functionality severely impaired</td>
<td>A serious event affecting most customers <br /><br /> Generally unavailable <br /><br /> Impairs ability to perform key tasks <br /><br /> Security event, e.g. breach/disclosure</td>
</tr>
<tr style="border-bottom:1pt solid darkgrey;">
<td style="border-right:1pt solid darkgrey;"><strong>Typical Examples</strong></td>
<td style="border-right:1pt solid darkgrey;">-</td>
<td style="border-right:1pt solid darkgrey;">-</td>
<td style="border-right:1pt solid darkgrey;">-</td>
<td style="border-right:1pt solid darkgrey;">-</td>
<td style="border-right:1pt solid darkgrey;">-</td>
<td>-</td>
</tr>
<tr style="border-bottom:1pt solid darkgrey;">
<td style="border-right:1pt solid darkgrey;"><strong>Response</strong></td>
<td style="border-right:1pt solid darkgrey;">-</td>
<td style="border-right:1pt solid darkgrey;">Inform Customer Success</td>
<td style="border-right:1pt solid darkgrey;">Inform Customer Success <br /><br /> Inform Engineering Leadership (VPE)</td>
<td style="border-right:1pt solid darkgrey;">Inform Engineering Leadership (VPE+CTO) <br /><br /> Implement in-product notifications of issue</td>
<td style="border-right:1pt solid darkgrey;">Notify Executive Leadership Team<br /><br />Raise Status Page</td>
<td>Notify Executive Leadership Team <br /><br />Notify Board<br /><br /></td>
</tr>
</tbody>
</table>
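<p>It can also help to encode the agreed levels next to your alerting tooling so the table and the code never drift apart. A minimal sketch, paraphrasing the example table above (the descriptions and response actions are the table's, not a standard; adapt them to your business):</p>

```python
from enum import IntEnum


class Severity(IntEnum):
    """1 is the most severe; deliberately not zero-based (no Sev0)."""
    SEV1 = 1  # Most customers affected, generally unavailable, security event
    SEV2 = 2  # Large number of customers or a Top 10 customer affected
    SEV3 = 3  # Small group of customers affected, redundancy loss, near-miss
    SEV4 = 4  # Customer confusion for a small subset, background jobs failing
    SEV5 = 5  # Problems reported with non-core functions
    SEV6 = 6  # Internal impact only, no customers impacted


# Response actions keyed by severity, mirroring the "Response" row above
RESPONSE = {
    Severity.SEV1: ["Notify Executive Leadership Team", "Notify Board"],
    Severity.SEV2: ["Notify Executive Leadership Team", "Raise Status Page"],
    Severity.SEV3: ["Inform Engineering Leadership (VPE+CTO)",
                    "Implement in-product notifications of issue"],
    Severity.SEV4: ["Inform Customer Success",
                    "Inform Engineering Leadership (VPE)"],
    Severity.SEV5: ["Inform Customer Success"],
    Severity.SEV6: [],
}
```

<p>Because the numbers sort naturally (lower is worse), "err on the side of caution" is just picking the smaller of two candidate levels.</p>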
<h1 id="how-do-i-evaluate-the-severity-level-of-an-incident">How do I evaluate the Severity Level of an incident?</h1>
<p>Keep a link to the above table in your incident response documentation. When an incident occurs, evaluate the impact of the incident against the table. If you are unsure, err on the side of caution and escalate to the next severity level.</p>
<p>Add examples that are relevant to you and your business. These should be regularly revised.</p>
<h1 id="how-do-i-balance-the-need-for-impact-analysis-with-the-need-to-respond-quickly">How do I balance the need for impact analysis with the need to respond quickly?</h1>
<p>It can be very difficult to evaluate the severity level of an incident in the heat of the moment. Often you are trying to prioritise stabilising the system over other seemingly non-critical tasks. This is one of the reasons why it is very useful to <a href="https://blog.petegoo.com/2023/12/06/so-you-need-an-on-call-team/">have more than one responder to an incident</a>. With multiple responders you can spin someone out to assess the impact of the incident and communicate with the rest of the business.</p>
<p>Failure to assess the severity level can result in substandard communication protocols, disgruntled customers, and even a lack of trust in the on-call team. Aspire to have very clear guidelines on what constitutes a severity level. Keep the document up to date with typical prior examples so that these can more easily be assessed in the moment.</p>
<h1 id="future-posts">Future Posts</h1>
<p>In future posts in this series we will cover:</p>
<ul>
<li>Situation Reports</li>
<li>Incident Response Playbooks</li>
<li>Reporting on Incidents</li>
<li>(Blameless) Postmortems</li>
<li>Paying people for on-call</li>
</ul>
<p><a href="https://blog.petegoo.com/2024/01/17/incident-response-severity-levels/">Incident Response Part 2: Severity Levels</a> was originally published by Peter Goodman at <a href="https://blog.petegoo.com">PeteGoo</a> on January 17, 2024.</p>
<p>Other parts of this series on Incident Response:</p>
<ol>
<li>So you need an on-call team (this article)</li>
<li><a href="https://blog.petegoo.com/2024/01/17/incident-response-severity-levels/">Severity Levels</a></li>
<li><a href="https://blog.petegoo.com/2024/01/27/incident-response-roles/">Incident Response Roles</a></li>
</ol>
<p>As your product gains traction and expectations from customers increase, you may find that at some point things start failing. You see one or more of the following signs start to accumulate:</p>
<ol>
<li>You find out about issues from your customers, not your team or your own monitoring systems.</li>
<li>These issues are increasingly appearing outside of your working hours and sit unresolved until the next day/week.</li>
<li>Your teams are fully committed to new features and are struggling to find time to fix issues.</li>
<li>You find it hard to prioritise issues because you don’t have a clear understanding of the impact.</li>
<li>You have no clear ownership of production issues so they get passed around between teams.</li>
</ol>
<p>You may further find that these failures are starting to cost you in terms of the trust you have with your customers, the tension between people and roles internally, and possibly even the health of your engineering team.</p>
<p>You need an on-call team.</p>
<p><img src="/images/2023/delboy_mobile.jpg" alt="delboy mobile" /></p>
<h1 id="what-is-an-on-call-team">What is an on-call team?</h1>
<p>An on-call team is a group of people who are responsible for responding to production issues. They are the first line of defence when things go wrong. They are the people who are woken up in the middle of the night when your systems fail. They are the people who are responsible for restoring service to your customers.</p>
<h1 id="do-i-need-to-hire-another-team-of-people-for-on-call">Do I need to hire another team of people for on-call?</h1>
<p>No.</p>
<p>In a lot of traditional organizations, the on-call team is a separate team of people who are responsible for responding to production issues. This is a terrible idea. It creates a divide between the people who build the systems and the people who run the systems. It creates a culture of “throwing things over the wall” and “not my problem”. It creates a culture of “us vs them”. It creates a culture of “I don’t care about the quality of my work because I’m not the one who has to fix it when it breaks”.</p>
<p>Coda Hale outlines this beautifully in his talk Metrics, Metrics Everywhere:</p>
<blockquote>
<p>“Our code generates business value <strong>when it runs</strong>, not when we write it”.</p>
</blockquote>
<p>In other words, we should really care about what our code is doing when it runs, because that is when it is doing its job. If you don’t, then you’re creating art, not business value.</p>
<p>There’s another aspect at play here and that is that the people who are responsible for creating the issue are the ones who are best placed to fix it. They are the ones who have the context and the knowledge to understand the problem. They are the ones who are best placed to learn from the issue and to prevent it from happening again. If you want an efficient engineering organization then you need to shorten the time from impact to learning and then outcome (more reliable software). You can only do this if the people who feel the pain are the ones who can alleviate that same pain. Separate teams creates weird power dynamics and misaligned incentives.</p>
<h1 id="so-should-everyone-be-on-call">So should everyone be on-call?</h1>
<p>It depends. If you have a very clear service architecture and a big budget, you can have a rotation in each team, though this can get prohibitively expensive. I think that, if you have a separate SRE or Platform Engineering team, it makes sense for most of those folks to be on-call, as a lot of the incidents that occur will need some insight into the underlying platform/infrastructure. Your service/product/program teams can be a little more fluid depending on how homogeneous your services are, and how well the teams communicate changes and risks.</p>
<p>If you have a small org then just do what you can. See below for how to organize rotations and ideal rotation size.</p>
<h1 id="how-do-i-convince-my-engineers-to-go-on-call">How do I convince my engineers to go on-call?</h1>
<p>There are a number of ways of doing this, and it depends on the resources (budget) you have available to you, the size of your engineering team, the level of trust you have with your engineers, and the amount of empathy they have for each other and your customers.</p>
<ol>
<li>
<p>Start with the most dedicated, driven people</p>
<p>Chances are you probably have some people already on your team who are driven, care deeply about your customers, and are willing to go the extra mile to make sure things are working. These are the people you want to start with. They are the ones who will set the tone for the rest of the team.</p>
</li>
<li>
<p>Pay people for their time</p>
<p>If you have the budget you should pay people for their time because it’s the right thing to do. Here in New Zealand this is fairly easy to do. In the US this can be a little harder but we’ll cover this in more details in a future post.</p>
</li>
<li>
<p>Give them time in lieu for time spent responding out of hours.</p>
<p>Regardless of whether you pay people for their on-call time or not, when someone is called out of hours to respond to an issue you should give them back that time by allowing them to reclaim it from their working hours. If you are paying them for being on-call, chances are it’s not their salaried rate anyway. It will also go a long way to helping them justify the disruption to their personal lives, and those of their partners and kids.</p>
<p>Further, in my experience, this is something you have to reinforce. People will try to be heroes and power through. Gently remind them that they need to take time to recover.</p>
</li>
<li>
<p>If you can, always respond in pairs</p>
<p>We learned this at a previous company and it served us really well. <a href="https://www.cnbc.com/2019/02/28/what-google-learned-in-its-quest-to-build-the-perfect-team.html">Psychological safety at work is incredibly important</a>. Psychological safety at 3am when things are on fire is even more important. Two pairs of eyes are infinitely more reliable, safe, and effective than one. Having a copilot to make sure you’re typing the right command, clicking the right button, shutting down the right server, or whatever, is invaluable.</p>
</li>
<li>
<p>Make sure you have a clear escalation path</p>
<p>As for #4, make sure people know that they are not alone and they can always escalate. That typically means that you will be contactable yourself. You need to make it ok and make it something that you would rather they did in a time of uncertainty than not.</p>
</li>
<li>
<p>Recognise and praise the on-call team regularly</p>
<p>When there is a significant incident make sure to publicly praise the on-call team, and thank them for their contribution, no matter the outcome. Other people see this and it helps to build empathy for those that are on rotation.</p>
</li>
<li>
<p>Have a company phone plan for the on-call team</p>
<p>They may well have to hotspot wherever they are, so make sure they have a company phone plan that covers this. It’s a small thing, but it’s a nice thing to do.</p>
</li>
</ol>
<p>One of the surprising outcomes that I notice about on-call teams is that the people on-call have a much better mental model for the way that the software works. They are more active in architectural and design discussions as a result and they tend to be more effective generally. This means they get promoted faster and this gets noticed.</p>
<p>We often would find that we had a queue of people who had expressed interest in joining the on-call team. This was true even before we paid people an hourly rate for being on-call. When I asked people why they wanted to join the rotation they would tell me the same thing - it was seen as a great learning opportunity and way to grow their career.</p>
<p>To be clear, you don’t promote people because they are on-call, you promote them when they become more effective at their jobs.</p>
<p>Another observation I made though was that this good will erodes very quickly if the on-call team is getting woken up constantly, are unable to effect the outcomes they need, and are generally getting beaten up night after night. You still need a good incident response, continuous improvement, learning, and blameless culture in order to make this work.</p>
<h1 id="what-about-the-people-who-dont-want-to-go-on-call">What about the people who don’t want to go on-call?</h1>
<p>We all have lives and different priorities. Listen to their context and apply some empathy. They might have young kids, be caring for a dependent, dealing with a health issue, shared living spaces, who knows.</p>
<h1 id="organizing-rotations">Organizing rotations</h1>
<p>This is highly dependent on your specific product and team topology but here are a few guiding principles:</p>
<ol>
<li>Healthy rotations in my experience are 1 week at a time, every 4-7 weeks. This gives enough time for recovery while not being too far apart that you forget how things work or lose context on what has changed.</li>
<li>Rotations should preferably be in pairs. This is for psychological safety and to make sure that you have a copilot to help you out.</li>
<li>Swaps will happen but you need to set some ground rules like no more than two weeks on-call for any individual.</li>
<li>Handover rotations during the week, during working hours. Tuesday midday for example is good.</li>
<li>Christmas and New Year’s need special treatment. We would shorten this to 1 or 2 day rotations and make sure that the inconvenience of being on-call for key days like Christmas Day and New Year’s Day was spread out.</li>
<li>Set expectations around taking a laptop everywhere you go and not drinking alcohol or partaking in other mind altering substances while on-call.</li>
</ol>
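<p>The first few principles above (week-long shifts, pairs, midweek handovers) can be sketched as a schedule generator. This is a toy illustration, not a replacement for a proper scheduling tool like PagerDuty or Opsgenie; the names are made up:</p>

```python
from datetime import date, timedelta


def build_rotation(engineers, start: date, weeks: int):
    """Sketch a paired, week-long on-call rotation with Tuesday handovers.

    Engineers are paired off in list order (an odd engineer out is left
    unscheduled). With 8 engineers (4 pairs), each pair is on-call one
    week in four, within the healthy every-4-to-7-weeks range.
    """
    pairs = [tuple(engineers[i:i + 2]) for i in range(0, len(engineers) - 1, 2)]
    # First handover lands on the Tuesday (weekday 1) on or after `start`
    first = start + timedelta(days=(1 - start.weekday()) % 7)
    return [(first + timedelta(weeks=w), pairs[w % len(pairs)])
            for w in range(weeks)]
```

<p>Real rotations need swaps, leave, and holiday carve-outs on top of this, which is exactly why the ground rules above matter more than the generator.</p>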
<p>At a previous company we would have an SRE team and a product specific team member paged for each incident.</p>
<p>For example:</p>
<p>Product 1 alert -> SRE + Product 1 team member alerted</p>
<p>Product 2 alert -> SRE + Product 2 team member alerted</p>
<p>Very rarely did we have overlapping incidents unless there was a cloud provider failure in which case we merged the incidents anyways.</p>
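<p>That SRE-plus-product routing rule can be expressed as a small lookup. A sketch, assuming a hypothetical on-call table that you would in practice populate from your paging tool’s API (the service keys and names here are invented):</p>

```python
# Hypothetical current on-call lookup; in practice this would query your
# paging tool (PagerDuty, Opsgenie, etc.) for the active schedules.
ON_CALL = {
    "sre": "alice",
    "product-1": "bob",
    "product-2": "carol",
}


def responders_for(alert_service: str) -> list:
    """Every alert pages the SRE on-call plus the owning product team's on-call."""
    paged = [ON_CALL["sre"]]
    if alert_service != "sre" and alert_service in ON_CALL:
        paged.append(ON_CALL[alert_service])
    return paged
```

<p>The point of the rule is the pairing: the SRE brings platform context, the product engineer brings service context, and neither responds alone.</p>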
<h1 id="future-posts">Future Posts</h1>
<p>In future posts in this series we will cover:</p>
<ul>
<li>Situation Reports</li>
<li>Incident Response Playbooks</li>
<li>Reporting on Incidents</li>
<li>(Blameless) Postmortems</li>
<li>Paying people for on-call</li>
</ul>
<p><a href="https://blog.petegoo.com/2023/12/06/so-you-need-an-on-call-team/">Incident Response Part 1: So you need an on-call team</a> was originally published by Peter Goodman at <a href="https://blog.petegoo.com">PeteGoo</a> on December 06, 2023.</p>
<p>In the early days of a close-knit team of similarly experienced people you have this amazing ability to make decisions quickly. Remember that? You need to move fast and not sweat the small things. You have really important challenges to solve and you accept that getting some things wrong initially is very much ok.</p>
<p>As your team grows into many teams…guilds…chapters…or whatever (no judgement), you find that each of these small decisions are taking longer to make. You start to see that the team is spending more time discussing and debating these small decisions and you start to wonder if you should be doing something about it.</p>
<p>The thing that tends to happen at this point is that consensus has become the order of the day.</p>
<p><img src="/images/2023/football-odd-one-out.jpg" alt="consensus" /></p>
<p>There is this weird belief that consensus is a great thing to have in a group of people. I guess it is, but it only happens if the group is very, very small. Why? <strong>Because instances of consensus trend towards zero as the size of the group increases</strong>.</p>
<p>This is why many teams have discovered an alternative approach - <strong>invert the problem and instead of chasing consensus, look for dissent</strong>. If you look at the Netflix Culture and Valued Behaviours you will find this resonates with their behaviour of “Informed Captains”:</p>
<blockquote>
<p>For every significant decision, we identify an informed captain of the ship who is an expert in their area. They are responsible for listening to other people’s views and then making a judgment call on the right way forward. We avoid decisions by committee, which would slow us down and diffuse responsibility…</p>
<p>…On big strategic issues, the captain farms for dissent and other alternatives to ensure they are truly informed. Dissent can be difficult, which is why we make an effort to stimulate discussion…We don’t wait for consensus or vote by committee, nor do we drive to rapid, uninformed decision making…The bigger the decision, the more extensive the debate. Afterwards, as the impact becomes clearer, we reflect on the decision and see how we could do even better in the future.</p>
</blockquote>
<p>[<a href="https://jobs.netflix.com/culture">source</a>]</p>
<p>Similarly, Amazon discuss how they “bias for action” and “disagree and commit”</p>
<blockquote>
<p>Speed matters in business. Many decisions and actions are reversible and do not need extensive study. We value calculated risk taking.</p>
<p>Leaders are obligated to respectfully challenge decisions when they disagree, even when doing so is uncomfortable or exhausting. Leaders have conviction and are tenacious. They do not compromise for the sake of social cohesion. Once a decision is determined, they commit wholly.</p>
</blockquote>
<p>[<a href="https://www.amazon.jobs/content/en/our-workplace/leadership-principles">source</a>]</p>
<p>I really love the term “farming for dissent”. It recognises that there is work involved in getting people to speak up and that it is an active, not passive, approach.</p>
<p>In Amazon’s case this idea of reversible decisions is best described by the term “one way door vs two way door”. I’ve found this approach really useful in evaluating the risk of making a particular decision.</p>
<h1 id="how-can-farming-for-dissent-go-wrong">How can farming for dissent go wrong?</h1>
<p>The key thing to remember is that dissent is not about being contrarian. It’s not about being difficult or awkward. It’s about being informed and having a different perspective.</p>
<p>Another failure mode of this approach is that it often favours the “loudest voice in the room”. Being loud and opinionated is not the same as being informed and having a different perspective. It’s important to make sure that you are hearing from all voices in the room and that you are giving people the space to be heard. This is a skill that leaders need to develop.</p>
<h1 id="building-consensus">Building Consensus</h1>
<p>I firmly believe that you need to have some level of confidence in your proposal before you put it out there for broad dissent. Why? Because psychological safety is a basic need, we all feel a little impostor syndrome and self-doubt, and some people can suck at delivering constructive feedback.</p>
<p>The answer is to socialise your ideas with a wing-person or two. This is a great way to get some feedback and to build confidence in your ideas. It’s also a great way to get some feedback on how you are presenting your ideas. You can figure out how best to land the message and how to make sure you are heard.</p>
<p>I often form strong opinions after layers of socialising my thoughts through consecutive circles of trust. By the end I have stronger reasoning and greater confidence.</p>
<h1 id="so-how-do-i-know-ive-made-an-informed-decision">So how do I know I’ve made an informed decision?</h1>
<p>Basically you are looking for the Goldilocks effect of feedback. You want enough feedback to make an informed decision but not so much that you are paralysed by it.</p>
<p>Don’t just stick to your wing-person or your own echo chamber - actively seek out dissent. Discard what is irrelevant and focus on the feedback that challenges your assumptions.</p>
<h1 id="summary">Summary</h1>
<p>Consensus has broken many great organisations. Try to invert the model, seek feedback and farm for dissent. Then commit to a decision, document it, and move on.</p>
<p><a href="https://blog.petegoo.com/2023/09/21/consensus-momentum-dissent/">Stop chasing consensus, start building momentum and farming for dissent</a> was originally published by Peter Goodman at <a href="https://blog.petegoo.com">PeteGoo</a> on September 21, 2023.</p>
<p>Traditional VPNs have been the go-to solution for many companies when considering how best to secure access to their internal tools on the public internet. With the widespread adoption of the hybrid office and remote working in the COVID era, the use of VPNs has increased significantly.</p>
<h2 id="y-u-no-vpn">Y U NO VPN?</h2>
<p>However, the traditional VPN approach has come under scrutiny in recent years due to a number of fundamental flaws in its design and implementation.</p>
<h3 id="tunnelcrack-and-other-vulnerabilities">TunnelCrack and other vulnerabilities</h3>
<p>In August 2023 a vulnerability present in most VPNs was uncovered which could allow an attacker to convince a VPN client that a secured site behind the VPN was actually a local resource. Once in place, the attacker could essentially steal any data that was intended for the target. This vulnerability was dubbed “TunnelCrack” and it illustrates one of the main flaws of the typical VPN architecture, as we’ll see later - that not all networks should be treated the same.</p>
<p>If you visit the CVE databases you will also find <a href="https://www.cvedetails.com/vulnerability-list/vendor_id-3278/Openvpn.html">endless</a> <a href="https://www.cvedetails.com/vulnerability-list/vendor_id-628/product_id-12675/Sonicwall-Global-Vpn-Client.html">disclosures</a> <a href="https://www.cvedetails.com/vulnerability-list/vendor_id-12126/product_id-112852/version_id-687403/Amazon-Aws-Client-Vpn-2.0.0.html">of</a> <a href="https://www.cvedetails.com/vulnerability-list/vendor_id-16/product_id-1887/Cisco-Vpn-Client.html">vulnerabilities</a> in just about any VPN implementation out there. Unpatched, one of these could become an existential threat for any network architecture.</p>
<h3 id="network-level-access">Network level access</h3>
<p>Typically VPNs give you access to a network, or part of that network. They essentially route all traffic bound for a certain subnet or significant range of IP addresses over the VPN. This has the unfortunate side effect that the type of traffic is unbounded. Often, though, we know that the individual applications we want to expose over the VPN have a single type of traffic, like simple web applications that use https, and we don’t need the rest of the network to be exposed.</p>
<p><img src="/images/2023/vpn-example.png" alt="vpn-example" /></p>
<p>It’s like the difference between allowing someone to make a phone call to a person in your organisation through your switch board vs driving the caller in an armoured car to your office door, then letting them loose inside.</p>
<p>Now, your particular network topology likely has a way to limit these impacts. If you are using AWS, for example, you can minimise this by implementing security groups that cross-reference one another to allow fine-grained network-level access, but this can be difficult to manage and can become impossible at scale due to inherent limits on the number of rules you can define.</p>
<h3 id="the-local-network-becomes-inherently-trusted">The local network becomes inherently “trusted”</h3>
<p>When we take this border-focused approach to our network topology there’s a really interesting side effect. This design drives us down a path where we treat anything “inside” the network in which our applications reside as “trusted”. Once inside that network there is little to stop you moving around, so compromising a VPN connection can lead to very dire consequences.</p>
<p>What if we were to assume that no networks are inherently “safe”? Well, this is where Zero Trust comes in.</p>
<h3 id="aside-why-not-yolo-your-apps-on-the-public-web">Aside: Why not YOLO your apps on the public web?</h3>
<p>Before we tackle zero-trust we need to answer this question. It might seem like a good idea to put your applications on the public web, I mean they all can implement authentication, right?</p>
<p>Well, here’s the thing: <strong>I don’t trust anybody to secure the entire surface area of their web application</strong>.</p>
<p>What do I mean by that? Well, let’s take an example - you have a build server of some sort, maybe it’s Jenkins, that has its own login page, so you put it on the public web. Now we all know that it needs to implement good brute force defence, well thought out password reset flows etc, so maybe we choose the SAML or OIDC option that it implements so we can defer all that stuff to our enterprise IdP like Okta. Problem solved, right? Now Okta takes care of all of our authentication concerns, right? Right?</p>
<p>Well, what about all those other endpoints and pages on that app - have they remembered to implement authentication on all of those and make sure none of them are exposing unauthenticated functionality or other vulnerabilities? What about in the next update, with the next feature, and the one after that?</p>
<p><em>Incidentally, choose OIDC over SAML if you can. There are many flawed SAML implementations out there and OIDC is more performance-friendly and easier to implement.</em></p>
<h2 id="zero-trust">Zero-Trust</h2>
<p>I first came across the term Zero Trust when Google <a href="https://www.beyondcorp.com/">published the BeyondCorp set of guidance</a>. The major aha moment for me was the idea that, no matter if you were in a Google building or working remotely, your access to applications necessary to do your job was via the same set of controls. The major feature of those controls was an “Access Proxy”, sometimes called an “Identity-Aware Proxy”.</p>
<p><img src="/images/2023/access-proxy-example.png" alt="access-proxy-example" /></p>
<p>The idea of this proxy is that it acts as a gateway: no traffic gets through to the target app unless it has been authenticated. There is a single implementation, with a very small surface area, guarding everything behind it.</p>
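<p>As a toy sketch of that gateway decision (not a production proxy - the request shape and the in-memory session table here are invented for illustration, standing in for real OIDC token validation), every request faces the same check before any application sees it:</p>

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Request:
    path: str
    session_token: Optional[str] = None

# Invented stand-in for real token validation (OIDC code exchange,
# JWT signature checks, session lookup against the IdP, etc.)
VALID_SESSIONS = {"session-abc": "alice@example.com"}

def route(request: Request) -> str:
    """The access proxy's core decision: forward or bounce to the IdP."""
    user = VALID_SESSIONS.get(request.session_token or "")
    if user is None:
        # Unauthenticated traffic never reaches the target app
        return "302 -> https://idp.example.com/authorize"
    # Authenticated: forward upstream with the identity attached
    return f"200 forward {request.path} as {user}"
```

<p>The point is where the check lives: in one small, auditable component in front of everything, rather than scattered across every endpoint of every internal app.</p>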
<p>When I first started playing with Zero Trust I was using <a href="https://github.com/oauth2-proxy/oauth2-proxy">oauth2-proxy</a>. It was fine but I had to run it myself on an EC2 instance, make sure it was the latest version and generally feed it myself.</p>
<h2 id="zero-trust-access-proxies-in-aws">Zero Trust Access Proxies in AWS</h2>
<p>In AWS we typically host our applications behind an Application Load Balancer (ALB). This allows us to choose to run the application in a container or on an EC2 instance while at the same time scaling it out and offloading TLS and other concerns.</p>
<p>A feature of the listener rules within an ALB is that you can specify that no traffic is allowed through unless the client <a href="https://docs.aws.amazon.com/elasticloadbalancing/latest/application/listener-authenticate-users.html">has been authenticated via OIDC or Amazon Cognito</a> (which in turn can support social login, SAML etc).</p>
<p><img src="/images/2023/aws-alb-rv-proxy.png" alt="Diagram of a AWS ALB Architecture" /></p>
<p>In this way, no traffic is allowed past the ALB unauthenticated. You can even pass through the <a href="https://docs.aws.amazon.com/elasticloadbalancing/latest/application/listener-authenticate-users.html#user-claims-encoding">access token, identity and claims as headers</a> to the target application once authenticated.</p>
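<p>On the application side, the user’s claims arrive in the <code>x-amzn-oidc-data</code> header as a JWT. A minimal sketch of reading them (illustration only - a real app must first verify the token’s signature against the ALB’s public key, which this deliberately skips):</p>

```python
import base64
import json

def claims_from_oidc_data(jwt: str) -> dict:
    """Decode the payload segment of the x-amzn-oidc-data JWT.

    WARNING: no signature verification here -- never trust these
    claims in production without validating the token first.
    """
    payload = jwt.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # tolerate missing base64url padding
    return json.loads(base64.urlsafe_b64decode(payload))
```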
<p>A great benefit of this approach is that AWS is now responsible for patching, scaling the hardware, and guaranteeing the safety of this component. In compliance terms you’ve moved more things onto their side of the shared responsibility model.</p>
<h3 id="aws-verified-access">AWS Verified Access</h3>
<p>A newer service from AWS that abstracts all of the ALB configuration away so that you can deploy a private ALB and still proxy authenticated traffic through to your applications is <a href="https://aws.amazon.com/verified-access/">AWS Verified Access</a>.</p>
<p>This effectively allows you to do the same thing but you don’t have to configure OIDC on each load balancer and instead can centrally configure it once. You can also use <a href="https://aws.amazon.com/blogs/opensource/using-open-source-cedar-to-write-and-enforce-custom-authorization-policies/">Cedar</a> policies and device claims like MDM certificates to further implement your Zero Trust posture. The only downside here for me is the cost: Verified Access will cost at least US$200 / month / application whereas an ALB is only US$23 or so, depending on configuration and usage.</p>
<h2 id="zero-trust-access-proxies-in-gcp">Zero Trust Access Proxies in GCP</h2>
<p>In GCP you can implement something very similar using Google’s <a href="https://cloud.google.com/iap">Identity Aware Proxy service</a>.</p>
<p><img src="/images/2023/iap-load-balancer.png" alt="Google IAP" /></p>
<h2 id="zero-trust-access-proxies-in-azure">Zero Trust Access Proxies in Azure</h2>
<p>In Azure the closest thing is (I think) <a href="https://learn.microsoft.com/en-us/azure/active-directory/app-proxy/what-is-application-proxy">Azure AD App Proxy</a>, although I’m not as familiar with this.</p>
<h2 id="summary">Summary</h2>
<p>So really, if all you need to do is safely secure some web-based applications like your CI/CD tools, reporting, and other internal tooling but keep them accessible over the public internet then you’re better off dropping the VPN and implementing a Zero-Trust Access Proxy / Authenticating Reverse Proxy. It’s likely cheaper and safer in the long run.</p>
<p><a href="https://blog.petegoo.com/2023/09/14/zero-trust-proxies-aws-alb/">Zero Trust Authenticating Reverse Proxies in AWS Application Load Balancers</a> was originally published by Peter Goodman at <a href="https://blog.petegoo.com">PeteGoo</a> on September 14, 2023.</p>
<p>Over the last decade or more I’ve been involved in a lot of incident response activity and the subsequent post-mortems or post-incident reviews (a good thing). I’m always incredibly interested in the behaviours of incident responders and just how difficult it is to remember all the right things to do when you are the one sitting in the hot seat.</p>
<p>At times I have seen first hand that frozen state where you forget how everything works and have no idea where to look for clues. At other times I’ve seen that all-too-common scenario where the incident response team is busy looking around, diagnosing, investigating but they have forgotten to tell anyone that there is an incident going on so that customers and stakeholders can be informed.</p>
<p>A few years ago I came to the realisation that there are two distinct lists that you should keep for your incident response team.</p>
<ul>
<li>Lines of Enquiry</li>
<li>Lines of Communication</li>
</ul>
<p>I decided to put these in a wiki page that is easily at hand. Our incident response bot links to these pages so folks can easily find them when they think “Lines of what again?”.</p>
<h3 id="lines-of-enquiry">Lines of Enquiry</h3>
<p>These are going to be different for almost every organisation but there are some common themes that you should try to hit.</p>
<p>Firstly, all things happen for a reason and that reason is almost always a change of some sort. Use your experience and trawl through your incident reports. Try to think of the most common changes that cause issues. These are typically deployments, infrastructure changes, patching, flag flips, etc.</p>
<p>For us, this looks like:</p>
<blockquote>
<ol>
<li>What did we change?</li>
</ol>
<p>Very often a system starts to misbehave after a change of some sort</p>
<ul>
<li>Was there a recent deploy?
<ul>
<li>Did the issue begin happening just after or during a new deployment?</li>
<li>Is there a chance that the deployed change is related?</li>
<li>Is the deployment safe to revert?</li>
<li>Deploy the previous release asap and continue to investigate while you wait for the result</li>
</ul>
</li>
<li>Are we currently patching machines?
<ul>
<li>Was there a change in machine images that could have caused the issue?</li>
<li>Was there a recent update of the operating system or a component?</li>
</ul>
</li>
<li>Has a related workload been deployed recently?</li>
<li>Have we just deployed some terraform changes?</li>
</ul>
</blockquote>
<p>Next you want to think about any changes in traffic patterns. If we didn’t change something explicitly, maybe the behaviour of our users has changed or we could be under some kind of volumetric attack.</p>
<blockquote>
<p>2. Has our traffic changed?</p>
<ul>
<li>Is there more traffic load?</li>
<li>Are we seeing unusual load on certain endpoints?</li>
<li>Is it a <insert seasonal outliers e.g. end of month, end of year>?</li>
<li>Are we processing a large number of queued messages / workload A / workload B / workload C?</li>
</ul>
</blockquote>
<p>It’s a good idea to give the readers links to dashboards etc here. Nobody wants to be messing around with subpar wiki search engines during an incident.</p>
<p>Next up, it’s always good to check if it’s someone else’s stuff that’s broken. I’ve found that, as your site reliability grows, it becomes painfully obvious how terrible other people are at theirs. It is not uncommon to be the one to make someone else aware their stuff is fried.
One of your best tools is DownDetector or, my personal favourite, just searching Twitter.</p>
<blockquote>
<p>3. Has one of our partners had a fault?</p>
<ul>
<li>Is <insert cloud vendor> reporting issues?
<ul>
<li><link to their status page></li>
<li>Single AZ failure? Regional?</li>
</ul>
</li>
<li>Twilio / SendGrid / Vendor A / Partner B ?</li>
<li>Global CDNs? CloudFlare? Fastly? Akamai?</li>
<li>DNS providers?</li>
<li>Mobile Carriers? AT&T? Verizon?</li>
<li>Large networking providers? Comcast? BGP again?</li>
</ul>
</blockquote>
<p>Then there’s a list of things that can change without any action by a human.</p>
<blockquote>
<p>4. What could have been changed on us?</p>
<ul>
<li>Could the SQL query optimizer have created a bad plan in the database that is impacting our query performance? Purge the worst plans.</li>
<li>Could a scheduled maintenance job have kicked in?</li>
<li>Could a container have recycled?</li>
<li>Autoscaling kicked in/out?</li>
</ul>
</blockquote>
<p>Now you want to think about less recent changes that may only be impacting now because of some confluence of events.</p>
<blockquote>
<p>5. Could this be the first time we have done this since a change?</p>
<ul>
<li>It may be the first time a certain type of scheduled activity has run since a change.</li>
<li>It could be that we have just scaled out our cluster since a change to the machine images, configuration, code.</li>
</ul>
</blockquote>
<p>Lastly it’s a good idea to think about adjacencies. You never know when you might benefit from that extra context which is going to challenge your assumptions.</p>
<blockquote>
<p>6. What else could be affected?</p>
<p>Look for clues from services with the same dependencies etc to see if they also have issues but are not producing alerts in the same way.</p>
<h1 id="lines-of-communication">Lines of Communication</h1>
</blockquote>
<p>My personal failure mode in the past has been getting lost in the details of an incident and forgetting to inform other people. Those have led to some of the hardest post-incident conversations. It’s understandable, but it’s also incredibly frustrating when you find out way too late that something is going on, or customers are calling up to complain and you have nothing to tell them, or someone says “hey, tell me about this ongoing incident for <that thing you’re responsible for>” and you have no idea what they are talking about.</p>
<p>So here are some starting points for lines of communication.</p>
<blockquote>
<ol>
<li>Have you got an Incident Commander?
<ul>
<li>You’re in Incident Response. You need one. Escalate until you have one</li>
</ul>
</li>
<li>
<p>Have you told Customer Support / Success that customers could be affected?</p>
</li>
<li>Do you need to let your Manager / Director / VP know?
<ul>
<li><Insert your own escalation policy here e.g. If it’s Severity 1-3 then you should…></li>
</ul>
</li>
<li>Should we post a StatusPage?
<ul>
<li><insert your orgs philosophy on when this is appropriate and who makes the decision></li>
</ul>
</li>
<li>Have we opened a ticket with <cloud provider> / <Vendor A> / <Partner B>?
<ul>
<li>Let them know they have a problem</li>
</ul>
</li>
</ol>
</blockquote>
<p>Hopefully these two lists will serve you well. Remember that they are there, use them, and feed them regularly…</p>
<ul>
<li>If you’re in an incident, everything is burning and you don’t know what to do - <strong>Lines of Enquiry</strong></li>
<li>If you’re in an incident and you get that horrible feeling that you forgot to do something - <strong>Lines of Communication</strong></li>
<li>If you have a few spare minutes while you wait to see if something has had an impact - <strong>Lines of Communication</strong></li>
</ul>
<p><a href="https://blog.petegoo.com/2023/02/22/incident-response-lines-of-communication-enquiry/">Lines of Communication and Lines of Enquiry in Incident Response</a> was originally published by Peter Goodman at <a href="https://blog.petegoo.com">PeteGoo</a> on February 22, 2023.</p>
<p>One of the things I find most interesting about my role these days is that I get to talk with such a wide variety of people across our engineering group and hear the various challenges they face. As a result I get to observe common patterns and themes that often stem from similar issues. I’d like to talk about one of those common issues - Batch Size.</p>
<h3 id="my-experience">My experience</h3>
<p>When I joined my current company many years ago now there was something about the way that we worked that took me by surprise. It was fast, but more than that I felt more focused, I was more engaged than I had ever been and I was learning rapidly.</p>
<p>We deployed multiple times per day. We were reviewing each other’s code multiple times per day. Our poor QA (singular) was getting smashed but managing to keep up with this mayhem. How was this possible? I’d never seen this pace before.</p>
<p>The reason it worked was that we had made each change or “Pull Request” small and incremental. We had a saying that was basically “do the smallest, dumbest thing you can to learn the next thing”.</p>
<h3 id="the-science-bit">The science bit</h3>
<p>The advantages of breaking tasks down into smaller chunks is something we have all experienced in lots of aspects of our lives.</p>
<p>In <a href="https://en.wikipedia.org/wiki/Lean_manufacturing">manufacturing</a> and economics circles this is sometimes referred to as Lot or Batch Size and it’s an important factor in the throughput and efficiency of any system. Don Reinertsen does a <a href="https://www.youtube.com/watch?v=zVASqSj_kvc">really good job of explaining the theory</a> but ultimately reducing the size of the changes we make leads to reduced cycle times, consistent flow, faster feedback, reduced risk, fewer overheads, greater efficiency, higher motivation and reduced costs.</p>
<p>Our batch is a Pull Request. It is the car in our assembly line.<br />
<sub><em>(If you have a long-lived feature branch then your batch is the feature branch)</em></sub></p>
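<p>Reinertsen’s economics can be sketched with a toy model (the numbers are mine, purely illustrative): each unit of work takes some time, each batch pays a fixed transaction cost (review, QA, deploy), and we measure the average time until a unit ships. With cheap, automated transactions, small batches cut cycle time decisively; with expensive ones they don’t - which is exactly why investing in the pipeline matters:</p>

```python
def avg_cycle_time(units: int, batch: int, work: float, overhead: float) -> float:
    """Average time from start until each unit of work ships,
    given a fixed per-batch overhead (review, QA, deploy)."""
    t = total = 0.0
    shipped = 0
    while shipped < units:
        size = min(batch, units - shipped)
        t += size * work + overhead  # finish and ship this batch
        total += t * size            # every unit in it ships at time t
        shipped += size
    return total / units

# Cheap transactions (automated CI/CD): small batches win
fast = avg_cycle_time(12, batch=1, work=1.0, overhead=0.1)   # ~7.15
slow = avg_cycle_time(12, batch=12, work=1.0, overhead=0.1)  # ~12.1
```

<p>Flip the overhead up (say, a manual release process costing 10 units of time per deploy) and the model favours big batches again - reducing that transaction cost is what makes “the smallest, dumbest thing” economical.</p>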
<h3 id="how-does-this-benefit-me-you">How does this benefit me (you)?</h3>
<p>The weird thing about my experience was that it also caused me to change my behaviour and the relationship I had with my code. I used to write a bunch of code on my machine, add to it, add more, refactor, add more, test it, clean it up, write tests (whoops), double check it, triple check it, then eventually let someone else see it when I knew it was safe for me to do so. I was optimising for never being wrong, not learning to improve. I built up so much stress during this process that it was a rollercoaster of fear and insecurity. <a href="https://rework.withgoogle.com/blog/five-keys-to-a-successful-google-team/">Not good</a>.</p>
<p>In my current company however, if I had more than a day’s work on my machine I started to get nervous, like I was walking around with a wallet full of too much cash.</p>
<p>There are other benefits though. Code reviews are quick and easy. More tests are automated. QAs are able to focus on what matters. Product and UX can provide timely feedback. Course corrections happen earlier, before effort is wasted. Incident impacts are smaller and downtime is shorter because when things go wrong in production it’s easy to see what changed. We do fewer revolutions, big refactorings and rewrites. Long-lived feature branches and merge conflict resolutions are a thing of the past. We focus on continually providing value, learning from how our customers use our software and responding to their changing needs.</p>
<p>This isn’t to say that we shouldn’t design software or plan how we will implement it. This is still a skill we need to exercise, however no design or plan is perfect so why would we wait to find that out?</p>
<p><a href="https://blog.petegoo.com/2022/02/18/small-prs-and-batch-size/">Small Pull Requests and Batch Size</a> was originally published by Peter Goodman at <a href="https://blog.petegoo.com">PeteGoo</a> on February 18, 2022.</p>
<p>How inclusive is the culture of architecture and design within your teams? Is the way you communicate these designs and intent accessible to everyone? Can everyone in your organisation easily contribute ideas and share their experience?</p>
<p>For the last few years I’ve been trying to actively spend more time drawing code, architectures and concepts rather than just talking about them. I feel like this was once a lot more common in the places I have worked but seems to have become something of a rarity.</p>
<p><img src="/images/2020/this-is-my-picture.gif" alt="marker-winning" /></p>
<h2 id="the-problem-with-words">The problem with words</h2>
<p>When I first moved to New Zealand I found myself working for a company building legal software. On day one an architect took me into a room and started explaining the architecture of the product in the terminology of <a href="https://martinfowler.com/tags/domain%20driven%20design.html">Domain Driven Design (DDD)</a>, a concept I had never come across before that is packed full of specific terminology and concepts. So much so, in fact, that <a href="https://www.amazon.com/Domain-Driven-Design-Tackling-Complexity-Software/dp/0321125215/ref=sr_1_1?keywords=domain+driven+design&qid=1582796070&sr=8-1">the bible of DDD by Eric Evans</a> is 560 pages of pure gold but takes quite a few reads before it really sinks in. <em>Ironically, one of the main goals of DDD is to reduce confusion between individuals and teams by devising a shared ubiquitous language.</em></p>
<p>So here I was getting a massive download of architecture and historical context. Like a typical <a href="https://en.wikipedia.org/wiki/Impostor_syndrome">imposter dev</a>, I wasn’t going to admit that a lot of these new words and acronyms were completely foreign to me.</p>
<p>So how then was I able to understand any of what was being said at the time?</p>
<p>The simple answer is that he drew on a whiteboard with a marker as he talked. The result was that this impervious wall of nomenclature washed over me while the picture filled in the blanks that were left behind.</p>
<h2 id="so-why-are-pictures-important">So why are pictures important?</h2>
<p><strong>Drawing pictures in front of people is a <em>room leveller</em>.</strong></p>
<p>We have so much assumed context when we present to or talk with our peers that we often forget that we may have completely different backgrounds and experiences. For example:</p>
<ul>
<li>We may not all have the same level of confidence in groups and so we may not ask questions</li>
<li>We may not all have the same first spoken language</li>
<li>We may not all come from the same social circle / tech / culture or country</li>
<li>We may not all have gone to college / university or read the same comprehensive book on exotic and seldom used design patterns</li>
</ul>
<p>Simple diagrams can transcend these differences. The box that represents a thing. The line that represents a relationship of some sort, traffic, data or control flow. A big cloud of amorphous internet. A stick figure user. These shapes describe abstract concepts more universally than any specific words could.</p>
<h2 id="why-is-drawing-pictures-important">Why is drawing pictures important?</h2>
<p>There is more intent and meaning communicated in the act of drawing than just the end product.</p>
<p>I’ve often had the experience where I’ve tried to recreate a successful whiteboarding session to someone else using only the finished picture, expecting them to instantly gain the same understanding I did when it was drawn. Except now, they just look at me blankly. Why? I may even look back at the drawing and think, this is nonsense. These are the insane scribblings of an unhinged individual.</p>
<p>Watching a drawing unfold in front of you as the intent is explained builds understanding as the picture evolves. While the conversation continues the breadcrumbs of how that understanding was formed are still there in full view, reinforcing the conceptual model we have built as it is committed to memory. At the same time it allows the presenter to add a layer of language and terminology on top of that new understanding.</p>
<p>That newly accumulated understanding, language and terminology will always be rooted in the memory of the drawing. A drawing that is relevant and helpful. Without these visual memories I tend to associate what I learned with the shoes of the presenter, the smell of the room or the colour of the walls.</p>
<p>These visual anchors that are still on the board are even more useful when participation is encouraged. The terminology and scope can be expanded as those present contribute to the refinement process. This interactive part is where the diagram really comes into its own. Use it as others expand on your ideas and provide alternatives. When questions are asked, point back to the relevant elements as you answer. You may even find yourself drawing more elements or just pointing as others talk and refine your explanation.</p>
<p><img src="/images/2020/MVIMG_20191004_161559.jpg" alt="this-is-my-picture" /></p>
<h2 id="how-to-do-it">How to do it</h2>
<p>So there are some rules to follow when whiteboarding / drawing for a group</p>
<ul>
<li>No UML!!!</li>
<li>Limit the predefined shapes. If you need to draw a pipe for a queue, explain what it is and why a pipe works.</li>
<li>A database might be a cylinder but reinforce what it is as you draw it</li>
<li>Stick to boxes and lines as much as possible</li>
<li>Sure, add arrows for direction</li>
<li>Use colour sparingly. Two, three colours. After that people need a legend.</li>
<li>Try to avoid sequence diagrams. They don’t work well for asynchrony or fan out / fan in.</li>
<li>Don’t mix different levels of abstraction in the same diagram. Use an inset or callout box for detail.</li>
<li>No fricking UML!!! Nobody cares that you took the time to learn it once.</li>
</ul>
<h2 id="running-an-open-inclusive-session">Running an open inclusive session</h2>
<p>It’s a really good idea to encourage this style of presentation and communication within an organisation. Don’t just settle for the same individuals talking at the others in your teams. Diagramming and whiteboarding can make your workplace more inclusive and democratise architecture and design.</p>
<p>In our office Friday at 3pm is “This is my picture” time.</p>
<ul>
<li>It lasts an hour</li>
<li>We have beer and pizza at 4:30 so people are winding down anyway - might as well capitalise on the reluctance people have to start deep thinking.</li>
<li>We have a Slack channel where we remind folks that it is on</li>
<li>We make a list on the board of carry over items from last week</li>
<li>We ask for ideas, these can be any of:
<ul>
<li>Something you want to build</li>
<li>Something you have built</li>
<li>Something you are building</li>
<li>Something that exists</li>
<li>Something you can draw</li>
<li>Something you want to see someone else draw</li>
<li>Something you know</li>
<li>Something you want to know</li>
</ul>
</li>
<li>List the ideas on the board</li>
<li>Now ask for a show of hands on each item in turn</li>
<li>Mark the votes beside each item</li>
<li>Start with the highest voted item and ask for volunteers to draw it</li>
<li>If nobody is willing to draw it or it doesn’t get enough votes then it can carry over</li>
<li>Almost always it spills over in discussion to drinks and pizza</li>
<li>Live-Slack the list of ideas so that stragglers can opt in</li>
<li>Take photos of the presenters with the end product. Like a kid holding up their picture to the class. Post them to the Slack channel for posterity.</li>
<li>Don’t be afraid to repeat ideas weeks or months later. Try a different presenter. There will be new people present and/or different perspectives in the room.</li>
</ul>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">I run an open whiteboarding session on Fridays in our office. For Christmas my team made me some magnetic icons. All the usual ones are there AWS Buckets, Route 53, EC2, MLP, potatoes... <a href="https://t.co/KSt3n65zrP">pic.twitter.com/KSt3n65zrP</a></p>— Peter Goodman (@petegoo) <a href="https://twitter.com/petegoo/status/1207413304877932544?ref_src=twsrc%5Etfw">December 18, 2019</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<h2 id="faq">FAQ</h2>
<p><strong>Q</strong>. What about having multiple people working together on a board?<br />
<strong>A</strong>. This is ok for a discussion between two people but is confusing for a group. Try to avoid unless the two presenters have a good collaborative presentation style. You’re telling a story after all.</p>
<p><strong>Q</strong>. What about remote teams / individuals?<br />
<strong>A</strong>. Well, sometimes we try to do it on Hangouts and/or record it. This is <em>ok</em>. YMMV.</p>
<p><strong>Q</strong>. What about digital / online tools?<br />
<strong>A</strong>. Keen to hear suggestions, but these can be prohibitively expensive and clunky.</p>
<p><strong>Q</strong>. Is this just for developers?<br />
<strong>A</strong>. Absolutely not. I’ve been trying to get more QAs, UX, Designers, Product people involved.</p>
<p><a href="https://blog.petegoo.com/2020/02/26/this-is-my-picture/">This is my picture: Why you should be drawing your systems and code</a> was originally published by Peter Goodman at <a href="https://blog.petegoo.com">PeteGoo</a> on February 26, 2020.</p>https://blog.petegoo.com/2018/11/15/better-octopus-registration2018-11-15T00:00:00+13:002018-11-15T00:00:00+13:00Peter Goodmanhttps://blog.petegoo.comblog@petegoo.com<p>We use <a href="https://octopus.com">Octopus Deploy</a> for a lot of our deployments. It has great primitives that help us create simple, reliable, repeatable deployment processes.</p>
<p>If you are using it to deploy code onto machines you typically use the supplied agent, installed on your machine. Octopus calls this agent a Tentacle, naturally. The Tentacle registers itself with the API on the Octopus server. Your machine will define which roles it would like to perform and these roles can be used to define which deployment steps will run for that machine.</p>
<h2 id="the-problem">The Problem</h2>
<p>There is an issue with this registration approach however. In order to be able to register, the Tentacle needs an API key. You can scope that API key to an Environment like Test, Prod etc but not, as far as I can tell, to a role. Therefore <strong>the tentacle can ask to be any role it likes</strong>. Even if you could restrict the API key to a role, managing the keys and their scopes would be a nightmare.</p>
<p>So now we know that a machine we intend to be <code class="language-plaintext highlighter-rouge">non-sensitive-role</code>, if compromised by an attacker, can register (or re-register) itself as <code class="language-plaintext highlighter-rouge">very-sensitive-role</code>, essentially creating a form of <a href="https://en.wikipedia.org/wiki/Network_Lateral_Movement">lateral movement</a>. For example, if <code class="language-plaintext highlighter-rouge">very-sensitive-role</code> delivered some code with a database connection string and password from Octopus variables, but <code class="language-plaintext highlighter-rouge">non-sensitive-role</code> was never designed to get those secrets, then we have a problem.</p>
<p><img src="/images/2018/octopus_reg_old.png" alt="old octopus registration" /></p>
<h2 id="the-solution">The Solution</h2>
<p>So how do we get around this? Well, we wanted to eliminate the reliance on the machine telling us what role it should be. Our machines are in AWS and we can use EC2 Tags to add metadata to them. So we add an <code class="language-plaintext highlighter-rouge">OctopusRole</code> tag when AWS creates the machine (EC2 instance), with the name of the role(s) intended for that machine. You can also add <code class="language-plaintext highlighter-rouge">OctopusMachinePolicy</code> if you want.</p>
<p>Then when we want to register the machine on startup, we remove the need for access to the API by instead publishing an SNS message that simply has the EC2 Instance Id and any other useful information like the Tentacle thumbprint.</p>
<p>This SNS message triggers a Lambda which uses the EC2 APIs to query the metadata for the instance. The lambda then registers with Octopus on behalf of the instance. Octopus will subsequently reach out to the machine to establish the connection to the listening Tentacle agent.</p>
<p><img src="/images/2018/octopus_reg_new.png" alt="new octopus registration" /></p>
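<p>A minimal sketch of that Lambda, assuming a hypothetical SNS message shape and illustrative Octopus field names — this is not our exact code, and the registration payload shape is an assumption:</p>

```python
import json


def build_registration(instance_id, private_ip, thumbprint, tags, environment_id):
    """Build an Octopus machine-registration payload from EC2 tag metadata.

    The roles come from the OctopusRole EC2 tag, so the machine itself never
    gets to claim a role it was not given. Field names are illustrative, not
    an exact Octopus API contract.
    """
    roles = [role.strip() for role in tags["OctopusRole"].split(",")]
    return {
        "Name": instance_id,
        "Roles": roles,
        "EnvironmentIds": [environment_id],
        "Endpoint": {
            # Listening Tentacle: the Octopus server reaches out to the machine
            "CommunicationStyle": "TentaclePassive",
            "Uri": "https://{}:10933/".format(private_ip),
            "Thumbprint": thumbprint,
        },
    }


def handler(event, context):
    """SNS-triggered entry point (hypothetical message shape)."""
    import boto3  # available in the AWS Lambda runtime

    message = json.loads(event["Records"][0]["Sns"]["Message"])
    ec2 = boto3.client("ec2")
    reservations = ec2.describe_instances(InstanceIds=[message["instance_id"]])
    instance = reservations["Reservations"][0]["Instances"][0]
    tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
    payload = build_registration(
        message["instance_id"],
        instance["PrivateIpAddress"],
        message["thumbprint"],
        tags,
        environment_id="Environments-1",  # or resolve from another tag
    )
    # POST payload to the Octopus server's machines API using an API key held
    # by the Lambda, never by the instance itself.
    return payload
```

<p>The key property is that the API key and the role decision both live on the Lambda side; the instance only ever supplies its identity and thumbprint.</p>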
<h2 id="what-did-we-learn">What did we learn?</h2>
<p>Basically, validate your client inputs. They can lie like terrible lying things.</p>
<p>Lambda is a great piece of glue you can use to solve these types of problems. Now you can even write them in PowerShell should you so desire. Personally I write most of mine in Python, but you can choose your poison without needing to change this pattern.</p>
<p><a href="https://blog.petegoo.com/2018/11/15/better-octopus-registration/">Better Octopus Registration</a> was originally published by Peter Goodman at <a href="https://blog.petegoo.com">PeteGoo</a> on November 15, 2018.</p>https://blog.petegoo.com/2018/11/09/optimizing-ci-cd-pipelines2018-11-09T00:00:00+13:002018-11-09T00:00:00+13:00Peter Goodmanhttps://blog.petegoo.comblog@petegoo.com<p>Over the last 4 years I’ve often found myself in conversations with fellow engineers about our build and deployment process and how
we feel it has become slower or is somehow causing more friction.</p>
<p>Eventually, as you live with a release process that you use every day, you will find that you have these conversations relatively often. So how do you go about figuring out whether you have issues, and where they are?</p>
<h2 id="tldr">TL;DR</h2>
<ul>
<li>Draw it out</li>
<li>Be mindful of your bias</li>
<li>Measure everything</li>
<li>Gather feedback on outliers</li>
<li>Split and parallelize steps</li>
<li>Look for human wait times</li>
<li>Help your engineers solve their own problems earlier, before they become everyone else’s problem</li>
</ul>
<h2 id="why-measure">Why measure</h2>
<p>We all know that smaller, more frequent releases help us deliver value to our customers with less risk, and this gives us a competitive advantage. Therefore we all want more throughput in our pipelines. If you have anything more complicated than a very simple one-build, one-test-suite, one-deploy pipeline, this can be a difficult thing to achieve.</p>
<p>We use a train metaphor for the pipelines involved in the shipping of our releases. Sure you can build more trains but that comes with complexity (to really drag the metaphor out, junctions and stations). A faster train is always going to bring benefits to your continuous delivery pipeline. Build faster trains.</p>
<p>To do this you need to measure how fast your releases are.</p>
<p>As an aside, I used to think that the number of releases performed per day was the best statistic to track. It is interesting but, to be honest, it’s basically bragging rights in a lot of cases.</p>
<p><strong>What you should care about is how fast you can release when you need to, not how many times you release per day/week/month</strong></p>
<h2 id="optimize-wisely-not-with-bias">Optimize Wisely, Not With Bias</h2>
<p>Most software engineers, me included, have their own bias about what they think is the worst, slowest, flakiest part of the release pipeline. This opinion comes not from observation and measurement but from scar tissue and technical preference. Resist the urge to optimize for what you think the problem is. Measure it and make informed decisions about where to spend your time.</p>
<h2 id="what-you-will-need">What you will need</h2>
<ul>
<li>A pen and paper</li>
</ul>
<p>or even better</p>
<ul>
<li>A whiteboard and marker</li>
</ul>
<p>Yeah, this isn’t really about tools; it can be, but it doesn’t have to be.</p>
<p>I’m a real believer in the power of diagramming, but I’m not talking about UML here. In fact I’m specifically talking about NOT UML. Boxes, lines and words are what you need. Patterns &amp; Practices, acronyms and specific terminology can be incredibly divisive. Diagrams are a room leveller; they bring everybody into the same conversation, losing the fewest participants along the way.</p>
<h2 id="getting-started">Getting Started</h2>
<p>Think about the start of your deployment pipeline and draw the first box. Don’t go back to requirements gathering or some nonsense like that. Start with a Pull Request for example, or a merge, or in our case, joining our ship-it train.</p>
<h2 id="draw-the-process">Draw the Process</h2>
<p>You may have a CI/CD tool that has pipelines, but I guarantee there are more things involved here, so draw it manually; it will free you from the constraints of the pipeline tool.</p>
<p>From there you want to start thinking about the stages that happen up until the point that the automation starts. There may be none if you started with a merge or there may be some human co-ordination involved.</p>
<p>This is key, <strong>you need to capture the human processes too</strong>.</p>
<p>Draw each state in the state machine. Connect them with lines to show the sequence. For us this looks like:</p>
<ul>
<li>Join</li>
<li>Roll Call</li>
<li>Merge</li>
</ul>
<p>When you come to the chain of builds, draw each build stage and try to represent the fan-out/fan-in of parallel tasks; this will become important later.</p>
<p>I chose to draw the stages as a vertical pipeline, then switched to horizontal for the builds.</p>
<p>Your process should end with the point at which you are happy with the release in production.</p>
<p><img src="/images/2018/pipeline1.png" alt="pipeline1" /></p>
<h2 id="measure-the-builds-steps-and-stages">Measure the Builds, Steps and Stages</h2>
<p>The next step is to add timings to the steps involved. I find it easiest to start with the builds.</p>
<p>Look at your builds and test runs and take a sample of timings for each type. Figure out what the median is and write it next to that build step or test run box in your pipeline drawing.</p>
<p>At this point you have some timings, and there are things we can infer and optimize (which you will see later), but resist the urge to concentrate on the automation. Often the biggest problems and most effective changes can be found elsewhere.</p>
<p>Note the deployment step times, for each environment. Some environments for us take longer because they have more machines.</p>
<p>Do the same with the manual steps and stages. In our pipeline we use a bot to orchestrate the pipeline stages; it co-ordinates the human workflow in a simple state machine by listening for prompts from the engineers involved in the release. The bot posts the current stage into Slack, and I use the timestamps of those Slack messages to write down some timings. If you only have normal human Slack conversations, try to determine the start of each stage from the timeline, or encourage folks to post the stage for a few days to get these numbers. Again, take the median and add it next to each stage or step.</p>
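<p>As a sketch, deriving stage durations from those message timestamps is just a diff between consecutive events; the timestamps and stage names below are made up:</p>

```python
from datetime import datetime
from statistics import median


def stage_minutes(events):
    """Minutes spent in each stage: time until the next stage's message."""
    times = [datetime.fromisoformat(ts) for ts, _ in events]
    return {
        stage: (times[i + 1] - times[i]).total_seconds() / 60
        for i, (_, stage) in enumerate(events[:-1])
    }


# One release's worth of bot messages (hypothetical data)
release = [
    ("2018-11-09T09:00:00", "Roll Call"),
    ("2018-11-09T09:12:00", "Merge"),
    ("2018-11-09T09:47:00", "Deploy Test"),
    ("2018-11-09T10:30:00", "Shipped"),
]
durations = stage_minutes(release)
# durations == {"Roll Call": 12.0, "Merge": 35.0, "Deploy Test": 43.0}

# Across many releases, write the per-stage median on the board
print(median([12.0, 9.0, 25.0]))  # 12.0
```

<p>The median matters more than the mean here: one pathological release shouldn’t drag your “normal” number around.</p>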
<h2 id="note-your-end-to-end-pipeline-time">Note your End to End Pipeline Time</h2>
<p>For us, I like to measure from Roll Call to the start of the next pipeline. To me, this is the time it takes us to ship a release.</p>
<p>Decide what your end-to-end pipeline measure is and take note of the time it takes. Improving this metric is your goal.</p>
<h2 id="track-why-some-releases-take-longer">Track Why Some Releases Take Longer</h2>
<p>Now that you have a timing for how long this normally takes, as an engineering team, start recording why it sometimes takes longer.</p>
<p>Some common examples are:</p>
<ul>
<li>Complex manual testing
<ul>
<li>The changes touched a lot of things so it needed more manual testing</li>
</ul>
</li>
<li>Re-work in the pipeline
<ul>
<li>Compile errors</li>
<li>Test failures</li>
<li>Reverts</li>
</ul>
</li>
<li>People orchestration
<ul>
<li>Key people were in meetings, out to lunch</li>
<li>I didn’t notice that I was up / required to do something</li>
<li>A failed build wasn’t noticed until some time later</li>
<li>A key person e.g. tester had too many things to do</li>
</ul>
</li>
</ul>
<h2 id="parallelize">Parallelize</h2>
<p>Look at your build steps and test suites with their timings. You can now see some optimizations where parallelism can be a deciding factor in what you do next.</p>
<h3 id="can-you-run-some-things-in-parallel">Can you run some things in parallel?</h3>
<p>Some steps can be easily parallelized. If you have 5 consecutive test runs, can you do them in parallel? Your CI tool can most likely orchestrate this for you.</p>
<p>We chose to run some of our tests in parallel with the deploy to our test environment. Eventually we even ran our unit tests in parallel with our deploy to test. We made this decision because we looked at the failure rate of unit tests. They didn’t fail. They had already been run and passed on the developer’s machine and then again on the Pull Request branches before merge so we knew they were good. Sure, a merge could create a problem but this was so rare it was worth the risk of re-work in the pipeline.</p>
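<p>The wall-clock effect of fanning out is easy to see in a toy sketch; the suite names and timings are hypothetical, and in practice your CI tool does the fan-out for you:</p>

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical suite timings in minutes, scaled down so the sketch runs fast
suites = {"unit": 5, "integration": 10, "browser": 8}


def run_suite(name):
    """Stand-in for a real test run."""
    time.sleep(suites[name] * 0.01)
    return name


start = time.monotonic()
with ThreadPoolExecutor(max_workers=len(suites)) as pool:
    finished = list(pool.map(run_suite, suites))
elapsed = time.monotonic() - start

# Wall clock is roughly the longest suite (10), not the sum (23) —
# which is also why shaving time off the 5-minute suite gains nothing.
print(sorted(finished))  # ['browser', 'integration', 'unit']
```

<p>The same arithmetic drives the “don’t make small things faster” point below: once suites run concurrently, only the longest path matters.</p>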
<h3 id="split-builds">Split Builds</h3>
<p>Look at the longest test suites. Can they be split into multiple parallel build steps / test runs?</p>
<h3 id="dont-spend-time-making-small-things-faster">Don’t Spend Time Making Small Things Faster</h3>
<p>Look at the timings. If Test Suite A takes 5 mins and Test Suite B runs in parallel and takes 10 minutes, don’t spend time on Test Suite A trying to make it faster, it won’t affect your end-to-end timings.</p>
<h2 id="look-for-human-wait-times">Look for Human Wait Times</h2>
<p>Sure, sometimes we are waiting on the computers to do build things or test things. Often, however, it is the co-ordination of the meat-bags (humans) that is the problem.</p>
<p>For example, we use a build bot modelled on <a href="http://pushtrain.club/">the Etsy train</a> but implemented in Slack. We call it <code class="language-plaintext highlighter-rouge">C3-PR</code> (PR for Pull Requests). One thing we found is that if we mention the people in the carriage we have a better chance of having them perform the tasks we need them to, like Merge, Deploy, etc. If you have no human involvement in your pipeline then I commend you, but most folks I talk to have some human involvement, at least in failure scenarios. These human factors can therefore be very important in realising maximum throughput in your pipeline.</p>
<h2 id="be-kind-to-your-people">Be Kind To Your People</h2>
<ul>
<li>Can your build tool notify people earlier that a test has failed and continue on with the rest or does it have to wait until the end of the test suite?</li>
<li>Could an Engineer have found the source of re-work (build / test failure) earlier on their machine or the Pull Request before it was merged into master? In other words <strong>Help your engineers solve their own problems earlier, before they become everyone else’s problem</strong></li>
</ul>
<h2 id="conclusion">Conclusion</h2>
<p>Hopefully this framework can help you optimize your own CI/CD pipelines. It has certainly helped me over the years when reasoning about where to spend time and why.</p>
<p><a href="https://blog.petegoo.com/2018/11/09/optimizing-ci-cd-pipelines/">Measuring and Improving your CI/CD Pipelines</a> was originally published by Peter Goodman at <a href="https://blog.petegoo.com">PeteGoo</a> on November 09, 2018.</p>https://blog.petegoo.com/2018/04/16/concourse-aws-lifecycle-hooks2018-04-16T00:00:00+12:002018-04-16T00:00:00+12:00Peter Goodmanhttps://blog.petegoo.comblog@petegoo.com<p>Lately we’ve been running <a href="https://concourse-ci.org/">Concourse CI</a> for a bunch of our builds.
We really love Concourse for the pipeline features, ease of configuration, and docker primitives.
However, operating and feeding Concourse can be a voyage of discovery and sometimes sadness.</p>
<p>One of the issues with Concourse is that it doesn’t really like it when workers disappear on it.
The workers will appear as <code class="language-plaintext highlighter-rouge">stalled</code> if you run <code class="language-plaintext highlighter-rouge">fly workers</code>. This means that any resources that
are performing <code class="language-plaintext highlighter-rouge">check</code> operations for new versions will be stuck and not trigger builds.
You then need to <code class="language-plaintext highlighter-rouge">prune-worker</code> if you want your builds to keep working.</p>
<p>This post aims to give you the basics of better worker lifecycle management, so you can simply roll the instances in your worker pool Auto-Scaling Group (ASG) when you want some fresh ones, without the annoyance of having to clear out those stalled workers.</p>
<h2 id="lifecycle-hook">Lifecycle Hook</h2>
<p>Hopefully you are running your Concourse workers in an Auto-Scaling Group. When your ASG removes
these instances nothing will tell Concourse that they are dead. To make this happen you need to
create an <a href="https://docs.aws.amazon.com/autoscaling/ec2/userguide/lifecycle-hooks.html">Auto-Scaling Lifecycle Hook</a>.</p>
<p><a href="https://docs.aws.amazon.com/autoscaling/ec2/userguide/lifecycle-hooks.html#adding-lifecycle-hooks">Create a lifecycle hook</a> for termination called <code class="language-plaintext highlighter-rouge">worker-terminating</code>.</p>
<p>Add the following script as a cron job that runs every minute or two.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="c"># Need this path to allow aws command to work</span>
<span class="nv">PATH</span><span class="o">=</span><span class="nv">$PATH</span>:/usr/local/bin
<span class="nv">instance_id</span><span class="o">=</span><span class="si">$(</span>curl <span class="nt">-s</span> http://169.254.169.254/latest/meta-data/instance-id/<span class="si">)</span>
<span class="nv">lifecycleState</span><span class="o">=</span><span class="si">$(</span>aws autoscaling describe-auto-scaling-instances <span class="nt">--instance-ids</span> <span class="nv">$instance_id</span> <span class="nt">--query</span> <span class="s1">'AutoScalingInstances[0].LifecycleState'</span> <span class="nt">--output</span> text <span class="nt">--region</span> us-west-2<span class="si">)</span>
<span class="k">if</span> <span class="o">[</span> <span class="s2">"</span><span class="nv">$lifecycleState</span><span class="s2">"</span> <span class="o">==</span> <span class="s2">"Terminating:Wait"</span> <span class="o">]</span><span class="p">;</span> <span class="k">then
</span><span class="nv">asg</span><span class="o">=</span><span class="si">$(</span>aws autoscaling describe-auto-scaling-instances <span class="nt">--instance-ids</span> <span class="nv">$instance_id</span> <span class="nt">--query</span> <span class="s1">'AutoScalingInstances[0].AutoScalingGroupName'</span> <span class="nt">--output</span> text <span class="nt">--region</span> us-west-2<span class="si">)</span>
<span class="c"># We store the TSA Host parameter</span>
<span class="nv">TSA_HOST</span><span class="o">=</span><span class="s2">"my.tsa.host"</span>
concourse retire-worker <span class="se">\</span>
<span class="nt">--name</span> <span class="si">$(</span><span class="nb">hostname</span><span class="si">)</span> <span class="se">\</span>
<span class="nt">--tsa-host</span> <span class="nv">$TSA_HOST</span> <span class="se">\</span>
<span class="nt">--tsa-public-key</span> /path/to/tsa-public-key <span class="se">\</span>
<span class="nt">--tsa-worker-private-key</span> /path/to/tsa-worker-private-key
<span class="c"># Sleep for 10 minutes to let the builds finish. I know, not ideal but it works for now</span>
<span class="nb">sleep </span>10m
service concourse-worker stop
aws autoscaling complete-lifecycle-action <span class="se">\</span>
<span class="nt">--instance-id</span> <span class="nv">$instance_id</span> <span class="se">\</span>
<span class="nt">--auto-scaling-group-name</span> <span class="nv">$asg</span> <span class="se">\</span>
<span class="nt">--lifecycle-hook-name</span> <span class="s2">"worker-terminating"</span> <span class="se">\</span>
<span class="nt">--lifecycle-action-result</span> <span class="s2">"CONTINUE"</span> <span class="se">\</span>
<span class="nt">--region</span> us-west-2
<span class="k">fi</span>
</code></pre></div></div>
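<p>The schedule itself can be a one-line crontab entry; the script path, interval, and log location here are illustrative:</p>

```shell
# /etc/cron.d/concourse-worker-lifecycle — run the check every 2 minutes
*/2 * * * * root /usr/local/bin/check-worker-lifecycle.sh >> /var/log/worker-lifecycle.log 2>&1
```

<p>Redirecting output to a log file makes it much easier to debug the retire/complete-lifecycle steps when a worker refuses to drain.</p>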
<p><a href="https://blog.petegoo.com/2018/04/16/concourse-aws-lifecycle-hooks/">Concourse on AWS: Worker lifecycle management</a> was originally published by Peter Goodman at <a href="https://blog.petegoo.com">PeteGoo</a> on April 16, 2018.</p>https://blog.petegoo.com/2016/05/10/packer-aws-windows2016-05-10T00:00:00+12:002016-05-10T00:00:00+12:00Peter Goodmanhttps://blog.petegoo.comblog@petegoo.com<p>Getting a Packer build to work with the AWS EBS builder is pretty easy. Getting it to work for Windows can be a series of less-than-obvious discoveries. I had issues trying to find a concise guide on how to get the various pieces working together, so here it is.</p>
<p><a href="https://github.com/PeteGoo/packer-win-aws">All code available here</a></p>
<h2 id="the-goal">The goal</h2>
<p>We want Packer to create an EC2 AMI using a powershell initialization script. To achieve this Packer will create a new EC2 instance, run our script and then take an image of it before terminating our builder instance. We need any communication with the builder instance to use https rather than http so there is something approaching secure communication (although here we will use a self-signed cert, created on the instance itself).</p>
<ul>
<li>Builder: amazon-ebs</li>
<li>Provisioner: powershell</li>
</ul>
<h2 id="using-the-amazon-ebs-builder">Using the amazon-ebs builder</h2>
<p>The amazon-ebs builder is actually pretty good. The configuration is well documented and the config will end up looking something like below:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"builders": [{
"type": "amazon-ebs",
"region": "us-east-1",
"source_ami": "ami-3d787d57",
"instance_type": "m3.medium",
"ami_name": "windows-ami"
}]
}
</code></pre></div></div>
<h2 id="winrm-and-the-infinite-sadness">WinRM and the infinite sadness</h2>
<p>The next issue is that we need to be able to add a provisioner so we can run some scripts on the new builder instance. On Linux boxes this is pretty standard, as SSH actually works. Unfortunately, on Windows, in order to run PowerShell remotely on the Packer builder instance we have to use PowerShell remoting, and that means WinRM.</p>
<p>WinRM was originally designed for a world that was built on WS-*, SOAP and Kerberos authentication in Windows domains. Hence it has been plagued by configuration woes since it was first introduced. Getting it to work for Packer over the internet can be a pain.</p>
<p>So let’s tell Packer to use winrm.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"builders": [{
"type": "amazon-ebs",
"region": "us-east-1",
"source_ami": "ami-3d787d57",
"instance_type": "m3.medium",
"ami_name": "windows-ami",
"user_data_file":"./ec2-userdata.ps1",
"communicator": "winrm",
"winrm_username": "Administrator"
}]
}
</code></pre></div></div>
<p>If you run this you will probably end up with the dreaded <code class="language-plaintext highlighter-rouge">waiting for winrm to become available</code> message from Packer that just sits there looking at you. This means that WinRM is not configured on the instance.</p>
<p>To resolve this problem we need to run a script on the builder instance to bootstrap WinRM. The way we tell an EC2 instance to run a script on first startup is the <a href="http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-metadata.html#instancedata-add-user-data">UserData</a> script. On Windows this script <a href="http://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/ec2-instance-metadata.html">can contain</a> a <code class="language-plaintext highlighter-rouge"><powershell></powershell></code> section.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><powershell>
write-output "Running User Data Script"
write-host "(host) Running User Data Script"
Set-ExecutionPolicy Unrestricted -Scope LocalMachine -Force -ErrorAction Ignore
# Don't set this before Set-ExecutionPolicy as it throws an error
$ErrorActionPreference = "stop"
# Remove HTTP listener
Remove-Item -Path WSMan:\Localhost\listener\listener* -Recurse
# WinRM
write-output "Setting up WinRM"
write-host "(host) setting up WinRM"
cmd.exe /c winrm quickconfig -q
cmd.exe /c winrm quickconfig '-transport:http'
cmd.exe /c winrm set "winrm/config" '@{MaxTimeoutms="1800000"}'
cmd.exe /c winrm set "winrm/config/winrs" '@{MaxMemoryPerShellMB="1024"}'
cmd.exe /c winrm set "winrm/config/service" '@{AllowUnencrypted="true"}'
cmd.exe /c winrm set "winrm/config/client" '@{AllowUnencrypted="true"}'
cmd.exe /c winrm set "winrm/config/service/auth" '@{Basic="true"}'
cmd.exe /c winrm set "winrm/config/client/auth" '@{Basic="true"}'
cmd.exe /c winrm set "winrm/config/service/auth" '@{CredSSP="true"}'
cmd.exe /c winrm set "winrm/config/listener?Address=*+Transport=HTTP" '@{Port="5985"}'
cmd.exe /c netsh advfirewall firewall set rule group="remote administration" new enable=yes
cmd.exe /c netsh firewall add portopening TCP 5985 "Port 5985"
cmd.exe /c net stop winrm
cmd.exe /c sc config winrm start= auto
cmd.exe /c net start winrm
cmd.exe /c wmic useraccount where "name='vagrant'" set PasswordExpires=FALSE
</powershell>
</code></pre></div></div>
<p>We can now try to run the packer build</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>packer build template.json
</code></pre></div></div>
<h2 id="but-winrm-still-cant-connect">But WinRM still can’t connect?</h2>
<p>If you still get the <code class="language-plaintext highlighter-rouge">waiting for winrm to become available</code> message and it doesn’t progress after a few minutes, then something may have gone wrong in the above script. To diagnose the issue, run Packer with the debug flag.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>packer build -debug template.json
</code></pre></div></div>
<p>Grab the Administrator login from the Packer output; you will need it. Then add an inbound RDP rule on the Packer build instance’s security group so you can RDP to it. Look for the log at <code class="language-plaintext highlighter-rouge">C:\Program Files\Amazon\Ec2ConfigService\Logs\Ec2ConfigLog.txt</code>. You may need to add logging to the above script to figure out what is going wrong.</p>
<h2 id="but-the-security-man">But the security man!</h2>
<p>OK, so this script works, but the communication is over plain HTTP, which is a little less than ideal. To make this HTTPS we can generate a new certificate on the machine and use that. We switch the port to <code class="language-plaintext highlighter-rouge">5986</code> and tell WinRM we are using HTTPS.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><powershell>
write-output "Running User Data Script"
write-host "(host) Running User Data Script"
Set-ExecutionPolicy Unrestricted -Scope LocalMachine -Force -ErrorAction Ignore
# Don't set this before Set-ExecutionPolicy as it throws an error
$ErrorActionPreference = "stop"
# Remove HTTP listener
Remove-Item -Path WSMan:\Localhost\listener\listener* -Recurse
$Cert = New-SelfSignedCertificate -CertstoreLocation Cert:\LocalMachine\My -DnsName "packer"
New-Item -Path WSMan:\LocalHost\Listener -Transport HTTPS -Address * -CertificateThumbPrint $Cert.Thumbprint -Force
# WinRM
write-output "Setting up WinRM"
write-host "(host) setting up WinRM"
cmd.exe /c winrm quickconfig -q
cmd.exe /c winrm set "winrm/config" '@{MaxTimeoutms="1800000"}'
cmd.exe /c winrm set "winrm/config/winrs" '@{MaxMemoryPerShellMB="1024"}'
cmd.exe /c winrm set "winrm/config/service" '@{AllowUnencrypted="true"}'
cmd.exe /c winrm set "winrm/config/client" '@{AllowUnencrypted="true"}'
cmd.exe /c winrm set "winrm/config/service/auth" '@{Basic="true"}'
cmd.exe /c winrm set "winrm/config/client/auth" '@{Basic="true"}'
cmd.exe /c winrm set "winrm/config/service/auth" '@{CredSSP="true"}'
cmd.exe /c winrm set "winrm/config/listener?Address=*+Transport=HTTPS" "@{Port=`"5986`";Hostname=`"packer`";CertificateThumbprint=`"$($Cert.Thumbprint)`"}"
cmd.exe /c netsh advfirewall firewall set rule group="remote administration" new enable=yes
cmd.exe /c netsh firewall add portopening TCP 5986 "Port 5986"
cmd.exe /c net stop winrm
cmd.exe /c sc config winrm start= auto
cmd.exe /c net start winrm
</powershell>
</code></pre></div></div>
<h2 id="adding-the-provisioner">Adding the provisioner</h2>
<p>Finally we can add our provisioner to our template.json.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"builders": [{
"type": "amazon-ebs",
"region": "us-east-1",
"source_ami": "ami-3d787d57",
"instance_type": "m3.medium",
"ami_name": "windows-ami",
"user_data_file":"./ec2-userdata.ps1",
"communicator": "winrm",
"winrm_username": "Administrator",
"winrm_use_ssl": true,
"winrm_insecure": true
}],
"provisioners": [
{
"type": "powershell",
"script": "init.ps1"
}
]
}
</code></pre></div></div>
<p>Notice that we are now specifying <code class="language-plaintext highlighter-rouge">winrm_use_ssl</code>. The inclusion of <code class="language-plaintext highlighter-rouge">winrm_insecure</code> means that the Packer client will not verify the certificate chain, which would obviously fail for our self-signed certificate.</p>
<p>We can now add whatever setup we need into our init.ps1 script which will run over our (slightly more) secure WinRM connection.</p>
<p>The entire repo for this sample can be found at <a href="https://github.com/PeteGoo/packer-win-aws">https://github.com/PeteGoo/packer-win-aws</a>.</p>
<p><a href="https://blog.petegoo.com/2016/05/10/packer-aws-windows/">Getting Packer to work for Windows on AWS</a> was originally published by Peter Goodman at <a href="https://blog.petegoo.com">PeteGoo</a> on May 10, 2016.</p>https://blog.petegoo.com/2016/05/10/codemania-talk2016-05-10T00:00:00+12:002016-05-10T00:00:00+12:00Peter Goodmanhttps://blog.petegoo.comblog@petegoo.com<p>Speaking at <a href="http://codemania.io">codemania.io</a> was one of the scariest and most awesome experiences of my career to date.</p>
<p>The concept of the talk was that the secret of going fast, safely, is to raise the visibility of the system you develop and operate.
Below are the slides from that talk.</p>
<script async="" class="speakerdeck-embed" data-id="7bdaee9c5eb4432cad9002bcb082adc8" data-ratio="1.77777777777778" src="//speakerdeck.com/assets/embed.js"></script>
<h1 id="links">Links</h1>
<ul>
<li><a href="https://www.youtube.com/watch?v=LdOe18KhtT4">10+ Deploys Per Day: Dev and Ops Cooperation at Flickr</a></li>
<li><a href="http://www.jedi.be/blog/2010/02/12/what-is-this-devops-thing-anyway/">What is this devops thing, anyway?</a></li>
<li><a href="https://www.youtube.com/watch?v=czes-oa0yik">Metrics, metrics, everywhere - Coda Hale</a></li>
<li><a href="https://channel9.msdn.com/Shows/DevOps-Dimension/6--Blameless-Postmortems-with-PushPay">Blameless post-mortems at Pushpay</a></li>
<li><a href="https://github.com/lukevenediger/statsd.net">Statsd.Net</a></li>
<li><a href="http://librato.com">Librato</a></li>
</ul>
<p><a href="https://blog.petegoo.com/2016/05/10/codemania-talk/">Slides from my talk at codemania</a> was originally published by Peter Goodman at <a href="https://blog.petegoo.com">PeteGoo</a> on May 10, 2016.</p>https://blog.petegoo.com/2015/03/15/devops-talk2015-03-15T00:00:00+13:002015-03-15T00:00:00+13:00Peter Goodmanhttps://blog.petegoo.comblog@petegoo.com<p>On February 19th I <a href="http://www.meetup.com/AKL-NET/events/220001017/">gave a talk at the Auckland.Net meetup</a> titled “Devops for the .Net Developer”. The idea of this talk was to present the context that gave rise to the DevOps movement, outlining it’s drivers, principles and guiding practices and then frame all of this in terms that apply to the average .Net development shop. In other words, sharing a lot of the knowledge I have gained over the last year working in the DevOps space.</p>
<p>I think it was pretty well received. Below are the slides from that talk.</p>
<script async="" class="speakerdeck-embed" data-id="d0803340e6284af793894f9b18a2d42e" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script>
<h1 id="links">Links</h1>
<ul>
<li><a href="https://www.youtube.com/watch?v=LdOe18KhtT4">10+ Deploys Per Day: Dev and Ops Cooperation at Flickr</a></li>
<li><a href="https://github.com/PeteGoo/tcSlackBuildNotifier">Slack Build Notifier</a></li>
<li><a href="http://librato.com">Librato</a></li>
<li><a href="https://github.com/lukevenediger/statsd.net">Statsd.Net</a></li>
<li><a href="https://github.com/peschuster/graphite-client">Graphite Client (perf counters over statsd)</a></li>
<li><a href="https://dpxdt-test.appspot.com/">Depicted visual diffing</a></li>
</ul>
<p><a href="https://blog.petegoo.com/2015/03/15/devops-talk/">Slides from 'Devops for the .Net developer'</a> was originally published by Peter Goodman at <a href="https://blog.petegoo.com">PeteGoo</a> on March 15, 2015.</p>https://blog.petegoo.com/2015/03/14/teamcity-github2015-03-14T00:00:00+13:002015-03-14T00:00:00+13:00Peter Goodmanhttps://blog.petegoo.comblog@petegoo.com<p>For a while now I’ve been using the GitHub + TeamCity + Slack combination and I thought it would be useful to write down the various tactics and tools for getting the most out of this pretty common configuration of tools.</p>
<p>You don’t need to be using all these tools and services to get something out of these posts but the combination of all three can be pretty powerful.</p>
<p>This post will concentrate on getting the most out of TeamCity + GitHub.</p>
<p>The first thing you probably have already done is to configure GitHub as a “VCS Root” in TeamCity. If you haven’t, then <a href="https://confluence.jetbrains.com/display/TCD8/Git+%28JetBrains%29">follow the instructions</a> to get it set up. Note that you can now <a href="https://www.youtube.com/watch?v=_FzdCC9imDs">create a project from a URL</a>; this is often the easiest way to set up a new project around a GitHub repository.</p>
<h2 id="matching-teamcity-users-with-github-users">Matching TeamCity users with GitHub users</h2>
<p>Obviously GitHub and TeamCity have their own lists of users. With GitHub and other git services like BitBucket it is important to understand that your GitHub user account is not automatically stamped against each commit you make. In fact, it is up to you to make sure that the correct name and email are configured in the git clone on your machine so that your commits are correctly attributed to you. GitHub will then do its best to show your avatar against your commits and track your stats by looking at your commits. If, however, this information is not configured correctly, there is no easy way to fix the attribution on existing commits without rewriting history.</p>
<p>So the first step is to <a href="https://help.github.com/articles/setting-your-username-in-git/">correctly configure</a> your user.name and user.email git configuration.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git config user.name "Joe Bloggs"
git config user.email "joe.bloggs@example.com"
</code></pre></div></div>
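As an aside, those two commands simply write plain entries into the repo’s .git/config file (or ~/.gitconfig when run with --global), and that file is all git consults when stamping an author on a commit. The sketch below illustrates this with Python’s configparser against a throwaway folder — an approximation only, since git’s config format is merely INI-like and in practice you should always let git write this file:

```python
import configparser
import os
import tempfile

# Fake repo folder; in reality git creates .git and manages this file itself.
repo = tempfile.mkdtemp()
os.makedirs(os.path.join(repo, ".git"))
config_path = os.path.join(repo, ".git", "config")

# Equivalent of: git config user.name / git config user.email
config = configparser.ConfigParser()
config["user"] = {"name": "Joe Bloggs", "email": "joe.bloggs@example.com"}
with open(config_path, "w") as f:
    config.write(f)

# Read it back, roughly as git does when attributing a commit.
check = configparser.ConfigParser()
check.read(config_path)
author = "%s <%s>" % (check["user"]["name"], check["user"]["email"])
print(author)
```

The point is that this identity lives entirely in local config — nothing ties it to your GitHub account except the matching email address.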
<p>The next thing to do is to make sure that TeamCity is configured to correctly match this information with your TeamCity user. In the Advanced VCS Root settings you will see the following section:</p>
<p><img src="/images/2015/03/teamcity.usernamestyle.png" alt="Username style settings in VCS Roots" /></p>
<p>In my experience the best thing to do is to leave this setting at the default (UserId). This will take the first part of the email configured above and use that to match against TeamCity usernames. This results in the most predictable behaviour, as people tend to have various names configured but the email address will usually be quite consistent. We then match our TeamCity usernames with our company email address names.</p>
<p>If the usernames don’t match you can go to your user profile in TeamCity and customise the username that will be associated with all VCS roots, a specific VCS Root or even all Git VCS Roots.</p>
<p>The only place the above falls down is when you also use GitHub for personal projects and you end up committing with multiple different email addresses by accident because you have a global default set. Unfortunately TeamCity doesn’t allow you to set up multiple alternative usernames, so some of your commits won’t get matched. Hopefully this will be resolved at some point in TeamCity.</p>
<h2 id="reporting-build-status-to-github">Reporting build status to GitHub</h2>
<p>One of the coolest features in GitHub is the ability to have your build process report progress to GitHub. The result of this is that your branches, commits and pull requests will be marked as pending, failed or succeeded. This really comes into its own with Pull Requests.</p>
<p><img src="/images/2015/03/branches-with-status.png" alt="Branches view with build status" /></p>
<p><img src="/images/2015/03/pr-with-build-status.png" alt="Pull Request view with build status" /></p>
<p>To enable TeamCity to be able to tell GitHub about the build status you need to download and install the <a href="https://github.com/jonnyzzz/TeamCity.GitHub">TeamCity.GitHub plugin</a>.</p>
<p>Note that you can upload plugin .zip files to the plugins folder using the administration pages on TeamCity; just remember to restart the service for the change to take effect. Also note that for the pull request part to work you will need to make sure you are building branches and PRs as required (see below).</p>
<h2 id="building-branches-and-pull-requests">Building branches and pull requests</h2>
<p>By default TeamCity will probably only be building <code class="language-plaintext highlighter-rouge">master</code>. To enable other branches to get built you will have to also add a <a href="https://confluence.jetbrains.com/display/TCD8/Working+with+Feature+Branches#WorkingwithFeatureBranches-Configuringbranches">branch specification on the VCS Root settings</a>.</p>
<p><img src="/images/2015/03/branch-specification.png" alt="Branch specification" /></p>
<p>The branch specification syntax takes a little getting used to but here are some useful examples. Note that each one should appear on a separate line.</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">+:<default></code> - include master</li>
<li><code class="language-plaintext highlighter-rouge">+:refs/heads/(*)</code> - include all branches</li>
<li><code class="language-plaintext highlighter-rouge">-:refs/heads/(spikes-*)</code> - exclude any branches that start with <code class="language-plaintext highlighter-rouge">spikes-</code></li>
<li><code class="language-plaintext highlighter-rouge">+:refs/pull/(*)/head</code> - include all pull requests</li>
<li><code class="language-plaintext highlighter-rouge">+:refs/pull/(*)/merge</code> - include the merge result of pull requests (see below)</li>
</ul>
<p>The parentheses <code class="language-plaintext highlighter-rouge">()</code> allow you to specify the part of the branch syntax that will be used as the branch name in the TeamCity UI.</p>
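Putting those examples together, a complete branch specification covering the default branch, feature branches (minus spikes) and pull request heads might look like the following — an illustrative combination rather than a recommended setup, with one rule per line:

```plaintext
+:<default>
+:refs/heads/(*)
-:refs/heads/(spikes-*)
+:refs/pull/(*)/head
```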
<p><a href="http://blog.jetbrains.com/teamcity/2013/02/automatically-building-pull-requests-from-github-with-teamcity/">Building the merge result of pull requests</a> with <code class="language-plaintext highlighter-rouge">refs/pull/(*)/merge</code> is a pretty cool idea. Basically it means that when GitHub detects that the potential merge result of a pull request would change, a build will trigger that not only looks at the PR but attempts to merge it into the parent branch, as if someone had pressed the green <code class="language-plaintext highlighter-rouge">merge</code> button in GitHub, before building the code. This seems cool, but there are a number of problems, mostly that the builds will be triggered ALL THE TIME and your build queue gets swamped with all your PRs building. For example, when someone merely looks at a PR on github.com <a href="https://twitter.com/bradwilson/status/574702084370509824">it will trigger a new build</a> if GitHub detects that something could change in the merge result; we found that as people were skimming over PRs on github.com our TeamCity server got completely swamped. Therefore we don’t use this feature; instead we just don’t keep long-running feature branches.</p>
<blockquote class="twitter-tweet" lang="en"><p>I like that I can get <a href="https://twitter.com/teamcity">@teamcity</a> to auto-build PRs from <a href="https://twitter.com/github">@github</a>. I really hate that looking at a PR on <a href="https://twitter.com/github">@github</a> causes another build.</p>— Brad Wilson (@bradwilson) <a href="https://twitter.com/bradwilson/status/574702084370509824">March 8, 2015</a></blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Note that if you set up a VCS Trigger to initiate your builds when someone has pushed code to GitHub, then you can also specify the same branch syntax in the trigger branch filter field.</p>
<h2 id="triggering-new-builds-when-someone-pushed-code">Triggering new builds when someone pushes code</h2>
<p>By default TeamCity can be configured with a VCS trigger that polls the git repository looking for changes. The catch, of course, is that after you push your code you have to wait for the next poll interval before anything happens.</p>
<p>If your TeamCity server can be reached on the open internet then you can ask GitHub.com to tell TeamCity that changes have been made the instant someone pushes code to GitHub. To do this, go to the Settings of your repository, then add the TeamCity service from the WebHooks and Services panel. It may require a username and password unless you have guest access enabled.</p>
<h1 id="using-a-mac-os-teamcity-agent-with-a-windows-teamcity-server">Using a Mac OS TeamCity agent with a Windows TeamCity Server</h1>
<p>This is more of a warning around a very specific set of circumstances. If the following is true:</p>
<ul>
<li>You have a Windows TeamCity server</li>
<li>You set up a Mac OS X TeamCity agent (e.g. to run iOS builds)</li>
<li>Your repo has symlinks in it (like in cucumber / calabash tests)</li>
</ul>
<p>The JGit client used in TeamCity can be a royal pain-in-the-ass sometimes. In the above scenario it will turn those symlinks into useless empty files that freak your build out. You will need to change the VCS settings to “checkout on agent” instead of “checkout on server”, so that the Windows server is no longer trying to send the file changes to a Mac OS X agent and failing horribly.</p>
<h1 id="links">Links</h1>
<ul>
<li><a href="https://confluence.jetbrains.com/display/TCD8/Git+%28JetBrains%29">Git in TeamCity documentation</a></li>
<li><a href="https://www.youtube.com/watch?v=_FzdCC9imDs">Creating a TeamCity project from a Git Url</a></li>
<li><a href="https://help.github.com/articles/setting-your-username-in-git/">Setting up your username in git</a></li>
<li><a href="https://github.com/jonnyzzz/TeamCity.GitHub">TeamCity plugin for GitHub Status display</a></li>
<li><a href="https://confluence.jetbrains.com/display/TCD8/Working+with+Feature+Branches#WorkingwithFeatureBranches-Configuringbranches">Working with feature branches in TeamCity</a></li>
<li><a href="http://blog.jetbrains.com/teamcity/2013/02/automatically-building-pull-requests-from-github-with-teamcity/">Building Pull Requests with TeamCity</a></li>
</ul>
<p><a href="https://blog.petegoo.com/2015/03/14/teamcity-github/">Tips and tricks for integrating GitHub + TeamCity</a> was originally published by Peter Goodman at <a href="https://blog.petegoo.com">PeteGoo</a> on March 14, 2015.</p>https://blog.petegoo.com/2014/07/13/teamcity-slack-build-notifier2014-07-13T00:00:00+12:002014-07-13T00:00:00+12:00Peter Goodmanhttps://blog.petegoo.comblog@petegoo.com<p>We have become immense fans of <a href="http://slack.com">Slack</a> in our office, as have a lot of people I know in the software development industry. If you haven’t heard of Slack, well it’s basically a chat system for businesses, much like HipChat or Campfire. The difference is that Slack seems to bring a healthy dose of cool to everything they do; they are iterating incredibly fast right now and seem to be hitting all the right notes.</p>
<p>Ok, so once you have Slack up and running, you turn to integrations with other systems that are going to maximise the true <a href="https://www.youtube.com/watch?v=NST3u-GjjFw">ChatOps</a> experience. There are already integrations with JIRA, New Relic, GitHub, Twitter and just about every other thing you can think of.</p>
<p>I started to look for a way to have TeamCity notify build results directly into Slack channels (rooms) and found that there were a few options I could have gone with. I could of course use my own chat bot project <a href="https://github.com/mmbot">mmbot</a> to do the notifications for me, but I would either need to poll TeamCity continuously or use a webhooks plugin. There is a very good <a href="https://netwolfuk.wordpress.com/category/teamcity/tcplugins/tcwebhooks/">webhooks plugin available already</a> for TeamCity; the only thing is it doesn’t support commit messages / users, and it would bring in a chain of communication that is not strictly necessary. Nope, I wanted a plugin for TeamCity that would report directly to Slack.</p>
<p>There is a <a href="https://github.com/Tapadoo/TCSlackNotifierPlugin">Slack plugin already for TeamCity</a> but I wasn’t too keen on the way the notifications looked or the reliance on XML configuration to set it up for each project.</p>
<p>So I decided to take the plunge, learn some Java and setup a new plugin to report build status directly from TeamCity into a Slack room. Initially I started by taking a lot of inspiration from the <a href="https://netwolfuk.wordpress.com/category/teamcity/tcplugins/tcwebhooks/">tcWebHooks plugin</a> I mentioned above. I really liked the configuration experience for this plugin and wanted that experience for my users.</p>
<p>I ended up using <a href="http://www.jetbrains.com/idea/">IntelliJ IDEA from JetBrains</a>, this was by far the easiest IDE to setup although for a n00b it was still really painful in java-land. I’m not sure how much of this was pre-conception vs freaky hard configuration in the JDK etc. The build system is Maven and everything else is largely simple stuff.</p>
<p><img src="/images/2014/07/2014-07-13 21_58_13-build-status_pass.png" alt="Pass" />
<img src="/images/2014/07/2014-07-13 21_58_13-build-status_fail.png" alt="Fail" />
<img src="/images/2014/07/2014-07-13 21_58_13-build-slack-config.png" alt="Configuration" /></p>
<p>As usual the <a href="https://github.com/PeteGoo/tcSlackBuildNotifier">code is on GitHub at PeteGoo/tcSlackBuildNotifier</a>.</p>
<p><a href="https://blog.petegoo.com/2014/07/13/teamcity-slack-build-notifier/">TeamCity Slack Build Notifier</a> was originally published by Peter Goodman at <a href="https://blog.petegoo.com">PeteGoo</a> on July 13, 2014.</p>https://blog.petegoo.com/2014/07/05/project-dependency-viewer2014-07-05T00:00:00+12:002014-07-05T00:00:00+12:00Peter Goodmanhttps://blog.petegoo.comblog@petegoo.com<p>Recently I faced an issue trying to get my head around a large codebase consisting of multiple solutions and many, many projects. The difficulty was in trying to understand the interdependencies between these projects, especially the ones that are in different solutions. There are tools to do this. NDepend has some neat stuff, Visual Studio Ultimate Edition can do some things and there are others. For my simple scenario, though, I couldn’t justify the licensing cost.</p>
<p>Luckily I knew that Visual Studio supports the <a href="http://en.wikipedia.org/wiki/DGML">DGML file format</a> in all editions. DGML is essentially a file format where you specify a number of nodes and then links between them as below.</p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp"><?xml version='1.0' encoding='utf-8'?></span>
<span class="nt"><DirectedGraph</span> <span class="na">xmlns=</span><span class="s">"http://schemas.microsoft.com/vs/2009/dgml"</span><span class="nt">></span>
<span class="nt"><Nodes></span>
<span class="nt">&lt;Node</span> <span class="na">Id=</span><span class="s">"1"</span> <span class="na">Label=</span><span class="s">"MyCompany.Core"</span> <span class="nt">/&gt;</span>
<span class="nt">&lt;Node</span> <span class="na">Id=</span><span class="s">"2"</span> <span class="na">Label=</span><span class="s">"MyCompany.Area1.Service"</span> <span class="nt">/&gt;</span>
<span class="nt">&lt;Node</span> <span class="na">Id=</span><span class="s">"3"</span> <span class="na">Label=</span><span class="s">"MyCompany.Area2.Service"</span> <span class="nt">/&gt;</span>
<span class="nt">&lt;/Nodes&gt;</span>
<span class="nt">&lt;Links&gt;</span>
<span class="nt">&lt;Link</span> <span class="na">Source=</span><span class="s">"2"</span> <span class="na">Target=</span><span class="s">"1"</span> <span class="nt">/&gt;</span>
<span class="nt">&lt;Link</span> <span class="na">Source=</span><span class="s">"3"</span> <span class="na">Target=</span><span class="s">"1"</span> <span class="nt">/&gt;</span>
<span class="nt"></Links></span>
<span class="nt"></DirectedGraph></span>
</code></pre></div></div>
<p>Open this file in Visual Studio and you get a nice designer view where you can arrange and change things as you require.</p>
<p><img src="/images/2014/07/2014-07-05 12_31_39-test.dgml.png" alt="dgml file in Visual Studio" /></p>
<p>So I created a simple tool that will generate a diagram like this for you when you give it a folder. It will search the child folders for any csproj, vcxproj and vbproj files, calculate their references and give you the relevant DGML file for you to analyse your dependencies. Simple stuff really.</p>
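For illustration, the core of such a tool can be sketched in a few lines of Python — this is not the actual tool (which is a .Net app), the function and file names are hypothetical, and only csproj scanning is shown:

```python
import glob
import os
import tempfile
import xml.etree.ElementTree as ET

MSBUILD_NS = "{http://schemas.microsoft.com/developer/msbuild/2003}"

def build_dgml(root_folder):
    """Scan for .csproj files and emit a DGML dependency graph as a string."""
    projects = sorted(glob.glob(os.path.join(root_folder, "**", "*.csproj"),
                                recursive=True))
    ids = {path: str(i + 1) for i, path in enumerate(projects)}
    graph = ET.Element("DirectedGraph",
                       xmlns="http://schemas.microsoft.com/vs/2009/dgml")
    nodes = ET.SubElement(graph, "Nodes")
    links = ET.SubElement(graph, "Links")
    for path, node_id in ids.items():
        label = os.path.splitext(os.path.basename(path))[0]
        ET.SubElement(nodes, "Node", Id=node_id, Label=label)
    for path, node_id in ids.items():
        for ref in ET.parse(path).iter(MSBUILD_NS + "ProjectReference"):
            # ProjectReference paths are relative (and Windows-style) in csproj.
            rel = ref.get("Include", "").replace("\\", os.sep)
            target = os.path.normpath(os.path.join(os.path.dirname(path), rel))
            if target in ids:
                ET.SubElement(links, "Link", Source=node_id, Target=ids[target])
    return ET.tostring(graph, encoding="unicode")

# Tiny demo: App references Core, so the graph should contain one link.
demo = tempfile.mkdtemp()
with open(os.path.join(demo, "Core.csproj"), "w") as f:
    f.write('<Project xmlns="http://schemas.microsoft.com/developer/msbuild/2003" />')
with open(os.path.join(demo, "App.csproj"), "w") as f:
    f.write('<Project xmlns="http://schemas.microsoft.com/developer/msbuild/2003">'
            '<ItemGroup><ProjectReference Include="Core.csproj" /></ItemGroup>'
            '</Project>')
dgml = build_dgml(demo)
print(dgml)
```

Save the printed XML with a .dgml extension and Visual Studio will open it in the same designer view shown above.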
<p>The repo and binaries are on github at <a href="https://github.com/PeteGoo/ProjectDependencyVisualiser/">PeteGoo/ProjectDependencyVisualiser</a>. Enjoy.</p>
<p><a href="https://blog.petegoo.com/2014/07/05/project-dependency-viewer/">A simple project dependency viewer</a> was originally published by Peter Goodman at <a href="https://blog.petegoo.com">PeteGoo</a> on July 05, 2014.</p>https://blog.petegoo.com/2014/04/27/moved-blog-to-jekyll2014-04-27T00:00:00+12:002014-04-27T00:00:00+12:00Peter Goodmanhttps://blog.petegoo.comblog@petegoo.com<p><img src="/images/github.pages.jpg" alt="GitHub Pages" /></p>
<p>Like most of the bloggers on the internet these days I’ve moved my blog off wordpress and onto GitHub pages using Jekyll.</p>
<p>There was a fairly large amount of <a href="http://www.hanselman.com/blog/YakShavingDefinedIllGetThatDoneAsSoonAsIShaveThisYak.aspx">Yak Shaving</a> involved in this process. I’m not going to do a tutorial on how to move from Wordpress to Pages/Jekyll, you can find plenty of info on how to do that on the links below. I will however point out some of the things that threw me.</p>
<ul>
<li><a href="http://hadihariri.com/2013/12/24/migrating-from-wordpress-to-jekyll/">This is a good tutorial</a> by <a href="https://twitter.com/hhariri">Hadi Hariri</a></li>
<li>Follow the <a href="http://jekyllrb.com/docs/migrations/">Jekyll migrations site</a> for instructions on exporting the Wordpress XML. (The Wordpress plugin didn’t work for me)</li>
<li>Learn that Windows is a second class citizen in this toolset and you are going to have to shave some Yaks.</li>
<li>Make sure you <a href="https://help.github.com/articles/using-jekyll-with-pages">install Ruby, Jekyll, Bundler</a></li>
<li>Set up your GitHub pages repo.</li>
<li>Choose a theme and pull it into your repo. Decide whether you are going to fork it or just pull it into your existing repo.</li>
<li>When importing, ignore <a href="http://import.jekyllrb.com/docs/wordpressdotcom/">the Jekyll import site bash script</a> and just use <code class="language-plaintext highlighter-rouge">jekyll import wordpressdotcom --source wordpress.xml</code> instead</li>
<li>Make sure to <a href="https://github.com/PeteGoo/petegoo.github.io/commit/2f52eb963ad0ddc76242586c677bcaf300e72fa1">add the correct encoding</a> for your site if you are on Windows.</li>
<li>Watch out for the problem with “{{” characters in your xaml. <a href="http://jekyllrb.com/docs/troubleshooting/">Broken by Liquid 2.0</a>. <a href="https://github.com/PeteGoo/petegoo.github.io/commit/36a553e74edbc18196a2d5989f00c594fe6bd010">Fixable by escaping</a>.</li>
<li>Remember to keep changing the url in the _config.yml to suit your current deployment or strange things might happen.</li>
<li>DO NOT waste time trying to get <a href="http://jekyllrb.com/docs/configuration/">FrontMatter defaults</a> working. I couldn’t and gave up. Wasted sooooo much time on this. Instead I just changed the template to always switch on Disqus comments on posts.</li>
<li>Disqus support admitted that they currently have a problem with their import, hence my old comments are not there yet.</li>
<li>Learn to get permalinks right and use the <a href="https://github.com/jekyll/jekyll-redirect-from">jekyll-redirect-from plugin</a> which GitHub pages supports.</li>
<li>Worst of all is the RSS feed. My previous blog’s feed was at <code class="language-plaintext highlighter-rouge">/index.php/feed/</code> while the default templates for Jekyll put it at <code class="language-plaintext highlighter-rouge">/feed.xml</code>. This wouldn’t be too hard, but the redirect plugin for Jekyll uses html-based redirects rather than proper 301s, and Feedly only likes 301s. Add to that the fact that GitHub Pages doesn’t do <code class="language-plaintext highlighter-rouge">.htaccess</code> files and you have a recipe for disaster. Luckily there is a hack: create an <em>index.php</em> folder containing a <em>feed</em> folder, and inside that put a copy of the <em>feed.xml</em> file renamed to <em>index.html</em>. Although the content-type of the feed response is now <em>text/html</em>, it seems to work none-the-less.</li>
</ul>
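The feed workaround in that last bullet can be sketched as follows — a hedged illustration against a throwaway folder standing in for the site root; in practice you would commit the copied file into the Jekyll repo itself:

```python
import os
import shutil
import tempfile

site = tempfile.mkdtemp()  # stand-in for the site root

# The Jekyll template writes the feed to /feed.xml ...
with open(os.path.join(site, "feed.xml"), "w") as f:
    f.write('<rss version="2.0"><channel><title>PeteGoo</title></channel></rss>')

# ... so mirror it at /index.php/feed/index.html, the path the old
# Wordpress feed URL pointed to.
legacy_dir = os.path.join(site, "index.php", "feed")
os.makedirs(legacy_dir, exist_ok=True)
shutil.copyfile(os.path.join(site, "feed.xml"),
                os.path.join(legacy_dir, "index.html"))

mirrored = open(os.path.join(legacy_dir, "index.html")).read()
print(mirrored)
```

Old subscribers keep fetching /index.php/feed/ and get the same XML, just served as text/html.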
<p>Good Luck!</p>
<p><a href="https://blog.petegoo.com/2014/04/27/moved-blog-to-jekyll/">Moved blog to GitHub Pages and Jekyll</a> was originally published by Peter Goodman at <a href="https://blog.petegoo.com">PeteGoo</a> on April 27, 2014.</p>https://blog.petegoo.com/2013/10/13/introducing-mmbot-a-c-hubot-port2013-10-13T00:00:00+13:002013-10-13T00:00:00+13:00Peter Goodmanhttps://blog.petegoo.comblog@petegoo.com<p>I’ve been playing recently with GitHub’s <a href="http://github.com/github/hubot" target="_blank">Hubot</a> chat bot, written in CoffeeScript on Node. I wanted to connect it to our <a href="http://about.jabbr.net" target="_blank">jabbr</a> instance with the least amount of friction and to provide our team with an office jukebox, build automation and various meme-oriented distractions. There is a <a href="https://github.com/smoak/hubot-jabbr" target="_blank">pretty good adapter</a> for Hubot to do this, but after having a few issues I decided it might be easier to do a port to C# so I could more easily write scripts and connect to jabbr (which is written in C# and has some ready made APIs).</p>
<p>The result is <a href="http://github.com/petegoo/mmbot" target="_blank">mmbot</a>, a hubot port that follows the same basic architecture of Hubot with minimal changes. The basic goals of mmbot are as follows:</p>
<ol>
<li><strong>Provide a chat bot written in C# with all the functionality of Hubot but with a script environment more familiar to .Net devs.</strong></li>
<li><strong>Hubot scripts should be easy to convert into mmbot scripts.</strong><br />This may mean that there are some weird overloads in the API for writing scripts that look like hubot scripts, but it should still be very usable, customizable and familiar to .Net devs.</li>
<li><strong>ScriptCS style scripts should be automatically picked up and run from a scripts folder.</strong><br />There are some blockers here currently in the NuGet package resolution and <a href="https://github.com/scriptcs/scriptcs/issues/243">dynamic loading of scripts</a>.</li>
</ol>
<p>Currently there are 2 adapters for mmbot - jabbr and HipChat. The jabbr adapter is the most used at the moment but the HipChat one should also be working.</p>
<p>Scripts can be written in code by implementing IMMBotScript or by dropping a scriptcs csx file into a scripts folder beside the executable. The pre-compiled approach gives you the power of async/await and the speed of compiled code, while scriptcs means you don’t need to create a dll. The experience of porting Hubot scripts has so far been pretty painless as the API was designed to make this process incredibly easy.</p>
<p>Here is the hubot math script in scriptcs form</p>
<p><script src="https://gist.github.com/PeteGoo/6956172.js"></script></p>
<p>And as an IMMBot script</p>
<p><script src="https://gist.github.com/PeteGoo/6956182.js"></script></p>
<p>Notice that there is an Http fluent style helper for creating requests and processing the responses using HttpClient and Json.Net.</p>
<h2>Starting mmbot</h2>
<p>Starting mmbot is easy. You can choose to configure him by environment variables or by passing in config parameters in code</p>
<p><script src="https://gist.github.com/PeteGoo/6956204.js"></script></p>
<h2>What scripts does it have?</h2>
<p>mmbot currently has the following scripts</p>
<ul>
<li>Searching for images, animated gifs, cats, pugs, maps, youtube videos</li>
<li>Urban Dictionary definitions</li>
<li>Ascii art generator</li>
<li>Mustache me (place a mustache on someone’s face)</li>
<li>Xkcd comics</li>
<li>Spotify player /office jukebox with playlist, query, album, track playing and queuing and volume control.</li>
<li>Jetbrains TeamCity build server – querying build status, starting builds etc.</li>
</ul>
<p>Thanks to <a href="http://dkdevelopment.net" target="_blank">Damian Karzon</a> for contributing some of the scripts. Check out the current catalog at <a href="http://github.com/petegoo/mmbot">http://github.com/petegoo/mmbot</a></p>
<h2>What's next?</h2>
<p>Next for mmbot is to try to implement a few cooler features like loading and saving scripts from gists and starting from scriptcs (some issues here currently), and to get the script catalog expanded.</p>
<p><a href="http://github.com/petegoo/mmbot" target="_blank">Go check out the mmbot code on github</a>.</p>
<p><a href="https://blog.petegoo.com/2013/10/13/introducing-mmbot-a-c-hubot-port/">Introducing mmbot, a C# Hubot port</a> was originally published by Peter Goodman at <a href="https://blog.petegoo.com">PeteGoo</a> on October 13, 2013.</p>https://blog.petegoo.com/2013/08/06/making-the-systray-transparent-on-windows-phone2013-08-06T00:00:00+12:002013-08-06T00:00:00+12:00Peter Goodmanhttps://blog.petegoo.comblog@petegoo.com<p>I’ve used a semi-transparent AppBar on Windows Phone a number of times to get the visual effect I wanted and to stop the dreaded jumping frame when transitioning pages, but I didn’t think it would be possible with the SysTray. Turns out that the SysTray in Windows Phone can be made transparent.</p>
<p>This is pretty useful if you want to keep an image background that spans the entire height of the screen. Simply add the attributes below to your root PhoneApplicationPage element.</p>
<pre class="brush:xml">
shell:SystemTray.IsVisible="True"
shell:SystemTray.ForegroundColor="Yellow"
shell:SystemTray.Opacity="0"
</pre>
<p>and the result should look like this:</p>
<p><img class="alignnone size-medium wp-image-424" alt="systray" src="/images/2013/08/systray1-300x94.png" width="300" height="94" /></p>
<p><a href="https://blog.petegoo.com/2013/08/06/making-the-systray-transparent-on-windows-phone/">Making the SysTray transparent on Windows Phone</a> was originally published by Peter Goodman at <a href="https://blog.petegoo.com">PeteGoo</a> on August 06, 2013.</p>