PeteGooJekyll2024-01-30T18:43:31+13:00https://blog.petegoo.com/Peter Goodmanhttps://blog.petegoo.com/blog@petegoo.comhttps://blog.petegoo.com/2024/01/27/incident-response-roles2024-01-27T00:00:00+13:002024-01-27T00:00:00+13:00Peter Goodmanhttps://blog.petegoo.comblog@petegoo.com<p>Other parts of this series on Incident Response:</p>
<ol>
<li><a href="https://blog.petegoo.com/2023/12/06/so-you-need-an-on-call-team/">So you need an on-call team</a></li>
<li><a href="https://blog.petegoo.com/2024/01/17/incident-response-severity-levels/">Severity Levels</a></li>
<li>Incident Response Roles (this article)</li>
</ol>
<p>In this, the third part of the series on Incident Response, we’re going to cover arguably one of the most important aspects of incident response: distinguishing the roles that responders should assume.</p>
<p><img src="/images/2024/duke-of-york.jpg" alt="duke of york, belfast" /></p>
<h1 id="incident-commander">Incident Commander</h1>
<p>If you talk to almost any incident response team you will find that they commonly have identified the need for one person to orchestrate the efforts of the team handling the incident. In high-pressure environments there isn’t time for consensus driven decision making, random interruptions, or waiting for people to volunteer for tasks.</p>
<p>The Incident Commander is the person who is responsible for:</p>
<ul>
<li>Establishing lines of communication</li>
<li>Delegating tasks to other responders</li>
<li>Making decisions on severity, response, approach, lines of enquiry, and the response team</li>
<li>Situation Reports (SitReps will be covered in a separate post)</li>
<li>Concluding an incident</li>
<li>Evaluating the need for a blameless post-mortem and ensuring it is done</li>
</ul>
<p>You will see that I mention “Lines of Communication” and “Lines of Enquiry”. To me this is a great model for guiding the actions of an incident response team. Often we forget to validate our assumptions, explore other options, and communicate with the right people. For more details on this model read my earlier post - <a href="https://blog.petegoo.com/2023/02/22/incident-response-lines-of-communication-enquiry/">Lines of Communication and Lines of Enquiry in Incident Response</a>. I won’t cover those details here.</p>
<h2 id="delegating-tasks-to-other-responders">Delegating tasks to other responders</h2>
<p>One of the key responsibilities of the Incident Commander (IC) is to make sure that tasks are delegated and assigned to responders. It is important that the IC has enough space to perform the other responsibilities outlined here, so they should not overcommit themselves to too many incident response activities.</p>
<p>ICs need to be assertive in delegating tasks. They should try to avoid asking for volunteers. Nobody has time for that awkward silence while we wait for someone to put their hand up; instead, assign the task to the most appropriate person. It’s up to the IC to know each responder’s capabilities, what they are working on, and how to prioritise the incident response tasks.</p>
<h2 id="making-decisions-on-severity-response-approach-lines-of-enquiry-and-the-response-team">Making decisions on severity, response, approach, lines of enquiry, and the response team</h2>
<p>By this stage you should have identified severity levels that are important to your business; if not, read <a href="https://blog.petegoo.com/2024/01/17/incident-response-severity-levels/">part 2 in this series</a>. An Incident Commander should familiarise themselves with the Incident Severity Levels.</p>
<p>The severity level can change during an incident. The severity often dictates the appropriate level of incident response, so make sure you re-evaluate it regularly throughout the duration of the incident.</p>
<h2 id="the-response-team">The Response Team</h2>
<p>The Incident Commander chooses the members of the team.</p>
<ul>
<li>
<p>Do we need to bring new members into the team for more specialised expertise or to add more hands?</p>
</li>
<li>
<p>What specific roles are each of the team performing?</p>
</li>
<li>
<p>Are any members of the team fatigued, or do they have other commitments and need to be replaced?</p>
</li>
</ul>
<h2 id="identifying-the-incident-commander">Identifying the Incident Commander</h2>
<p>Not everyone will want to take on the responsibility of being an Incident Commander. It can be a stressful, tough (but rewarding) role, and some folks may not feel inclined or ready to take it on. Preferably there should be a list of pre-trained and/or experienced staff who have knowledge of the procedures and expectations of being an Incident Commander.</p>
<h1 id="operator">Operator</h1>
<p>The operator is the most obvious role in an incident response team. They are the person who is doing the work to resolve the incident. They are the person who is typing the commands, running the scripts, and making the changes.</p>
<p>At the beginning of an incident this may actually be the only role until we know that we have a significant enough incident to warrant an Incident Commander and Scribe.</p>
<p>This is often where people are most comfortable. Though be careful, without any designated Incident Commander you can only really have two Operators before things get really hairy.</p>
<h1 id="scribe">Scribe</h1>
<p>I always tell this story of how we identified the need for this role. It came from the behaviours of one of our QAs at a previous role. When we had an incident they would calmly slide over beside the folks involved and start taking notes in Sublime Text. At the time they used a plugin that noted the time beside each line. Later they would contribute during incident reviews by referring to the record of events.</p>
<p>Now we can just use Slack for this. On any incident, if there are any actions to be taken, make sure that someone performs the role of “scribe”. This is especially important if the response team is distributed and on a video call. Call out notes for the scribe to add to the record. For example:</p>
<ul>
<li>Observations that have been made</li>
<li>Actions we are taking</li>
<li>Expectations that we have of those actions</li>
<li>Assumptions we have made so far</li>
<li>Impact analysis</li>
</ul>
<p>Treat it like the court reporter in a trial. They are there to record the facts and observations, not to interpret them. The scribe should not be making any decisions or recommendations.</p>
<p>The output that the scribe produces will serve as the vital component of a Blameless Post Mortem - a true record of the timeline that will help us to understand what happened and why.</p>
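<p>The timestamped-notes habit the QA pioneered is easy to automate. A rough sketch in Python (the function name and log file path here are illustrative, not part of any particular tool):</p>

```python
from datetime import datetime, timezone


def log_note(note: str, log_file: str = "incident-log.txt") -> str:
    """Append a note to the incident log, prefixed with a UTC timestamp."""
    line = f"{datetime.now(timezone.utc).isoformat(timespec='seconds')} {note}"
    with open(log_file, "a") as f:
        f.write(line + "\n")
    return line
```

<p>Slack gives you this for free since every message is timestamped, but a plain append-only file like this works just as well when chat is part of the outage.</p>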
<h1 id="optional-roles">Optional Roles</h1>
<p>The following roles are not necessarily something you need but they can be useful in some situations.</p>
<h2 id="impact-analysis">Impact Analysis</h2>
<p>Impact Analysis can be a very detailed task. In special cases I have found it useful to spin out one or more people to gather this data, either for Customer Success to contact affected customers or to guide the response plan.</p>
<p>They will likely have to dig deep into observability tooling, logs, a data lake, production databases, etc.</p>
<h2 id="executive-communications">Executive Communications</h2>
<p>This is a little more common. It is an incredibly good idea to set the expectation with senior stakeholders that they should stay well away from the incident response team. Nobody needs the CEO/CTO rocking up into an incident response call and asking scary questions.</p>
<p>Typically the Incident Commander will handle these requests, but if the incident is large enough, is moving fast enough, and the number of stakeholders is large enough, it can be useful to have someone dedicated to this role.</p>
<h2 id="customer-communications">Customer Communications</h2>
<p>This is a very important role. If you have a Customer Success team then they should be the ones to handle this. If not then you should have someone who is responsible for communicating with customers.</p>
<h1 id="future-posts">Future Posts</h1>
<p>In future posts in this series we will cover:</p>
<ul>
<li>Situation Reports</li>
<li>Incident Response Playbooks</li>
<li>Reporting on Incidents</li>
<li>(Blameless) Postmortems</li>
<li>Paying people for on-call</li>
</ul>
<p><a href="https://blog.petegoo.com/2024/01/27/incident-response-roles/">Incident Response Part 3: Roles</a> was originally published by Peter Goodman at <a href="https://blog.petegoo.com">PeteGoo</a> on January 27, 2024.</p>
<p>Other parts of this series on Incident Response:</p>
<ol>
<li><a href="https://blog.petegoo.com/2023/12/06/so-you-need-an-on-call-team/">So you need an on-call team</a></li>
<li>Severity Levels (this article)</li>
<li><a href="https://blog.petegoo.com/2024/01/27/incident-response-roles/">Incident Response Roles</a></li>
</ol>
<p>In <a href="https://blog.petegoo.com/2023/12/06/so-you-need-an-on-call-team/">Part 1</a> of this series we covered why you might need an on-call team. In this post we will cover how to define severity levels for incidents. This is crucial in order to understand the impact of the incident on your customers and your business. This in turn will help your on-call team, your leadership team, and the rest of the business understand how to respond and how to communicate the incident.</p>
<p><img src="/images/2024/dangerous-for-swimming.jpg" alt="dangerous for swimming sign" /></p>
<h1 id="what-are-severity-levels">What are Severity Levels?</h1>
<p>Severity Levels are a way of describing how important an incident is to your business. This may mean that the impact to customers is high, or the impact to revenue is high, reputational damage is likely, or any other factor that is important in your context.</p>
<p>Typically these are numbers from 1 to 5 or more, with 1 being the most severe and 5 being the least severe. I think it is best to avoid words like “Critical”, “High”, “Medium”, “Low”, and “Informational” as these are all relative and can be interpreted differently depending on your background, experience and language/dialect.</p>
<p>Also, don’t make them zero-based. People will end up using the term P0 or Sev0 in extreme situations or in jest but there’s no need to formalise it, it will just confuse the non-tech people.</p>
<h1 id="why-is-it-important-to-define-severity-levels">Why is it important to define Severity Levels?</h1>
<p>Incident response can be an extremely fast moving and stressful situation. Often when we are in the throes of an incident it can be difficult to contextualise the level of response required in relation to the impact of the incident itself and even the potential impact of our actions. If you do it often enough you can easily fall into the trap of burning out your team or, in the other extreme, being overly complacent through familiarity.</p>
<p>Severity Levels help us to understand a very key aspect of incident response - who we should communicate the incident with and how. I talked about this at length in my post <a href="https://blog.petegoo.com/2023/02/22/incident-response-lines-of-communication-enquiry/">Lines of Communication and Lines of Enquiry in Incident Response</a>. I won’t repeat too much of that here but I encourage you to read that post if you haven’t already. In short, one of the most detrimental failures you can make in incident response is to fail to let the right people know that an incident is happening.</p>
<p>Severity levels can also impact how we respond to an incident. For example - it may be ok to leave an incident for a few hours or overnight if no customers are affected and we have a mitigation in place, or if the feature is seldom used. On the other hand we may decide for some incidents that we need to disable key features, communicate with customers, even sacrifice availability in very rare cases.</p>
<h1 id="how-do-i-define-severity-levels">How do I define Severity Levels?</h1>
<p>Severity levels are very likely unique to your context. The best way to define them is to sit down with representation from across the business - engineering, product, customer support, sales, legal etc and agree on what makes sense. You could use collaborative tools like Miro/Mural or a plain old whiteboard, then add cards for typical outages you have had in the past or can foresee in the future. Assign them to severity levels and then discuss and iterate until you have a set of levels that make sense to everyone.</p>
<p>What you are looking to end up with is a table much like the following:</p>
<table>
<thead>
<tr style="border-bottom:1pt solid black;">
<th>Severity</th>
<th>Sev 6</th>
<th>Sev 5</th>
<th>Sev 4</th>
<th>Sev 3</th>
<th>Sev 2</th>
<th>Sev 1</th>
</tr>
</thead>
<tbody>
<tr style="border-bottom:1pt solid darkgrey;">
<td style="border-right:1pt solid darkgrey;"><strong>Description</strong></td>
<td style="border-right:1pt solid darkgrey;">Internal Impact Only <br /><br /> No customers impacted</td>
<td style="border-right:1pt solid darkgrey;">Problems reported with non-core functions</td>
<td style="border-right:1pt solid darkgrey;">Customer confusion for a small subset of customers <br /><br /> Background jobs failing <br /><br /> Could become Sev 2/Sev 3</td>
<td style="border-right:1pt solid darkgrey;">Issue affecting a small group of customers <br /><br /> Redundancy loss with no impact <br /><br />Security near-miss</td>
<td style="border-right:1pt solid darkgrey;">Affects large number of customers or a Top 10 customer <br /><br />Functionality severely impaired</td>
<td>A serious event affecting most customers <br /><br /> Generally unavailable <br /><br /> Impairs ability to perform key tasks <br /><br /> Security event, e.g. breach/disclosure</td>
</tr>
<tr style="border-bottom:1pt solid darkgrey;">
<td style="border-right:1pt solid darkgrey;"><strong>Typical Examples</strong></td>
<td style="border-right:1pt solid darkgrey;">-</td>
<td style="border-right:1pt solid darkgrey;">-</td>
<td style="border-right:1pt solid darkgrey;">-</td>
<td style="border-right:1pt solid darkgrey;">-</td>
<td style="border-right:1pt solid darkgrey;">-</td>
<td>-</td>
</tr>
<tr style="border-bottom:1pt solid darkgrey;">
<td style="border-right:1pt solid darkgrey;"><strong>Response</strong></td>
<td style="border-right:1pt solid darkgrey;">-</td>
<td style="border-right:1pt solid darkgrey;">Inform Customer Success</td>
<td style="border-right:1pt solid darkgrey;">Inform Customer Success <br /><br /> Inform Engineering Leadership (VPE)</td>
<td style="border-right:1pt solid darkgrey;">Inform Engineering Leadership (VPE+CTO) <br /><br /> Implement in-product notifications of issue</td>
<td style="border-right:1pt solid darkgrey;">Notify Executive Leadership Team<br /><br />Raise Status Page</td>
<td>Notify Executive Leadership Team <br /><br />Notify Board<br /><br /></td>
</tr>
</tbody>
</table>
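<p>It can also help to encode the agreed levels next to your alerting tooling so the table and the code never drift apart. A minimal sketch, paraphrasing the example table above (the descriptions and response actions are the table's, not a standard; adapt them to your business):</p>

```python
from enum import IntEnum


class Severity(IntEnum):
    """1 is the most severe; deliberately not zero-based (no Sev0)."""
    SEV1 = 1  # Most customers affected, generally unavailable, security event
    SEV2 = 2  # Large number of customers or a Top 10 customer affected
    SEV3 = 3  # Small group of customers affected, redundancy loss, near-miss
    SEV4 = 4  # Customer confusion for a small subset, background jobs failing
    SEV5 = 5  # Problems reported with non-core functions
    SEV6 = 6  # Internal impact only, no customers impacted


# Response actions keyed by severity, mirroring the "Response" row above
RESPONSE = {
    Severity.SEV1: ["Notify Executive Leadership Team", "Notify Board"],
    Severity.SEV2: ["Notify Executive Leadership Team", "Raise Status Page"],
    Severity.SEV3: ["Inform Engineering Leadership (VPE+CTO)",
                    "Implement in-product notifications of issue"],
    Severity.SEV4: ["Inform Customer Success",
                    "Inform Engineering Leadership (VPE)"],
    Severity.SEV5: ["Inform Customer Success"],
    Severity.SEV6: [],
}
```

<p>Because the numbers sort naturally (lower is worse), "err on the side of caution" is just picking the smaller of two candidate levels.</p>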
<h1 id="how-do-i-evaluate-the-severity-level-of-an-incident">How do I evaluate the Severity Level of an incident?</h1>
<p>Keep a link to the above table in your incident response documentation. When an incident occurs, evaluate the impact of the incident against the table. If you are unsure, err on the side of caution and escalate to the next severity level.</p>
<p>Add examples that are relevant to you and your business. These should be regularly revised.</p>
<h1 id="how-do-i-balance-the-need-for-impact-analysis-with-the-need-to-respond-quickly">How do I balance the need for impact analysis with the need to respond quickly?</h1>
<p>It can be very difficult to evaluate the severity level of an incident in the heat of the moment. Often you are trying to prioritise stabilising the system over other seemingly non-critical tasks. This is one of the reasons why it is very useful to <a href="https://blog.petegoo.com/2023/12/06/so-you-need-an-on-call-team/">have more than one responder to an incident</a>. With multiple responders you can spin someone out to assess the impact of the incident and communicate with the rest of the business.</p>
<p>Failure to assess the severity level can result in substandard communication protocols, disgruntled customers, and even a lack of trust in the on-call team. Aspire to have very clear guidelines on what constitutes a severity level. Keep the document up to date with typical prior examples so that these can more easily be assessed in the moment.</p>
<h1 id="future-posts">Future Posts</h1>
<p>In future posts in this series we will cover:</p>
<ul>
<li>Situation Reports</li>
<li>Incident Response Playbooks</li>
<li>Reporting on Incidents</li>
<li>(Blameless) Postmortems</li>
<li>Paying people for on-call</li>
</ul>
<p><a href="https://blog.petegoo.com/2024/01/17/incident-response-severity-levels/">Incident Response Part 2: Severity Levels</a> was originally published by Peter Goodman at <a href="https://blog.petegoo.com">PeteGoo</a> on January 17, 2024.</p>
<p>Other parts of this series on Incident Response:</p>
<ol>
<li>So you need an on-call team (this article)</li>
<li><a href="https://blog.petegoo.com/2024/01/17/incident-response-severity-levels/">Severity Levels</a></li>
<li><a href="https://blog.petegoo.com/2024/01/27/incident-response-roles/">Incident Response Roles</a></li>
</ol>
<p>As your product gains traction and expectations from customers increase, you may find that at some point things start failing. You see one or more of the following signs start to accumulate:</p>
<ol>
<li>You find out about issues from your customers, not your team or your own monitoring systems.</li>
<li>These issues are increasingly appearing outside of your working hours and sit unresolved until the next day/week.</li>
<li>Your teams are fully committed to new features and are struggling to find time to fix issues.</li>
<li>You find it hard to prioritise issues because you don’t have a clear understanding of the impact.</li>
<li>You have no clear ownership of production issues so they get passed around between teams.</li>
</ol>
<p>You may further find that these failures are starting to cost you in terms of the trust you have with your customers, the tension between people and roles internally, and possibly even the health of your engineering team.</p>
<p>You need an on-call team.</p>
<p><img src="/images/2023/delboy_mobile.jpg" alt="delboy mobile" /></p>
<h1 id="what-is-an-on-call-team">What is an on-call team?</h1>
<p>An on-call team is a group of people who are responsible for responding to production issues. They are the first line of defence when things go wrong. They are the people who are woken up in the middle of the night when your systems fail. They are the people who are responsible for restoring service to your customers.</p>
<h1 id="do-i-need-to-hire-another-team-of-people-for-on-call">Do I need to hire another team of people for on-call?</h1>
<p>No.</p>
<p>In a lot of traditional organizations, the on-call team is a separate team of people who are responsible for responding to production issues. This is a terrible idea. It creates a divide between the people who build the systems and the people who run the systems. It creates a culture of “throwing things over the wall” and “not my problem”. It creates a culture of “us vs them”. It creates a culture of “I don’t care about the quality of my work because I’m not the one who has to fix it when it breaks”.</p>
<p>Coda Hale outlines this beautifully in his talk Metrics, Metrics Everywhere:</p>
<blockquote>
<p>“Our code generates business value <strong>when it runs</strong>, not when we write it”.</p>
</blockquote>
<p>In other words, we should really care about what our code is doing when it runs, because that is when it is doing its job. If you don’t, then you’re creating art, not business value.</p>
<p>There’s another aspect at play here and that is that the people who are responsible for creating the issue are the ones who are best placed to fix it. They are the ones who have the context and the knowledge to understand the problem. They are the ones who are best placed to learn from the issue and to prevent it from happening again. If you want an efficient engineering organization then you need to shorten the time from impact to learning and then outcome (more reliable software). You can only do this if the people who feel the pain are the ones who can alleviate that same pain. Separate teams creates weird power dynamics and misaligned incentives.</p>
<h1 id="so-should-everyone-be-on-call">So should everyone be on-call?</h1>
<p>It depends. If you have a very clear service architecture and a big budget, you can have a rotation in each team, though this can get prohibitively expensive. I think that, if you have a separate SRE or Platform Engineering team, it makes sense for most of those folks to be on-call, as a lot of the incidents that occur will need some insight into the underlying platform/infrastructure. Your service/product/program teams can be a little more fluid depending on how homogeneous your services are, and how well the teams communicate changes and risks.</p>
<p>If you have a small org then just do what you can. See below for how to organize rotations and ideal rotation size.</p>
<h1 id="how-do-i-convince-my-engineers-to-go-on-call">How do I convince my engineers to go on-call?</h1>
<p>There are a number of ways of doing this, and it depends on the resources (budget) you have available to you, the size of your engineering team, the level of trust you have with your engineers, and the amount of empathy they have for each other and your customers.</p>
<ol>
<li>
<p>Start with the most dedicated, driven people</p>
<p>Chances are you probably have some people already on your team who are driven, care deeply about your customers, and are willing to go the extra mile to make sure things are working. These are the people you want to start with. They are the ones who will set the tone for the rest of the team.</p>
</li>
<li>
<p>Pay people for their time</p>
<p>If you have the budget you should pay people for their time because it’s the right thing to do. Here in New Zealand this is fairly easy to do. In the US this can be a little harder but we’ll cover this in more details in a future post.</p>
</li>
<li>
<p>Give them time in lieu for time spent responding out of hours.</p>
<p>Regardless of whether you pay people for their on-call time or not, when someone is called out of hours to respond to an issue you should give them back that time by allowing them to reclaim it from their working hours. If you are paying them for being on-call, chances are it’s not their salaried rate anyway. It will also go a long way to helping them justify the disruption to their personal lives, and those of their partners and kids.</p>
<p>Further, in my experience, this is something you have to reinforce. People will try to be heroes and power through. Gently remind them that they need to take time to recover.</p>
</li>
<li>
<p>If you can, always respond in pairs</p>
<p>We learned this at a previous company and it served us really well. <a href="https://www.cnbc.com/2019/02/28/what-google-learned-in-its-quest-to-build-the-perfect-team.html">Psychological safety at work is incredibly important</a>. Psychological safety at 3am when things are on fire is even more important. Two pairs of eyes are infinitely more reliable, safe, and effective than one. Having a copilot to make sure you’re typing the right command, clicking the right button, shutting down the right server, or whatever, is invaluable.</p>
</li>
<li>
<p>Make sure you have a clear escalation path</p>
<p>As for #4, make sure people know that they are not alone and they can always escalate. That typically means that you will be contactable yourself. You need to make it ok and make it something that you would rather they did in a time of uncertainty than not.</p>
</li>
<li>
<p>Recognise and praise the on-call team regularly</p>
<p>When there is a significant incident make sure to publicly praise the on-call team, and thank them for their contribution, no matter the outcome. Other people see this and it helps to build empathy for those that are on rotation.</p>
</li>
<li>
<p>Have a company phone plan for the on-call team</p>
<p>They may well have to hotspot wherever they are, so make sure they have a company phone plan that covers this. It’s a small thing, but it’s a nice thing to do.</p>
</li>
</ol>
<p>One of the surprising outcomes that I notice about on-call teams is that the people on-call have a much better mental model for the way that the software works. They are more active in architectural and design discussions as a result and they tend to be more effective generally. This means they get promoted faster and this gets noticed.</p>
<p>We often would find that we had a queue of people who had expressed interest in joining the on-call team. This was true even before we paid people an hourly rate for being on-call. When I asked people why they wanted to join the rotation they would tell me the same thing - it was seen as a great learning opportunity and way to grow their career.</p>
<p>To be clear, you don’t promote people because they are on-call, you promote them when they become more effective at their jobs.</p>
<p>Another observation I made though was that this good will erodes very quickly if the on-call team is getting woken up constantly, are unable to effect the outcomes they need, and are generally getting beaten up night after night. You still need a good incident response, continuous improvement, learning, and blameless culture in order to make this work.</p>
<h1 id="what-about-the-people-who-dont-want-to-go-on-call">What about the people who don’t want to go on-call?</h1>
<p>We all have lives and different priorities. Listen to their context and apply some empathy. They might have young kids, be caring for a dependent, dealing with a health issue, shared living spaces, who knows.</p>
<h1 id="organizing-rotations">Organizing rotations</h1>
<p>This is highly dependent on your specific product and team topology but here are a few guiding principles:</p>
<ol>
<li>Healthy rotations in my experience are 1 week at a time, every 4-7 weeks. This gives enough time for recovery while not being too far apart that you forget how things work or lose context on what has changed.</li>
<li>Rotations should preferably be in pairs. This is for psychological safety and to make sure that you have a copilot to help you out.</li>
<li>Swaps will happen but you need to set some ground rules like no more than two weeks on-call for any individual.</li>
<li>Handover rotations during the week, during working hours. Tuesday midday for example is good.</li>
<li>Christmas and New Year’s need special treatment. We would shorten this to 1 or 2 day rotations and make sure that the inconvenience of being on-call for key days like Christmas Day and New Year’s Day was spread out.</li>
<li>Set expectations around taking a laptop everywhere you go and not drinking alcohol or partaking in other mind altering substances while on-call.</li>
</ol>
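<p>The first few principles above (week-long shifts, pairs, midweek handovers) can be sketched as a schedule generator. This is a toy illustration, not a replacement for a proper scheduling tool like PagerDuty or Opsgenie; the names are made up:</p>

```python
from datetime import date, timedelta


def build_rotation(engineers, start: date, weeks: int):
    """Sketch a paired, week-long on-call rotation with Tuesday handovers.

    Engineers are paired off in list order (an odd engineer out is left
    unscheduled). With 8 engineers (4 pairs), each pair is on-call one
    week in four, within the healthy every-4-to-7-weeks range.
    """
    pairs = [tuple(engineers[i:i + 2]) for i in range(0, len(engineers) - 1, 2)]
    # First handover lands on the Tuesday (weekday 1) on or after `start`
    first = start + timedelta(days=(1 - start.weekday()) % 7)
    return [(first + timedelta(weeks=w), pairs[w % len(pairs)])
            for w in range(weeks)]
```

<p>Real rotations need swaps, leave, and holiday carve-outs on top of this, which is exactly why the ground rules above matter more than the generator.</p>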
<p>At a previous company we would have an SRE team and a product specific team member paged for each incident.</p>
<p>For example:</p>
<p>Product 1 alert -> SRE + Product 1 team member alerted</p>
<p>Product 2 alert -> SRE + Product 2 team member alerted</p>
<p>Very rarely did we have overlapping incidents unless there was a cloud provider failure in which case we merged the incidents anyways.</p>
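<p>That SRE-plus-product routing rule can be expressed as a small lookup. A sketch, assuming a hypothetical on-call table that you would in practice populate from your paging tool’s API (the service keys and names here are invented):</p>

```python
# Hypothetical current on-call lookup; in practice this would query your
# paging tool (PagerDuty, Opsgenie, etc.) for the active schedules.
ON_CALL = {
    "sre": "alice",
    "product-1": "bob",
    "product-2": "carol",
}


def responders_for(alert_service: str) -> list:
    """Every alert pages the SRE on-call plus the owning product team's on-call."""
    paged = [ON_CALL["sre"]]
    if alert_service != "sre" and alert_service in ON_CALL:
        paged.append(ON_CALL[alert_service])
    return paged
```

<p>The point of the rule is the pairing: the SRE brings platform context, the product engineer brings service context, and neither responds alone.</p>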
<h1 id="future-posts">Future Posts</h1>
<p>In future posts in this series we will cover:</p>
<ul>
<li>Situation Reports</li>
<li>Incident Response Playbooks</li>
<li>Reporting on Incidents</li>
<li>(Blameless) Postmortems</li>
<li>Paying people for on-call</li>
</ul>
<p><a href="https://blog.petegoo.com/2023/12/06/so-you-need-an-on-call-team/">Incident Response Part 1: So you need an on-call team</a> was originally published by Peter Goodman at <a href="https://blog.petegoo.com">PeteGoo</a> on December 06, 2023.</p>
<p>In the early days of a close-knit team of similarly experienced people you have this amazing ability to make decisions quickly. Remember that? You need to move fast and not sweat the small things. You have really important challenges to solve and you accept that getting some things wrong initially is very much ok.</p>
<p>As your team grows into many teams…guilds…chapters…or whatever (no judgement), you find that each of these small decisions are taking longer to make. You start to see that the team is spending more time discussing and debating these small decisions and you start to wonder if you should be doing something about it.</p>
<p>The thing that tends to happen at this point is that consensus has become the order of the day.</p>
<p><img src="/images/2023/football-odd-one-out.jpg" alt="consensus" /></p>
<p>There is this weird belief that consensus is a great thing to have in a group of people. I guess it is, but it only happens if the group is very, very small. Why? <strong>Because instances of consensus trend towards zero as the size of the group increases</strong>.</p>
<p>This is why many teams have discovered an alternative approach - <strong>invert the problem and instead of chasing consensus, look for dissent</strong>. If you look at the Netflix Culture and Valued Behaviours you will find this resonates with their behaviour of “Informed Captains”:</p>
<blockquote>
<p>For every significant decision, we identify an informed captain of the ship who is an expert in their area. They are responsible for listening to other people’s views and then making a judgment call on the right way forward. We avoid decisions by committee, which would slow us down and diffuse responsibility…</p>
<p>…On big strategic issues, the captain farms for dissent and other alternatives to ensure they are truly informed. Dissent can be difficult, which is why we make an effort to stimulate discussion…We don’t wait for consensus or vote by committee, nor do we drive to rapid, uninformed decision making…The bigger the decision, the more extensive the debate. Afterwards, as the impact becomes clearer, we reflect on the decision and see how we could do even better in the future.</p>
</blockquote>
<p>[<a href="https://jobs.netflix.com/culture">source</a>]</p>
<p>Similarly, Amazon discuss how they “bias for action” and “disagree and commit”</p>
<blockquote>
<p>Speed matters in business. Many decisions and actions are reversible and do not need extensive study. We value calculated risk taking.</p>
<p>Leaders are obligated to respectfully challenge decisions when they disagree, even when doing so is uncomfortable or exhausting. Leaders have conviction and are tenacious. They do not compromise for the sake of social cohesion. Once a decision is determined, they commit wholly.</p>
</blockquote>
<p>[<a href="https://www.amazon.jobs/content/en/our-workplace/leadership-principles">source</a>]</p>
<p>I really love the term “farming for dissent”. It recognises that there is work involved in getting people to speak up and that it is an active, not passive, approach.</p>
<p>In Amazon’s case this idea of reversible decisions is best described by the term “one way door vs two way door”. I’ve found this approach really useful in evaluating the risk of making a particular decision.</p>
<h1 id="how-can-farming-for-dissent-go-wrong">How can farming for dissent go wrong?</h1>
<p>The key thing to remember is that dissent is not about being contrarian. It’s not about being difficult or awkward. It’s about being informed and having a different perspective.</p>
<p>Another failure mode of this approach is that it often favours the “loudest voice in the room”. Being loud and opinionated is not the same as being informed and having a different perspective. It’s important to make sure that you are hearing from all voices in the room and that you are giving people the space to be heard. This is a skill that leaders need to develop.</p>
<h1 id="building-consensus">Building Consensus</h1>
<p>I firmly believe that you need to have some level of confidence in your proposal before you put it out there for broad dissent. Why? Because psychological safety is a basic need, we all feel a little impostor syndrome and self-doubt, and some people can suck at delivering constructive feedback.</p>
<p>The answer is to socialise your ideas with a wing-person or two. This is a great way to get some feedback and to build confidence in your ideas. It’s also a great way to get some feedback on how you are presenting your ideas. You can figure out how best to land the message and how to make sure you are heard.</p>
<p>I often form strong opinions after layers of socialising my thoughts through consecutive circles of trust. By the end I have stronger reasoning and greater confidence.</p>
<h1 id="so-how-do-i-know-ive-made-an-informed-decision">So how do I know I’ve made an informed decision?</h1>
<p>Basically you are looking for the Goldilocks effect of feedback. You want enough feedback to make an informed decision but not so much that you are paralysed by it.</p>
<p>Don’t just stick to your wing-person or your own echo chamber - actively seek out dissent. Discard what is irrelevant and focus on the feedback that challenges your assumptions.</p>
<h1 id="summary">Summary</h1>
<p>Consensus has broken many great organisations. Try to invert the model, seek feedback and farm for dissent. Then commit to a decision, document it, and move on.</p>
<p><a href="https://blog.petegoo.com/2023/09/21/consensus-momentum-dissent/">Stop chasing consensus, start building momentum and farming for dissent</a> was originally published by Peter Goodman at <a href="https://blog.petegoo.com">PeteGoo</a> on September 21, 2023.</p>
<p>Traditional VPNs have been the go-to solution for many companies when considering how best to secure access to their internal tools on the public internet. With the widespread adoption of the hybrid office and remote working in the COVID era, the use of VPNs has increased significantly.</p>
<h2 id="y-u-no-vpn">Y U NO VPN?</h2>
<p>However, the traditional VPN approach has come under scrutiny in recent years due to a number of fundamental flaws in its design and implementation.</p>
<h3 id="tunnelcrack-and-other-vulnerabilities">TunnelCrack and other vulnerabilities</h3>
<p>In August 2023 a vulnerability present in most VPNs was uncovered which could allow an attacker to convince a VPN client that a secured site behind the VPN was actually a local resource. Once in place, the attacker could essentially steal any data that was intended for the target. This vulnerability was dubbed “TunnelCrack” and it illustrates one of the main flaws of the typical VPN architecture, as we’ll see later - that not all networks should be treated the same.</p>
<p>If you visit the CVE databases you will also find <a href="https://www.cvedetails.com/vulnerability-list/vendor_id-3278/Openvpn.html">endless</a> <a href="https://www.cvedetails.com/vulnerability-list/vendor_id-628/product_id-12675/Sonicwall-Global-Vpn-Client.html">disclosures</a> <a href="https://www.cvedetails.com/vulnerability-list/vendor_id-12126/product_id-112852/version_id-687403/Amazon-Aws-Client-Vpn-2.0.0.html">of</a> <a href="https://www.cvedetails.com/vulnerability-list/vendor_id-16/product_id-1887/Cisco-Vpn-Client.html">vulnerabilities</a> in just about any VPN implementation out there. Unpatched, one of these could become an existential threat for any network architecture.</p>
<h3 id="network-level-access">Network level access</h3>
<p>Typically VPNs give you access to a network, or part of that network. They essentially route all traffic bound for a certain subnet or significant range of IP addresses over the VPN. This has the unfortunate side effect that the type of traffic is unbounded. Often, though, we know that the individual applications we want to expose over the VPN have a single type of traffic, like simple web applications that use https, and we don’t need the rest of the network to be exposed.</p>
<p><img src="/images/2023/vpn-example.png" alt="vpn-example" /></p>
<p>It’s like the difference between allowing someone to make a phone call to a person in your organisation through your switch board vs driving the caller in an armoured car to your office door, then letting them loose inside.</p>
<p>Now, your particular network topology likely has a way to limit these impacts. If you are using AWS, for example, you can minimise this by implementing security groups that cross-reference one another to allow fine-grained network-level access, but this can be difficult to manage and can become impossible at scale due to inherent limits on the number of rules you can define.</p>
<h3 id="the-local-network-becomes-inherently-trusted">The local network becomes inherently “trusted”</h3>
<p>When we take this border-focused approach to our network topology there’s a really interesting side effect. This design drives us down a path where we treat anything “inside” the network in which our applications reside as “trusted”. Once inside that network there is little to stop you moving around, so compromising a VPN connection can lead to very dire consequences.</p>
<p>What if we were to assume that no networks are inherently “safe”? Well, this is where Zero Trust comes in.</p>
<h3 id="aside-why-not-yolo-your-apps-on-the-public-web">Aside: Why not YOLO your apps on the public web?</h3>
<p>Before we tackle zero-trust we need to answer this question. It might seem like a good idea to put your applications on the public web, I mean they all can implement authentication, right?</p>
<p>Well, here’s the thing: <strong>I don’t trust anybody to secure the entire surface area of their web application</strong>.</p>
<p>What do I mean by that? Well, let’s take an example - you have a build server of some sort, maybe it’s Jenkins, that has its own login page, so you put it on the public web. Now we all know that it needs to implement good brute force defence, well thought out password reset flows etc, so maybe we choose the SAML or OIDC option that it implements so we can defer all that stuff to our enterprise IdP like Okta. Problem solved, right? Now Okta takes care of all of our authentication concerns, right? Right?</p>
<p>Well, what about all those other endpoints and pages on that app - have they remembered to implement authentication on all of those and make sure none of them are exposing unauthenticated functionality or other vulnerabilities? What about in the next update, with the next feature, and the one after that?</p>
<p><em>Incidentally, choose OIDC over SAML if you can. There are many flawed SAML implementations out there and OIDC is more performance-friendly and easier to implement.</em></p>
<h2 id="zero-trust">Zero-Trust</h2>
<p>I first came across the term Zero Trust when Google <a href="https://www.beyondcorp.com/">published the BeyondCorp set of guidance</a>. The major aha moment for me was the idea that, no matter if you were in a Google building or working remotely, your access to applications necessary to do your job was via the same set of controls. The major feature of those controls was an “Access Proxy”, sometimes called an “Identity-Aware Proxy”.</p>
<p><img src="/images/2023/access-proxy-example.png" alt="access-proxy-example" /></p>
<p>The idea of this proxy is that it acts as a gateway: no traffic gets through to the target app unless it has been authenticated. There is a single implementation, with a very small surface area, guarding everything behind it.</p>
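<p>As a toy sketch of that gateway decision (not a production proxy - the request shape and the in-memory session table here are invented for illustration, standing in for real OIDC token validation), every request faces the same check before any application sees it:</p>

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Request:
    path: str
    session_token: Optional[str] = None

# Invented stand-in for real token validation (OIDC code exchange,
# JWT signature checks, session lookup against the IdP, etc.)
VALID_SESSIONS = {"session-abc": "alice@example.com"}

def route(request: Request) -> str:
    """The access proxy's core decision: forward or bounce to the IdP."""
    user = VALID_SESSIONS.get(request.session_token or "")
    if user is None:
        # Unauthenticated traffic never reaches the target app
        return "302 -> https://idp.example.com/authorize"
    # Authenticated: forward upstream with the identity attached
    return f"200 forward {request.path} as {user}"
```

<p>The point is where the check lives: in one small, auditable component in front of everything, rather than scattered across every endpoint of every internal app.</p>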
<p>When I first started playing with Zero Trust I was using <a href="https://github.com/oauth2-proxy/oauth2-proxy">oauth2-proxy</a>. It was fine but I had to run it myself on an EC2 instance, make sure it was the latest version and generally feed it myself.</p>
<h2 id="zero-trust-access-proxies-in-aws">Zero Trust Access Proxies in AWS</h2>
<p>In AWS we typically host our applications behind an Application Load Balancer (ALB). This allows us to choose to run the application in a container or on an EC2 instance while at the same time scaling it out and offloading TLS and other concerns.</p>
<p>A feature of the listener rules within an ALB is that you can specify that no traffic is allowed through unless the client <a href="https://docs.aws.amazon.com/elasticloadbalancing/latest/application/listener-authenticate-users.html">has been authenticated via OIDC or Amazon Cognito</a> (which in turn can support social login, SAML etc).</p>
<p><img src="/images/2023/aws-alb-rv-proxy.png" alt="Diagram of a AWS ALB Architecture" /></p>
<p>In this way, no traffic is allowed past the ALB unauthenticated. You can even pass through the <a href="https://docs.aws.amazon.com/elasticloadbalancing/latest/application/listener-authenticate-users.html#user-claims-encoding">access token, identity and claims as headers</a> to the target application once authenticated.</p>
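<p>On the application side, the user’s claims arrive in the <code>x-amzn-oidc-data</code> header as a JWT. A minimal sketch of reading them (illustration only - a real app must first verify the token’s signature against the ALB’s public key, which this deliberately skips):</p>

```python
import base64
import json

def claims_from_oidc_data(jwt: str) -> dict:
    """Decode the payload segment of the x-amzn-oidc-data JWT.

    WARNING: no signature verification here -- never trust these
    claims in production without validating the token first.
    """
    payload = jwt.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # tolerate missing base64url padding
    return json.loads(base64.urlsafe_b64decode(payload))
```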
<p>A great benefit of this approach is that AWS is now responsible for patching, scaling the hardware, and guaranteeing the safety of this component. In compliance terms you’ve moved more things onto their side of the shared responsibility model.</p>
<h3 id="aws-verified-access">AWS Verified Access</h3>
<p>A newer service from AWS that abstracts all of the ALB configuration away so that you can deploy a private ALB and still proxy authenticated traffic through to your applications is <a href="https://aws.amazon.com/verified-access/">AWS Verified Access</a>.</p>
<p>This effectively allows you to do the same thing but you don’t have to configure OIDC on each load balancer and instead can centrally configure it once. You can also use <a href="https://aws.amazon.com/blogs/opensource/using-open-source-cedar-to-write-and-enforce-custom-authorization-policies/">Cedar</a> policies and device claims like MDM certificates to further implement your Zero Trust posture. The only downside here for me is the cost: Verified Access will cost at least US$200 / month / application whereas an ALB is only US$23 or so, depending on configuration and usage.</p>
<h2 id="zero-trust-access-proxies-in-gcp">Zero Trust Access Proxies in GCP</h2>
<p>In GCP you can implement something very similar using Google’s <a href="https://cloud.google.com/iap">Identity Aware Proxy service</a>.</p>
<p><img src="/images/2023/iap-load-balancer.png" alt="Google IAP" /></p>
<h2 id="zero-trust-access-proxies-in-azure">Zero Trust Access Proxies in Azure</h2>
<p>In Azure the closest thing is (I think) <a href="https://learn.microsoft.com/en-us/azure/active-directory/app-proxy/what-is-application-proxy">Azure AD App Proxy</a>, although I’m not as familiar with this.</p>
<h2 id="summary">Summary</h2>
<p>So really, if all you need to do is safely secure some web-based applications like your CI/CD tools, reporting, and other internal tooling but keep them accessible over the public internet then you’re better off dropping the VPN and implementing a Zero-Trust Access Proxy / Authenticating Reverse Proxy. It’s likely cheaper and safer in the long run.</p>
<p><a href="https://blog.petegoo.com/2023/09/14/zero-trust-proxies-aws-alb/">Zero Trust Authenticating Reverse Proxies in AWS Application Load Balancers</a> was originally published by Peter Goodman at <a href="https://blog.petegoo.com">PeteGoo</a> on September 14, 2023.</p>
<p>Over the last decade or more I’ve been involved in a lot of incident response activity and the subsequent post-mortems or post-incident reviews (a good thing). I’m always incredibly interested in the behaviours of incident responders and just how difficult it is to remember all the right things to do when you are the one sitting in the hot seat.</p>
<p>At times I have seen first hand that frozen state where you forget how everything works and have no idea where to look for clues. At other times I’ve seen that all-too-common scenario where the incident response team is busy looking around, diagnosing, investigating but they have forgotten to tell anyone that there is an incident going on so that customers and stakeholders can be informed.</p>
<p>A few years ago I came to the realisation that there are two distinct lists that you should keep for your incident response team.</p>
<ul>
<li>Lines of Enquiry</li>
<li>Lines of Communication</li>
</ul>
<p>I decided to put these in a wiki page that is easily at hand. Our incident response bot links to these pages so folks can easily find them when they think “Lines of what again?”.</p>
<h3 id="lines-of-enquiry">Lines of Enquiry</h3>
<p>These are going to be different for almost every organisation but there are some common themes that you should try to hit.</p>
<p>Firstly, all things happen for a reason and that reason is almost always a change of some sort. Use your experience and trawl through your incident reports. Try to think of the most common changes that cause issues. These are typically deployments, infrastructure changes, patching, flag flips, etc.</p>
<p>For us, this looks like:</p>
<blockquote>
<ol>
<li>What did we change?</li>
</ol>
<p>Very often a system starts to misbehave after a change of some sort</p>
<ul>
<li>Was there a recent deploy?
<ul>
<li>Did the issue begin happening just after or during a new deployment?</li>
<li>Is there a chance that the deployed change is related?</li>
<li>Is the deployment safe to revert?</li>
<li>Deploy the previous release asap and continue to investigate while you wait for the result</li>
</ul>
</li>
<li>Are we currently patching machines?
<ul>
<li>Was there a change in machine images that could have caused the issue?</li>
<li>Was there a recent update of the operating system or a component?</li>
</ul>
</li>
<li>Has a related workload been deployed recently?</li>
<li>Have we just deployed some terraform changes?</li>
</ul>
</blockquote>
<p>Next you want to think about any changes in traffic patterns. If we didn’t change something explicitly, maybe the behaviour of our users has changed or we could be under some kind of volumetric attack.</p>
<blockquote>
<p>2. Has our traffic changed?</p>
<ul>
<li>Is there more traffic load?</li>
<li>Are we seeing unusual load on certain endpoints?</li>
<li>Is it a <insert seasonal outliers e.g. end of month, end of year>?</li>
<li>Are we processing a large number of queued messages / workload A / workload B / workload C?</li>
</ul>
</blockquote>
<p>It’s a good idea to give the readers links to dashboards etc here. Nobody wants to be messing around with subpar wiki search engines during an incident.</p>
<p>Next up, it’s always good to check if it’s someone else’s stuff that’s broken. I’ve found that, as your site reliability grows, it becomes painfully obvious how terrible other people are at theirs. It is not uncommon to be the one to make someone else aware their stuff is fried.
One of your best tools is DownDetector or, my personal favourite, just searching Twitter.</p>
<blockquote>
<p>3. Has one of our partners had a fault?</p>
<ul>
<li>Is <insert cloud vendor> reporting issues?
<ul>
<li><link to their status page></li>
<li>Single AZ failure? Regional?</li>
</ul>
</li>
<li>Twilio / SendGrid / Vendor A / Partner B ?</li>
<li>Global CDNs? CloudFlare? Fastly? Akamai?</li>
<li>DNS providers?</li>
<li>Mobile Carriers? AT&T? Verizon?</li>
<li>Large networking providers? Comcast? BGP again?</li>
</ul>
</blockquote>
<p>Then there’s a list of things that can change without any action by a human.</p>
<blockquote>
<p>4. What could have been changed on us?</p>
<ul>
<li>Could the SQL query optimizer have created a bad plan in the database that is impacting our query performance? Purge the worst plans.</li>
<li>Could a scheduled maintenance job have kicked in?</li>
<li>Could a container have recycled?</li>
<li>Autoscaling kicked in/out?</li>
</ul>
</blockquote>
<p>Now you want to think about less recent changes that may only be impacting now because of some confluence of events.</p>
<blockquote>
<p>5. Could this be the first time we have done this since a change?</p>
<ul>
<li>It may be the first time a certain type of scheduled activity has run since a change.</li>
<li>It could be that we have just scaled out our cluster since a change to the machine images, configuration, code.</li>
</ul>
</blockquote>
<p>Lastly it’s a good idea to think about adjacencies. You never know when you might benefit from that extra context which is going to challenge your assumptions.</p>
<blockquote>
<p>6. What else could be affected?</p>
<p>Look for clues from services with the same dependencies etc to see if they also have issues but are not producing alerts in the same way.</p>
<h1 id="lines-of-communication">Lines of Communication</h1>
</blockquote>
<p>My personal failure mode in the past has been getting lost in the details of an incident and forgetting to inform other people. Those have led to some of the hardest post-incident conversations. It’s understandable, but it’s also incredibly frustrating when you find out way too late that something is going on, or customers are calling up to complain and you have nothing to tell them, or someone says “hey, tell me about this ongoing incident for <that thing you’re responsible for>” and you have no idea what they are talking about.</p>
<p>So here are some starting points for lines of communication.</p>
<blockquote>
<ol>
<li>Have you got an Incident Commander?
<ul>
<li>You’re in Incident Response. You need one. Escalate until you have one</li>
</ul>
</li>
<li>
<p>Have you told Customer Support / Success that customers could be affected?</p>
</li>
<li>Do you need to let your Manager / Director / VP know?
<ul>
<li><Insert your own escalation policy here e.g. If it’s Severity 1-3 then you should…></li>
</ul>
</li>
<li>Should we post a StatusPage?
<ul>
<li><insert your orgs philosophy on when this is appropriate and who makes the decision></li>
</ul>
</li>
<li>Have we opened a ticket with <cloud provider> / <Vendor A> / <Partner B>?
<ul>
<li>Let them know they have a problem</li>
</ul>
</li>
</ol>
</blockquote>
<p>Hopefully these two lists will serve you well. Remember that they are there, use them, and feed them regularly…</p>
<ul>
<li>If you’re in an incident, everything is burning and you don’t know what to do - <strong>Lines of Enquiry</strong></li>
<li>If you’re in an incident and you get that horrible feeling that you forgot to do something - <strong>Lines of Communication</strong></li>
<li>If you have a few spare minutes while you wait to see if something has had an impact - <strong>Lines of Communication</strong></li>
</ul>
<p><a href="https://blog.petegoo.com/2023/02/22/incident-response-lines-of-communication-enquiry/">Lines of Communication and Lines of Enquiry in Incident Response</a> was originally published by Peter Goodman at <a href="https://blog.petegoo.com">PeteGoo</a> on February 22, 2023.</p>
<p>One of the things I find most interesting about my role these days is that I get to talk with such a wide variety of people across our engineering group and hear the various challenges they face. As a result I get to observe common patterns and themes that often stem from similar issues. I’d like to talk about one of those common issues - Batch Size.</p>
<h3 id="my-experience">My experience</h3>
<p>When I joined my current company many years ago now there was something about the way that we worked that took me by surprise. It was fast, but more than that I felt more focused, I was more engaged than I had ever been and I was learning rapidly.</p>
<p>We deployed multiple times per day. We were reviewing each other’s code multiple times per day. Our poor QA (singular) was getting smashed but managing to keep up with this mayhem. How was this possible? I’d never seen this pace before.</p>
<p>The reason it worked was that we had made each change or “Pull Request” small and incremental. We had a saying that was basically “do the smallest, dumbest thing you can to learn the next thing”.</p>
<h3 id="the-science-bit">The science bit</h3>
<p>The advantages of breaking tasks down into smaller chunks is something we have all experienced in lots of aspects of our lives.</p>
<p>In <a href="https://en.wikipedia.org/wiki/Lean_manufacturing">manufacturing</a> and economics circles this is sometimes referred to as Lot or Batch Size and it’s an important factor in the throughput and efficiency of any system. Don Reinertsen does a <a href="https://www.youtube.com/watch?v=zVASqSj_kvc">really good job of explaining the theory</a> but ultimately reducing the size of the changes we make leads to reduced cycle times, consistent flow, faster feedback, reduced risk, fewer overheads, greater efficiency, higher motivation and reduced costs.</p>
<p>Our batch is a Pull Request. It is the car in our assembly line.<br />
<sub><em>(If you have a long-lived feature branch then your batch is the feature branch)</em></sub></p>
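<p>Reinertsen’s economics can be sketched with a toy model (the numbers are mine, purely illustrative): each unit of work takes some time, each batch pays a fixed transaction cost (review, QA, deploy), and we measure the average time until a unit ships. With cheap, automated transactions, small batches cut cycle time decisively; with expensive ones they don’t - which is exactly why investing in the pipeline matters:</p>

```python
def avg_cycle_time(units: int, batch: int, work: float, overhead: float) -> float:
    """Average time from start until each unit of work ships,
    given a fixed per-batch overhead (review, QA, deploy)."""
    t = total = 0.0
    shipped = 0
    while shipped < units:
        size = min(batch, units - shipped)
        t += size * work + overhead  # finish and ship this batch
        total += t * size            # every unit in it ships at time t
        shipped += size
    return total / units

# Cheap transactions (automated CI/CD): small batches win
fast = avg_cycle_time(12, batch=1, work=1.0, overhead=0.1)   # ~7.15
slow = avg_cycle_time(12, batch=12, work=1.0, overhead=0.1)  # ~12.1
```

<p>Flip the overhead up (say, a manual release process costing 10 units of time per deploy) and the model favours big batches again - reducing that transaction cost is what makes “the smallest, dumbest thing” economical.</p>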
<h3 id="how-does-this-benefit-me-you">How does this benefit me (you)?</h3>
<p>The weird thing about my experience was that it also caused me to change my behaviour and the relationship I had with my code. I used to write a bunch of code on my machine, add to it, add more, refactor, add more, test it, clean it up, write tests (whoops), double check it, triple check it, then eventually let someone else see it when I knew it was safe for me to do so. I was optimising for never being wrong, not learning to improve. I built up so much stress during this process that it was a rollercoaster of fear and insecurity. <a href="https://rework.withgoogle.com/blog/five-keys-to-a-successful-google-team/">Not good</a>.</p>
<p>In my current company however, if I had more than a day’s work on my machine I started to get nervous, like I was walking around with a wallet full of too much cash.</p>
<p>There are other benefits though. Code reviews are quick and easy. More tests are automated. QAs are able to focus on what matters. Product and UX can provide timely feedback. Course corrections happen earlier, before effort is wasted. Incident impacts are smaller and downtime is shorter because when things go wrong in production it’s easy to see what changed. We do fewer revolutions, big refactorings and rewrites. Long-lived feature branches and merge conflict resolutions are a thing of the past. We focus on continually providing value, learning from how our customers use our software and responding to their changing needs.</p>
<p>This isn’t to say that we shouldn’t design software or plan how we will implement it. This is still a skill we need to exercise, however no design or plan is perfect so why would we wait to find that out?</p>
<p><a href="https://blog.petegoo.com/2022/02/18/small-prs-and-batch-size/">Small Pull Requests and Batch Size</a> was originally published by Peter Goodman at <a href="https://blog.petegoo.com">PeteGoo</a> on February 18, 2022.</p>
<p>How inclusive is the culture of architecture and design within your teams? Is the way you communicate these designs and intent accessible to everyone? Can everyone in your organisation easily contribute ideas and share their experience?</p>
<p>For the last few years I’ve been trying to actively spend more time drawing code, architectures and concepts rather than just talking about them. I feel like this was once a lot more common in the places I have worked but seems to have become something of a rarity.</p>
<p><img src="/images/2020/this-is-my-picture.gif" alt="marker-winning" /></p>
<h2 id="the-problem-with-words">The problem with words</h2>
<p>When I first moved to New Zealand I found myself working for a company building legal software. On day one an architect took me into a room and started explaining the architecture of the product in the terminology of <a href="https://martinfowler.com/tags/domain%20driven%20design.html">Domain Driven Design (DDD)</a>, a concept I had never come across before that is packed full of specific terminology and concepts. So much so, in fact, that <a href="https://www.amazon.com/Domain-Driven-Design-Tackling-Complexity-Software/dp/0321125215/ref=sr_1_1?keywords=domain+driven+design&qid=1582796070&sr=8-1">the bible of DDD by Eric Evans</a> is 560 pages of pure gold but takes quite a few reads before it really sinks in. <em>Ironically, one of the main goals of DDD is to reduce confusion between individuals and teams by devising a shared ubiquitous language.</em></p>
<p>So here I was getting a massive download of architecture and historical context. Like a typical <a href="https://en.wikipedia.org/wiki/Impostor_syndrome">imposter dev</a>, I wasn’t going to admit that a lot of these new words and acronyms were completely foreign to me.</p>
<p>So how then was I able to understand any of what was being said at the time?</p>
<p>The simple answer is that he drew on a whiteboard with a marker as he talked. The result was that this impervious wall of nomenclature washed over me while the picture filled in the blanks that were left behind.</p>
<h2 id="so-why-are-pictures-important">So why are pictures important?</h2>
<p><strong>Drawing pictures in front of people is a <em>room leveller</em>.</strong></p>
<p>We have so much assumed context when we present to or talk with our peers that we often forget that we may have completely different backgrounds and experiences. For example:</p>
<ul>
<li>We may not all have the same level of confidence in groups and so we may not ask questions</li>
<li>We may not all have the same first spoken language</li>
<li>We may not all come from the same social circle / tech / culture or country</li>
<li>We may not all have gone to college / university or read the same comprehensive book on exotic and seldom used design patterns</li>
</ul>
<p>Simple diagrams can transcend these differences. The box that represents a thing. The line that represents a relationship of some sort, traffic, data or control flow. A big cloud of amorphous internet. A stick figure user. These shapes describe abstract concepts more universally than any specific words could.</p>
<h2 id="why-is-drawing-pictures-important">Why is drawing pictures important?</h2>
<p>There is more intent and meaning communicated in the act of drawing than just the end product.</p>
<p>I’ve often had the experience where I’ve tried to recreate a successful whiteboarding session to someone else using only the finished picture, expecting them to instantly gain the same understanding I did when it was drawn. Except now, they just look at me blankly. Why? I may even look back at the drawing and think, this is nonsense. These are the insane scribblings of an unhinged individual.</p>
<p>Watching a drawing unfold in front of you as the intent is explained builds understanding as the picture evolves. While the conversation continues the breadcrumbs of how that understanding was formed are still there in full view, reinforcing the conceptual model we have built as it is committed to memory. At the same time it allows the presenter to add a layer of language and terminology on top of that new understanding.</p>
<p>That newly accumulated understanding, language and terminology will always be rooted in the memory of the drawing. A drawing that is relevant and helpful. Without these visual memories I tend to associate what I learned with the shoes of the presenter, the smell of the room or the colour of the walls.</p>
<p>These visual anchors that are still on the board are even more useful when participation is encouraged. The terminology and scope can be expanded as those present contribute to the refinement process. This interactive part is where the diagram really comes into its own. Use it as others expand on your ideas and provide alternatives. When questions are asked, point back to the relevant elements as you answer. You may even find yourself drawing more elements or just pointing as others talk and refine your explanation.</p>
<p><img src="/images/2020/MVIMG_20191004_161559.jpg" alt="this-is-my-picture" /></p>
<h2 id="how-to-do-it">How to do it</h2>
<p>So there are some rules to follow when whiteboarding / drawing for a group</p>
<ul>
<li>No UML!!!</li>
<li>Limit the predefined shapes. If you need to draw a pipe for a queue, explain what it is and why a pipe works.</li>
<li>A database might be a cylinder but reinforce what it is as you draw it</li>
<li>Stick to boxes and lines as much as possible</li>
<li>Sure, add arrows for direction</li>
<li>Use colour sparingly. Two, three colours. After that people need a legend.</li>
<li>Try to avoid sequence diagrams. They don’t work well for asynchrony or fan out / fan in.</li>
<li>Don’t mix different levels of abstraction in the same diagram. Use an inset or callout box for detail.</li>
<li>No fricking UML!!! Nobody cares that you took the time to learn it once.</li>
</ul>
<h2 id="running-an-open-inclusive-session">Running an open inclusive session</h2>
<p>It’s a really good idea to encourage this style of presentation and communication within an organisation. Don’t just settle for the same individuals talking at the others in your teams. Diagramming and whiteboarding can make your workplace more inclusive and democratise architecture and design.</p>
<p>In our office Friday at 3pm is “This is my picture” time.</p>
<ul>
<li>It lasts an hour</li>
<li>We have beer and pizza at 4:30 so people are winding down anyway - might as well capitalise on the reluctance people have to start deep thinking.</li>
<li>We have a Slack channel where we remind folks that it is on</li>
<li>We make a list on the board of carry over items from last week</li>
<li>We ask for ideas, these can be any of:
<ul>
<li>Something you want to build</li>
<li>Something you have built</li>
<li>Something you are building</li>
<li>Something that exists</li>
<li>Something you can draw</li>
<li>Something you want to see someone else draw</li>
<li>Something you know</li>
<li>Something you want to know</li>
</ul>
</li>
<li>List the ideas on the board</li>
<li>Now ask for a show of hands on each item in turn</li>
<li>Mark the votes beside each item</li>
<li>Start with the highest voted item and ask for volunteers to draw it</li>
<li>If nobody is willing to draw it or it doesn’t get enough votes then it can carry over</li>
<li>Almost always it spills over in discussion to drinks and pizza</li>
<li>Live-Slack the list of ideas so that stragglers can opt in</li>
<li>Take photos of the presenters with the end product. Like a kid holding up their picture to the class. Post them to the Slack channel for posterity.</li>
<li>Don’t be afraid to repeat ideas weeks or months later. Try a different presenter. There will be new people present and/or different perspectives in the room.</li>
</ul>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">I run an open whiteboarding session on Fridays in our office. For Christmas my team made me some magnetic icons. All the usual ones are there AWS Buckets, Route 53, EC2, MLP, potatoes... <a href="https://t.co/KSt3n65zrP">pic.twitter.com/KSt3n65zrP</a></p>— Peter Goodman (@petegoo) <a href="https://twitter.com/petegoo/status/1207413304877932544?ref_src=twsrc%5Etfw">December 18, 2019</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<h2 id="faq">FAQ</h2>
<p><strong>Q</strong>. What about having multiple people working together on a board?<br />
<strong>A</strong>. This is ok for a discussion between two people but is confusing for a group. Try to avoid unless the two presenters have a good collaborative presentation style. You’re telling a story after all.</p>
<p><strong>Q</strong>. What about remote teams / individuals?<br />
<strong>A</strong>. Well, sometimes we try to do it on Hangouts and/or record it. This is <em>ok</em>. YMMV.</p>
<p><strong>Q</strong>. What about digital / online tools?<br />
<strong>A</strong>. Keen to hear suggestions, but these can be prohibitively expensive and clunky.</p>
<p><strong>Q</strong>. Is this just for developers?<br />
<strong>A</strong>. Absolutely not. I’ve been trying to get more QAs, UX, Designers, Product people involved.</p>
<p><a href="https://blog.petegoo.com/2020/02/26/this-is-my-picture/">This is my picture: Why you should be drawing your systems and code</a> was originally published by Peter Goodman at <a href="https://blog.petegoo.com">PeteGoo</a> on February 26, 2020.</p>https://blog.petegoo.com/2018/11/15/better-octopus-registration2018-11-15T00:00:00+13:002018-11-15T00:00:00+13:00Peter Goodmanhttps://blog.petegoo.comblog@petegoo.com<p>We use <a href="https://octopus.com">Octopus Deploy</a> for a lot of our deployments. It has great primitives that help us create simple, reliable, repeatable deployment processes.</p>
<p>If you are using it to deploy code onto machines you typically use the supplied agent, installed on your machine. Octopus calls this agent a Tentacle, naturally. The Tentacle registers itself with the API on the Octopus server. Your machine will define which roles it would like to perform and these roles can be used to define which deployment steps will run for that machine.</p>
<h2 id="the-problem">The Problem</h2>
<p>There is an issue with this registration approach however. In order to be able to register, the Tentacle needs an API key. You can scope that API key to an Environment like Test, Prod etc but not, as far as I can tell, to a role. Therefore <strong>the tentacle can ask to be any role it likes</strong>. Even if you could restrict the API key to a role, managing the keys and their scopes would be a nightmare.</p>
<p>So now we know that a machine we intend to be <code class="language-plaintext highlighter-rouge">non-sensitive-role</code>, if compromised by an attacker, can register (or re-register) itself as <code class="language-plaintext highlighter-rouge">very-sensitive-role</code>, essentially creating a form of <a href="https://en.wikipedia.org/wiki/Network_Lateral_Movement">lateral movement</a>. For example, if <code class="language-plaintext highlighter-rouge">very-sensitive-role</code> delivered some code with a database connection string and password from Octopus variables, but <code class="language-plaintext highlighter-rouge">non-sensitive-role</code> was never designed to get those secrets, then we have a problem.</p>
<p><img src="/images/2018/octopus_reg_old.png" alt="old octopus registration" /></p>
<h2 id="the-solution">The Solution</h2>
<p>So how do we get around this? Well, we wanted to eliminate the reliance on the machine telling us what role it should be. Our machines are in AWS and we can use EC2 Tags to add metadata to them. So we add an <code class="language-plaintext highlighter-rouge">OctopusRole</code> tag when AWS creates the machine (EC2 instance), with the name of the role(s) intended for that machine. You can also add <code class="language-plaintext highlighter-rouge">OctopusMachinePolicy</code> if you want.</p>
<p>Then when we want to register the machine on startup, we remove the need for access to the API by instead publishing an SNS message that simply has the EC2 Instance Id and any other useful information like the Tentacle thumbprint.</p>
<p>This SNS message triggers a Lambda which uses the EC2 APIs to query the metadata for the instance. The lambda then registers with Octopus on behalf of the instance. Octopus will subsequently reach out to the machine to establish the connection to the listening Tentacle agent.</p>
<p><img src="/images/2018/octopus_reg_new.png" alt="new octopus registration" /></p>
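<p>A minimal sketch of that Lambda, assuming a hypothetical SNS message shape and illustrative Octopus field names — this is not our exact code, and the registration payload shape is an assumption:</p>

```python
import json


def build_registration(instance_id, private_ip, thumbprint, tags, environment_id):
    """Build an Octopus machine-registration payload from EC2 tag metadata.

    The roles come from the OctopusRole EC2 tag, so the machine itself never
    gets to claim a role it was not given. Field names are illustrative, not
    an exact Octopus API contract.
    """
    roles = [role.strip() for role in tags["OctopusRole"].split(",")]
    return {
        "Name": instance_id,
        "Roles": roles,
        "EnvironmentIds": [environment_id],
        "Endpoint": {
            # Listening Tentacle: the Octopus server reaches out to the machine
            "CommunicationStyle": "TentaclePassive",
            "Uri": "https://{}:10933/".format(private_ip),
            "Thumbprint": thumbprint,
        },
    }


def handler(event, context):
    """SNS-triggered entry point (hypothetical message shape)."""
    import boto3  # available in the AWS Lambda runtime

    message = json.loads(event["Records"][0]["Sns"]["Message"])
    ec2 = boto3.client("ec2")
    reservations = ec2.describe_instances(InstanceIds=[message["instance_id"]])
    instance = reservations["Reservations"][0]["Instances"][0]
    tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
    payload = build_registration(
        message["instance_id"],
        instance["PrivateIpAddress"],
        message["thumbprint"],
        tags,
        environment_id="Environments-1",  # or resolve from another tag
    )
    # POST payload to the Octopus server's machines API using an API key held
    # by the Lambda, never by the instance itself.
    return payload
```

<p>The key property is that the API key and the role decision both live on the Lambda side; the instance only ever supplies its identity and thumbprint.</p>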
<h2 id="what-did-we-learn">What did we learn?</h2>
<p>Basically, validate your client inputs. They can lie like terrible lying things.</p>
<p>Lambda is a great piece of glue you can use to solve these types of problems. Now you can even write them in PowerShell should you so desire. Personally I write most of mine in Python, but you can choose your poison without needing to change this pattern.</p>
<p><a href="https://blog.petegoo.com/2018/11/15/better-octopus-registration/">Better Octopus Registration</a> was originally published by Peter Goodman at <a href="https://blog.petegoo.com">PeteGoo</a> on November 15, 2018.</p>https://blog.petegoo.com/2018/11/09/optimizing-ci-cd-pipelines2018-11-09T00:00:00+13:002018-11-09T00:00:00+13:00Peter Goodmanhttps://blog.petegoo.comblog@petegoo.com<p>Over the last 4 years I’ve often found myself in conversations with fellow engineers about our build and deployment process and how
we feel it has become slower or is somehow causing more friction.</p>
<p>Eventually, as you live with a release process that you use every day, you will find that you have these conversations relatively often. So how do you go about figuring out whether you have issues, and where they are?</p>
<h2 id="tldr">TL;DR</h2>
<ul>
<li>Draw it out</li>
<li>Be mindful of your bias</li>
<li>Measure everything</li>
<li>Gather feedback on outliers</li>
<li>Split and parallelize steps</li>
<li>Look for human wait times</li>
<li>Help your engineers solve their own problems earlier, before they become everyone else’s problem</li>
</ul>
<h2 id="why-measure">Why measure</h2>
<p>We all know that smaller, more frequent releases help us deliver value to our customers with less risk, and this gives us a competitive advantage. Therefore we all want more throughput in our pipelines. If you have anything more complicated than a very simple one-build, one-test-suite, one-deploy pipeline, this can be a difficult thing to achieve.</p>
<p>We use a train metaphor for the pipelines involved in the shipping of our releases. Sure you can build more trains but that comes with complexity (to really drag the metaphor out, junctions and stations). A faster train is always going to bring benefits to your continuous delivery pipeline. Build faster trains.</p>
<p>To do this you need to measure how fast your releases are.</p>
<p>As an aside, I used to think that the number of releases performed per day was the best statistic to track. It is interesting but, to be honest, it’s basically bragging rights in a lot of cases.</p>
<p><strong>What you should care about is how fast you can release when you need to, not how many times you release per day/week/month</strong></p>
<h2 id="optimize-wisely-not-with-bias">Optimize Wisely, Not With Bias</h2>
<p>Most software engineers, me included, have their own bias about what they think is the worst, slowest, flakiest part of the release pipeline. This opinion comes not from observation and measurement but from scar tissue and technical preference. Resist the urge to optimize for what you think the problem is. Measure it and make informed decisions about where to spend your time.</p>
<h2 id="what-you-will-need">What you will need</h2>
<ul>
<li>A pen and paper</li>
</ul>
<p>or even better</p>
<ul>
<li>A whiteboard and marker</li>
</ul>
<p>Yeah, this isn’t really about tools; it can be, but it doesn’t have to be.</p>
<p>I’m a real believer in the power of diagramming, but I’m not talking about UML here. In fact I’m specifically talking about NOT UML. Boxes, lines and words are what you need. Patterns &amp; Practices, acronyms and specific terminology can be incredibly divisive. Diagrams are a room leveller; they bring everybody into the same conversation, losing the fewest participants along the way.</p>
<h2 id="getting-started">Getting Started</h2>
<p>Think about the start of your deployment pipeline and draw the first box. Don’t go back to requirements gathering or some nonsense like that. Start with a Pull Request for example, or a merge, or in our case, joining our ship-it train.</p>
<h2 id="draw-the-process">Draw the Process</h2>
<p>You may have a CI/CD tool that has pipelines, but I guarantee there are more things involved here, so draw it manually; it will free you from the constraints of the pipeline tool.</p>
<p>From there you want to start thinking about the stages that happen up until the point that the automation starts. There may be none if you started with a merge or there may be some human co-ordination involved.</p>
<p>This is key, <strong>you need to capture the human processes too</strong>.</p>
<p>Draw each state in the state machine. Connect them with lines to show the sequence. For us this looks like:</p>
<ul>
<li>Join</li>
<li>Roll Call</li>
<li>Merge</li>
</ul>
<p>When you come to the chain of builds, draw each build stage and try to represent the fan-out/fan-in of parallel tasks; this will become important later.</p>
<p>I chose to draw the stages as a vertical pipeline, then switched to horizontal for the builds.</p>
<p>Your process should end with the point at which you are happy with the release in production.</p>
<p><img src="/images/2018/pipeline1.png" alt="pipeline1" /></p>
<h2 id="measure-the-builds-steps-and-stages">Measure the Builds, Steps and Stages</h2>
<p>The next step is to add timings to the steps involved. I find it easiest to start with the builds.</p>
<p>Look at your builds and test runs and take a sample of timings for each type. Figure out what the median is and write it next to that build step or test run box in your pipeline drawing.</p>
<p>At this point you have some timings, and there are things we can infer and optimize (which you will see later), but resist the urge to concentrate on the automation. Often the biggest problems and most effective changes can be found elsewhere.</p>
<p>Note the deployment step times, for each environment. Some environments for us take longer because they have more machines.</p>
<p>Do the same with the manual steps and stages. In our pipeline we use a bot to orchestrate the pipeline stages; it co-ordinates the human workflow in a simple state machine by listening for prompts from the engineers involved in the release. The bot posts the current stage into Slack, and I use the timestamps of those Slack messages to write down some timings. If you only have normal human Slack conversations, try to determine the start of each stage from the timeline, or encourage folks to post the stage for a few days to get these numbers. Again, take the median and add it next to each stage or step.</p>
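<p>As a sketch, deriving stage durations from those message timestamps is just a diff between consecutive events; the timestamps and stage names below are made up:</p>

```python
from datetime import datetime
from statistics import median


def stage_minutes(events):
    """Minutes spent in each stage: time until the next stage's message."""
    times = [datetime.fromisoformat(ts) for ts, _ in events]
    return {
        stage: (times[i + 1] - times[i]).total_seconds() / 60
        for i, (_, stage) in enumerate(events[:-1])
    }


# One release's worth of bot messages (hypothetical data)
release = [
    ("2018-11-09T09:00:00", "Roll Call"),
    ("2018-11-09T09:12:00", "Merge"),
    ("2018-11-09T09:47:00", "Deploy Test"),
    ("2018-11-09T10:30:00", "Shipped"),
]
durations = stage_minutes(release)
# durations == {"Roll Call": 12.0, "Merge": 35.0, "Deploy Test": 43.0}

# Across many releases, write the per-stage median on the board
print(median([12.0, 9.0, 25.0]))  # 12.0
```

<p>The median matters more than the mean here: one pathological release shouldn’t drag your “normal” number around.</p>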
<h2 id="note-your-end-to-end-pipeline-time">Note your End to End Pipeline Time</h2>
<p>For us, I like to measure from Roll Call to the start of the next pipeline. To me, this is the time it takes us to ship a release.</p>
<p>Decide what your end-to-end pipeline measure is and take note of the time it takes. Improving this metric is your goal.</p>
<h2 id="track-why-some-releases-take-longer">Track Why Some Releases Take Longer</h2>
<p>Now that you have a timing for how long this normally takes, as an engineering team, start recording why it sometimes takes longer.</p>
<p>Some common examples are:</p>
<ul>
<li>Complex manual testing
<ul>
<li>The changes touched a lot of things so it needed more manual testing</li>
</ul>
</li>
<li>Re-work in the pipeline
<ul>
<li>Compile errors</li>
<li>Test failures</li>
<li>Reverts</li>
</ul>
</li>
<li>People orchestration
<ul>
<li>Key people were in meetings, out to lunch</li>
<li>I didn’t notice that I was up / required to do something</li>
<li>A failed build wasn’t noticed until some time later</li>
<li>A key person e.g. tester had too many things to do</li>
</ul>
</li>
</ul>
<h2 id="parallelize">Parallelize</h2>
<p>Look at your build steps and test suites with their timings. You can now see some optimizations where parallelism can be a deciding factor in what you do next.</p>
<h3 id="can-you-run-some-things-in-parallel">Can you run some things in parallel?</h3>
<p>Some steps can be easily parallelized. If you have 5 consecutive test runs, can you do them in parallel? Your CI tool can most likely orchestrate this for you.</p>
<p>We chose to run some of our tests in parallel with the deploy to our test environment. Eventually we even ran our unit tests in parallel with our deploy to test. We made this decision because we looked at the failure rate of unit tests. They didn’t fail. They had already been run and passed on the developer’s machine and then again on the Pull Request branches before merge so we knew they were good. Sure, a merge could create a problem but this was so rare it was worth the risk of re-work in the pipeline.</p>
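<p>The wall-clock effect of fanning out is easy to see in a toy sketch; the suite names and timings are hypothetical, and in practice your CI tool does the fan-out for you:</p>

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical suite timings in minutes, scaled down so the sketch runs fast
suites = {"unit": 5, "integration": 10, "browser": 8}


def run_suite(name):
    """Stand-in for a real test run."""
    time.sleep(suites[name] * 0.01)
    return name


start = time.monotonic()
with ThreadPoolExecutor(max_workers=len(suites)) as pool:
    finished = list(pool.map(run_suite, suites))
elapsed = time.monotonic() - start

# Wall clock is roughly the longest suite (10), not the sum (23) —
# which is also why shaving time off the 5-minute suite gains nothing.
print(sorted(finished))  # ['browser', 'integration', 'unit']
```

<p>The same arithmetic drives the “don’t make small things faster” point below: once suites run concurrently, only the longest path matters.</p>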
<h3 id="split-builds">Split Builds</h3>
<p>Look at the longest test suites. Can they be split into multiple parallel build steps / test runs?</p>
<h3 id="dont-spend-time-making-small-things-faster">Don’t Spend Time Making Small Things Faster</h3>
<p>Look at the timings. If Test Suite A takes 5 mins and Test Suite B runs in parallel and takes 10 minutes, don’t spend time on Test Suite A trying to make it faster, it won’t affect your end-to-end timings.</p>
<h2 id="look-for-human-wait-times">Look for Human Wait Times</h2>
<p>Sure, sometimes we are waiting on the computers to do build things or test things. Often, however, it is the co-ordination of the meat-bags (humans) that is the problem.</p>
<p>For example, we use a build bot modelled on <a href="http://pushtrain.club/">the Etsy train</a> but implemented in Slack. We call it <code class="language-plaintext highlighter-rouge">C3-PR</code> (PR for Pull Requests). One thing we found is that if we mention the people in the carriage we have a better chance of having them perform the tasks we need them to, like Merge, Deploy, etc. If you have no human involvement in your pipeline then I commend you, but most folks I talk to have some human involvement, at least in failure scenarios. These human factors can therefore be very important in realising maximum throughput in your pipeline.</p>
<h2 id="be-kind-to-your-people">Be Kind To Your People</h2>
<ul>
<li>Can your build tool notify people earlier that a test has failed and continue on with the rest or does it have to wait until the end of the test suite?</li>
<li>Could an Engineer have found the source of re-work (build / test failure) earlier on their machine or the Pull Request before it was merged into master? In other words <strong>Help your engineers solve their own problems earlier, before they become everyone else’s problem</strong></li>
</ul>
<h2 id="conclusion">Conclusion</h2>
<p>Hopefully this framework can help you optimize your own CI/CD pipelines. It has certainly helped me over the years when reasoning about where to spend time and why.</p>
<p><a href="https://blog.petegoo.com/2018/11/09/optimizing-ci-cd-pipelines/">Measuring and Improving your CI/CD Pipelines</a> was originally published by Peter Goodman at <a href="https://blog.petegoo.com">PeteGoo</a> on November 09, 2018.</p>https://blog.petegoo.com/2018/04/16/concourse-aws-lifecycle-hooks2018-04-16T00:00:00+12:002018-04-16T00:00:00+12:00Peter Goodmanhttps://blog.petegoo.comblog@petegoo.com<p>Lately we’ve been running <a href="https://concourse-ci.org/">Concourse CI</a> for a bunch of our builds.
We really love Concourse for the pipeline features, ease of configuration, and docker primitives.
However, operating and feeding Concourse can be a voyage of discovery and sometimes sadness.</p>
<p>One of the issues with Concourse is that it doesn’t really like it when workers disappear on it.
The workers will appear as <code class="language-plaintext highlighter-rouge">stalled</code> if you run <code class="language-plaintext highlighter-rouge">fly workers</code>. This means that any resources that
are performing <code class="language-plaintext highlighter-rouge">check</code> operations for new versions will be stuck and not trigger builds.
You then need to <code class="language-plaintext highlighter-rouge">prune-worker</code> if you want your builds to keep working.</p>
<p>This post aims to give you the basics of better worker lifecycle management, so you can simply roll the instances in your worker pool Auto-Scaling Group (ASG) when you want some fresh ones, without the annoyance of having to clear out those stalled workers.</p>
<h2 id="lifecycle-hook">Lifecycle Hook</h2>
<p>Hopefully you are running your Concourse workers in an Auto-Scaling Group. When your ASG removes
these instances nothing will tell Concourse that they are dead. To make this happen you need to
create an <a href="https://docs.aws.amazon.com/autoscaling/ec2/userguide/lifecycle-hooks.html">Auto-Scaling Lifecycle Hook</a>.</p>
<p><a href="https://docs.aws.amazon.com/autoscaling/ec2/userguide/lifecycle-hooks.html#adding-lifecycle-hooks">Create a lifecycle hook</a> for termination called <code class="language-plaintext highlighter-rouge">worker-terminating</code>.</p>
<p>Add the following script as a cron job that runs every minute or two.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="c"># Need this path to allow aws command to work</span>
<span class="nv">PATH</span><span class="o">=</span><span class="nv">$PATH</span>:/usr/local/bin
<span class="nv">instance_id</span><span class="o">=</span><span class="si">$(</span>curl <span class="nt">-s</span> http://169.254.169.254/latest/meta-data/instance-id/<span class="si">)</span>
<span class="nv">lifecycleState</span><span class="o">=</span><span class="si">$(</span>aws autoscaling describe-auto-scaling-instances <span class="nt">--instance-ids</span> <span class="nv">$instance_id</span> <span class="nt">--query</span> <span class="s1">'AutoScalingInstances[0].LifecycleState'</span> <span class="nt">--output</span> text <span class="nt">--region</span> us-west-2<span class="si">)</span>
<span class="k">if</span> <span class="o">[</span> <span class="s2">"</span><span class="nv">$lifecycleState</span><span class="s2">"</span> <span class="o">==</span> <span class="s2">"Terminating:Wait"</span> <span class="o">]</span><span class="p">;</span> <span class="k">then
</span><span class="nv">asg</span><span class="o">=</span><span class="si">$(</span>aws autoscaling describe-auto-scaling-instances <span class="nt">--instance-ids</span> <span class="nv">$instance_id</span> <span class="nt">--query</span> <span class="s1">'AutoScalingInstances[0].AutoScalingGroupName'</span> <span class="nt">--output</span> text <span class="nt">--region</span> us-west-2<span class="si">)</span>
<span class="c"># We store the TSA Host parameter</span>
<span class="nv">TSA_HOST</span><span class="o">=</span><span class="s2">"my.tsa.host"</span>
concourse retire-worker <span class="se">\</span>
<span class="nt">--name</span> <span class="si">$(</span><span class="nb">hostname</span><span class="si">)</span> <span class="se">\</span>
<span class="nt">--tsa-host</span> <span class="nv">$TSA_HOST</span> <span class="se">\</span>
<span class="nt">--tsa-public-key</span> /path/to/tsa-public-key <span class="se">\</span>
<span class="nt">--tsa-worker-private-key</span> /path/to/tsa-worker-private-key
<span class="c"># Sleep for 10 minutes to let the builds finish. I know, not ideal but it works for now</span>
<span class="nb">sleep </span>10m
service concourse-worker stop
aws autoscaling complete-lifecycle-action <span class="se">\</span>
<span class="nt">--instance-id</span> <span class="nv">$instance_id</span> <span class="se">\</span>
<span class="nt">--auto-scaling-group-name</span> <span class="nv">$asg</span> <span class="se">\</span>
<span class="nt">--lifecycle-hook-name</span> <span class="s2">"worker-terminating"</span> <span class="se">\</span>
<span class="nt">--lifecycle-action-result</span> <span class="s2">"CONTINUE"</span> <span class="se">\</span>
<span class="nt">--region</span> us-west-2
<span class="k">fi</span>
</code></pre></div></div>
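<p>The schedule itself can be a one-line crontab entry; the script path, interval, and log location here are illustrative:</p>

```shell
# /etc/cron.d/concourse-worker-lifecycle — run the check every 2 minutes
*/2 * * * * root /usr/local/bin/check-worker-lifecycle.sh >> /var/log/worker-lifecycle.log 2>&1
```

<p>Redirecting output to a log file makes it much easier to debug the retire/complete-lifecycle steps when a worker refuses to drain.</p>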
<p><a href="https://blog.petegoo.com/2018/04/16/concourse-aws-lifecycle-hooks/">Concourse on AWS: Worker lifecycle management</a> was originally published by Peter Goodman at <a href="https://blog.petegoo.com">PeteGoo</a> on April 16, 2018.</p>https://blog.petegoo.com/2016/05/10/packer-aws-windows2016-05-10T00:00:00+12:002016-05-10T00:00:00+12:00Peter Goodmanhttps://blog.petegoo.comblog@petegoo.com<p>Getting a Packer build to work with the AWS EBS builder is pretty easy. Getting it to work for Windows can be a series of less-than-obvious discoveries. I had issues trying to find a concise guide on how to get the various pieces working together, so here it is.</p>
<p><a href="https://github.com/PeteGoo/packer-win-aws">All code available here</a></p>
<h2 id="the-goal">The goal</h2>
<p>We want Packer to create an EC2 AMI using a powershell initialization script. To achieve this Packer will create a new EC2 instance, run our script and then take an image of it before terminating our builder instance. We need any communication with the builder instance to use https rather than http so there is something approaching secure communication (although here we will use a self-signed cert, created on the instance itself).</p>
<ul>
<li>Builder: amazon-ebs</li>
<li>Provisioner: powershell</li>
</ul>
<h2 id="using-the-amazon-ebs-builder">Using the amazon-ebs builder</h2>
<p>The amazon-ebs builder is actually pretty good. The configuration is well documented and the config will end up looking something like below:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"builders": [{
"type": "amazon-ebs",
"region": "us-east-1",
"source_ami": "ami-3d787d57",
"instance_type": "m3.medium",
"ami_name": "windows-ami"
}]
}
</code></pre></div></div>
<h2 id="winrm-and-the-infinite-sadness">WinRM and the infinite sadness</h2>
<p>The next issue is that we need to be able to add a provisioner so we can run some scripts on the new builder instance. On Linux boxes this is pretty standard, as SSH actually works. Unfortunately, on Windows, in order to run PowerShell remotely on the Packer builder instance we have to use PowerShell remoting, and that means WinRM.</p>
<p>WinRM was originally designed for a world that was built on WS-*, SOAP and Kerberos authentication in Windows domains. Hence it has been plagued by configuration woes since it was first introduced. Getting it to work for Packer over the internet can be a pain.</p>
<p>So let’s tell Packer to use winrm.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"builders": [{
"type": "amazon-ebs",
"region": "us-east-1",
"source_ami": "ami-3d787d57",
"instance_type": "m3.medium",
"ami_name": "windows-ami",
"user_data_file":"./ec2-userdata.ps1",
"communicator": "winrm",
"winrm_username": "Administrator"
}]
}
</code></pre></div></div>
<p>If you run this you will probably end up with the dreaded <code class="language-plaintext highlighter-rouge">waiting for winrm to become available</code> message from Packer that just sits there looking at you. This means that WinRM is not configured on the instance.</p>
<p>To resolve this problem we need to run a script on the builder instance to bootstrap WinRM. The way we tell an EC2 instance to run a script on first startup is the <a href="http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-metadata.html#instancedata-add-user-data">UserData</a> script. On Windows this script <a href="http://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/ec2-instance-metadata.html">can contain</a> a <code class="language-plaintext highlighter-rouge"><powershell></powershell></code> section.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><powershell>
write-output "Running User Data Script"
write-host "(host) Running User Data Script"
Set-ExecutionPolicy Unrestricted -Scope LocalMachine -Force -ErrorAction Ignore
# Don't set this before Set-ExecutionPolicy as it throws an error
$ErrorActionPreference = "stop"
# Remove HTTP listener
Remove-Item -Path WSMan:\Localhost\listener\listener* -Recurse
# WinRM
write-output "Setting up WinRM"
write-host "(host) setting up WinRM"
cmd.exe /c winrm quickconfig -q
cmd.exe /c winrm quickconfig '-transport:http'
cmd.exe /c winrm set "winrm/config" '@{MaxTimeoutms="1800000"}'
cmd.exe /c winrm set "winrm/config/winrs" '@{MaxMemoryPerShellMB="1024"}'
cmd.exe /c winrm set "winrm/config/service" '@{AllowUnencrypted="true"}'
cmd.exe /c winrm set "winrm/config/client" '@{AllowUnencrypted="true"}'
cmd.exe /c winrm set "winrm/config/service/auth" '@{Basic="true"}'
cmd.exe /c winrm set "winrm/config/client/auth" '@{Basic="true"}'
cmd.exe /c winrm set "winrm/config/service/auth" '@{CredSSP="true"}'
cmd.exe /c winrm set "winrm/config/listener?Address=*+Transport=HTTP" '@{Port="5985"}'
cmd.exe /c netsh advfirewall firewall set rule group="remote administration" new enable=yes
cmd.exe /c netsh firewall add portopening TCP 5985 "Port 5985"
cmd.exe /c net stop winrm
cmd.exe /c sc config winrm start= auto
cmd.exe /c net start winrm
cmd.exe /c wmic useraccount where "name='vagrant'" set PasswordExpires=FALSE
</powershell>
</code></pre></div></div>
<p>We can now try to run the packer build</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>packer build template.json
</code></pre></div></div>
<h2 id="but-winrm-still-cant-connect">But WinRM still can’t connect?</h2>
<p>If you still get the <code class="language-plaintext highlighter-rouge">waiting for winrm to become available</code> message and it doesn’t progress after a few minutes, then something may have gone wrong in the above script. To diagnose the issue, run Packer with the debug flag.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>packer build -debug template.json
</code></pre></div></div>
<p>Grab the Administrator login from the Packer output; you will need it. Then add an inbound RDP rule on the Packer build instance’s security group so you can RDP to it. Look for the log at <code class="language-plaintext highlighter-rouge">C:\Program Files\Amazon\Ec2ConfigService\Logs\Ec2ConfigLog.txt</code>. You may need to add logging to the above script to figure out what is going wrong.</p>
<h2 id="but-the-security-man">But the security man!</h2>
<p>OK, so this script works, but the communication is over plain HTTP, which is a little less than ideal. To make this HTTPS we can generate a new certificate on the machine and use that. We switch the port to <code class="language-plaintext highlighter-rouge">5986</code> and tell WinRM we are using HTTPS.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><powershell>
write-output "Running User Data Script"
write-host "(host) Running User Data Script"
Set-ExecutionPolicy Unrestricted -Scope LocalMachine -Force -ErrorAction Ignore
# Don't set this before Set-ExecutionPolicy as it throws an error
$ErrorActionPreference = "stop"
# Remove HTTP listener
Remove-Item -Path WSMan:\Localhost\listener\listener* -Recurse
$Cert = New-SelfSignedCertificate -CertstoreLocation Cert:\LocalMachine\My -DnsName "packer"
New-Item -Path WSMan:\LocalHost\Listener -Transport HTTPS -Address * -CertificateThumbPrint $Cert.Thumbprint -Force
# WinRM
write-output "Setting up WinRM"
write-host "(host) setting up WinRM"
cmd.exe /c winrm quickconfig -q
cmd.exe /c winrm set "winrm/config" '@{MaxTimeoutms="1800000"}'
cmd.exe /c winrm set "winrm/config/winrs" '@{MaxMemoryPerShellMB="1024"}'
cmd.exe /c winrm set "winrm/config/service" '@{AllowUnencrypted="true"}'
cmd.exe /c winrm set "winrm/config/client" '@{AllowUnencrypted="true"}'
cmd.exe /c winrm set "winrm/config/service/auth" '@{Basic="true"}'
cmd.exe /c winrm set "winrm/config/client/auth" '@{Basic="true"}'
cmd.exe /c winrm set "winrm/config/service/auth" '@{CredSSP="true"}'
cmd.exe /c winrm set "winrm/config/listener?Address=*+Transport=HTTPS" "@{Port=`"5986`";Hostname=`"packer`";CertificateThumbprint=`"$($Cert.Thumbprint)`"}"
cmd.exe /c netsh advfirewall firewall set rule group="remote administration" new enable=yes
cmd.exe /c netsh firewall add portopening TCP 5986 "Port 5986"
cmd.exe /c net stop winrm
cmd.exe /c sc config winrm start= auto
cmd.exe /c net start winrm
</powershell>
</code></pre></div></div>
<h2 id="adding-the-provisioner">Adding the provisioner</h2>
<p>Finally we can add our provisioner to our template.json.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"builders": [{
"type": "amazon-ebs",
"region": "us-east-1",
"source_ami": "ami-3d787d57",
"instance_type": "m3.medium",
"ami_name": "windows-ami",
"user_data_file":"./ec2-userdata.ps1",
"communicator": "winrm",
"winrm_username": "Administrator",
"winrm_use_ssl": true,
"winrm_insecure": true
}],
"provisioners": [
{
"type": "powershell",
"script": "init.ps1"
}
]
}
</code></pre></div></div>
<p>Notice that we are now specifying <code class="language-plaintext highlighter-rouge">winrm_use_ssl</code>. The inclusion of <code class="language-plaintext highlighter-rouge">winrm_insecure</code> means that the Packer client will not verify the certificate chain, which would obviously fail for our self-signed certificate.</p>
<p>We can now add whatever setup we need into our init.ps1 script which will run over our (slightly more) secure WinRM connection.</p>
<p>The entire repo for this sample can be found at <a href="https://github.com/PeteGoo/packer-win-aws">https://github.com/PeteGoo/packer-win-aws</a>.</p>
<p><a href="https://blog.petegoo.com/2016/05/10/packer-aws-windows/">Getting Packer to work for Windows on AWS</a> was originally published by Peter Goodman at <a href="https://blog.petegoo.com">PeteGoo</a> on May 10, 2016.</p>https://blog.petegoo.com/2016/05/10/codemania-talk2016-05-10T00:00:00+12:002016-05-10T00:00:00+12:00Peter Goodmanhttps://blog.petegoo.comblog@petegoo.com<p>Speaking at <a href="http://codemania.io">codemania.io</a> was one of the scariest and most awesome experiences of my career to date.</p>
<p>The concept of the talk was that the secret of going fast, safely, is to raise the visibility of the system you develop and operate.
Below are the slides from that talk.</p>
<script async="" class="speakerdeck-embed" data-id="7bdaee9c5eb4432cad9002bcb082adc8" data-ratio="1.77777777777778" src="//speakerdeck.com/assets/embed.js"></script>
<h1 id="links">Links</h1>
<ul>
<li><a href="https://www.youtube.com/watch?v=LdOe18KhtT4">10+ Deploys Per Day: Dev and Ops Cooperation at Flickr</a></li>
<li><a href="http://www.jedi.be/blog/2010/02/12/what-is-this-devops-thing-anyway/">What is this devops thing, anyway?</a></li>
<li><a href="https://www.youtube.com/watch?v=czes-oa0yik">Metrics, metrics, everywhere - Coda Hale</a></li>
<li><a href="https://channel9.msdn.com/Shows/DevOps-Dimension/6--Blameless-Postmortems-with-PushPay">Blameless post-mortems at Pushpay</a></li>
<li><a href="https://github.com/lukevenediger/statsd.net">Statsd.Net</a></li>
<li><a href="http://librato.com">Librato</a></li>
</ul>
<p><a href="https://blog.petegoo.com/2016/05/10/codemania-talk/">Slides from my talk at codemania</a> was originally published by Peter Goodman at <a href="https://blog.petegoo.com">PeteGoo</a> on May 10, 2016.</p>https://blog.petegoo.com/2015/03/15/devops-talk2015-03-15T00:00:00+13:002015-03-15T00:00:00+13:00Peter Goodmanhttps://blog.petegoo.comblog@petegoo.com<p>On February 19th I <a href="http://www.meetup.com/AKL-NET/events/220001017/">gave a talk at the Auckland.Net meetup</a> titled “Devops for the .Net Developer”. The idea of this talk was to present the context that gave rise to the DevOps movement, outlining it’s drivers, principles and guiding practices and then frame all of this in terms that apply to the average .Net development shop. In other words, sharing a lot of the knowledge I have gained over the last year working in the DevOps space.</p>
<p>I think it was pretty well received. Below are the slides from that talk.</p>
<script async="" class="speakerdeck-embed" data-id="d0803340e6284af793894f9b18a2d42e" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script>
<h1 id="links">Links</h1>
<ul>
<li><a href="https://www.youtube.com/watch?v=LdOe18KhtT4">10+ Deploys Per Day: Dev and Ops Cooperation at Flickr</a></li>
<li><a href="https://github.com/PeteGoo/tcSlackBuildNotifier">Slack Build Notifier</a></li>
<li><a href="http://librato.com">Librato</a></li>
<li><a href="https://github.com/lukevenediger/statsd.net">Statsd.Net</a></li>
<li><a href="https://github.com/peschuster/graphite-client">Graphite Client (perf counters over statsd)</a></li>
<li><a href="https://dpxdt-test.appspot.com/">Depicted visual diffing</a></li>
</ul>
<p><a href="https://blog.petegoo.com/2015/03/15/devops-talk/">Slides from 'Devops for the .Net developer'</a> was originally published by Peter Goodman at <a href="https://blog.petegoo.com">PeteGoo</a> on March 15, 2015.</p>https://blog.petegoo.com/2015/03/14/teamcity-github2015-03-14T00:00:00+13:002015-03-14T00:00:00+13:00Peter Goodmanhttps://blog.petegoo.comblog@petegoo.com<p>For a while now I’ve been using the GitHub + TeamCity + Slack combination and I thought it would be useful to write down the various tactics and tools for getting the most out of this pretty common configuration of tools.</p>
<p>You don’t need to be using all these tools and services to get something out of these posts but the combination of all three can be pretty powerful.</p>
<p>This post will concentrate on getting the most out of TeamCity + GitHub.</p>
<p>The first thing you probably have already done is to configure GitHub as a “VCS Root” in TeamCity. If you haven’t, then <a href="https://confluence.jetbrains.com/display/TCD8/Git+%28JetBrains%29">follow the instructions</a> to get it set up. Note that you can now <a href="https://www.youtube.com/watch?v=_FzdCC9imDs">create a project from a URL</a>; this is often the easiest way to set up a new project around a GitHub repository.</p>
<h2 id="matching-teamcity-users-with-github-users">Matching TeamCity users with GitHub users</h2>
<p>Obviously GitHub and TeamCity have their own lists of users. With GitHub and other git services like BitBucket it is important to understand that your GitHub user account is not automatically stamped against each commit you make. In fact, it is up to you to make sure that the correct name and email are configured in the git clone on your machine so that your commits are correctly attributed to you. GitHub will then do its best to show your avatar against your commits and track your stats by looking at your commits. If, however, this information is not configured correctly, there is no easy way to fix the attribution on existing commits without rewriting history.</p>
<p>So the first step is to <a href="https://help.github.com/articles/setting-your-username-in-git/">correctly configure</a> your user.name and user.email git configuration.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git config user.name "Joe Bloggs"
git config user.email "joe.bloggs@example.com"
</code></pre></div></div>
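As an aside, those two commands simply write plain entries into the repo’s .git/config file (or ~/.gitconfig when run with --global), and that file is all git consults when stamping an author on a commit. The sketch below illustrates this with Python’s configparser against a throwaway folder — an approximation only, since git’s config format is merely INI-like and in practice you should always let git write this file:

```python
import configparser
import os
import tempfile

# Fake repo folder; in reality git creates .git and manages this file itself.
repo = tempfile.mkdtemp()
os.makedirs(os.path.join(repo, ".git"))
config_path = os.path.join(repo, ".git", "config")

# Equivalent of: git config user.name / git config user.email
config = configparser.ConfigParser()
config["user"] = {"name": "Joe Bloggs", "email": "joe.bloggs@example.com"}
with open(config_path, "w") as f:
    config.write(f)

# Read it back, roughly as git does when attributing a commit.
check = configparser.ConfigParser()
check.read(config_path)
author = "%s <%s>" % (check["user"]["name"], check["user"]["email"])
print(author)
```

The point is that this identity lives entirely in local config — nothing ties it to your GitHub account except the matching email address.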
<p>The next thing to do is to make sure that TeamCity is configured to correctly match this information with your TeamCity user. In the Advanced VCS Root settings you will see the following section:</p>
<p><img src="/images/2015/03/teamcity.usernamestyle.png" alt="Username style settings in VCS Roots" /></p>
<p>In my experience the best thing to do is to leave this setting at the default (UserId). This will take the first part of the email configured above and use that to match against TeamCity usernames. This results in the most predictable behaviour, as people tend to have various names configured but the email address will usually be quite consistent. We then match our TeamCity usernames with our company email address names.</p>
<p>If the usernames don’t match you can go to your user profile in TeamCity and customise the username that will be associated with all VCS roots, a specific VCS Root or even all Git VCS Roots.</p>
<p>The only place the above falls down is when you also use GitHub for personal projects and you end up committing with multiple different email addresses by accident because you have a global default set. Unfortunately TeamCity doesn’t allow you to set up multiple alternative usernames, so some of your commits won’t get matched. Hopefully this will be resolved at some point in TeamCity.</p>
<h2 id="reporting-build-status-to-github">Reporting build status to GitHub</h2>
<p>One of the coolest features in GitHub is the ability to have your build process report progress to GitHub. The result of this is that your branches, commits and pull requests will be marked as pending, failed or succeeded. This really comes into its own with Pull Requests.</p>
<p><img src="/images/2015/03/branches-with-status.png" alt="Branches view with build status" /></p>
<p><img src="/images/2015/03/pr-with-build-status.png" alt="Pull Request view with build status" /></p>
<p>To enable TeamCity to be able to tell GitHub about the build status you need to download and install the <a href="https://github.com/jonnyzzz/TeamCity.GitHub">TeamCity.GitHub plugin</a>.</p>
<p>Note that you can upload plugin .zip files to the plugins folder using the administration pages on TeamCity; just remember to restart the service for the change to take effect. Also note that for the pull request part to work you will need to make sure you are building branches and PRs as required (see below).</p>
<h2 id="building-branches-and-pull-requests">Building branches and pull requests</h2>
<p>By default TeamCity will probably only be building <code class="language-plaintext highlighter-rouge">master</code>. To enable other branches to get built you will have to also add a <a href="https://confluence.jetbrains.com/display/TCD8/Working+with+Feature+Branches#WorkingwithFeatureBranches-Configuringbranches">branch specification on the VCS Root settings</a>.</p>
<p><img src="/images/2015/03/branch-specification.png" alt="Branch specification" /></p>
<p>The branch specification syntax takes a little getting used to but here are some useful examples. Note that each one should appear on a separate line.</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">+:<default></code> - include master</li>
<li><code class="language-plaintext highlighter-rouge">+:refs/heads/(*)</code> - include all branches</li>
<li><code class="language-plaintext highlighter-rouge">-:refs/heads/(spikes-*)</code> - exclude any branches that start with <code class="language-plaintext highlighter-rouge">spikes-</code></li>
<li><code class="language-plaintext highlighter-rouge">+:refs/pull/(*)/head</code> - include all pull requests</li>
<li><code class="language-plaintext highlighter-rouge">+:refs/pull/(*)/merge</code> - include the merge result of pull requests (see below)</li>
</ul>
<p>The parentheses <code class="language-plaintext highlighter-rouge">()</code> allow you to specify the part of the branch syntax that will be used as the branch name in the TeamCity UI.</p>
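Putting those examples together, a complete branch specification covering the default branch, feature branches (minus spikes) and pull request heads might look like the following — an illustrative combination rather than a recommended setup, with one rule per line:

```plaintext
+:<default>
+:refs/heads/(*)
-:refs/heads/(spikes-*)
+:refs/pull/(*)/head
```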
<p><a href="http://blog.jetbrains.com/teamcity/2013/02/automatically-building-pull-requests-from-github-with-teamcity/">Building the merge result of pull requests</a> with <code class="language-plaintext highlighter-rouge">refs/pull/(*)/merge</code> is a pretty cool idea. Basically it means that when GitHub detects that the potential merge result of a pull request would change, a build will trigger that not only looks at the PR but attempts to merge it into the parent branch, as if someone had pressed the green <code class="language-plaintext highlighter-rouge">merge</code> button in GitHub, before building the code. This seems cool, but there are a number of problems, mostly that the builds will be triggered ALL THE TIME and your build queue gets swamped with all your PRs building. For example, when someone merely looks at a PR on github.com <a href="https://twitter.com/bradwilson/status/574702084370509824">it will trigger a new build</a> if GitHub detects that something could change in the merge result; we found that as people were skimming over PRs on github.com our TeamCity server got completely swamped. Therefore we don’t use this feature; instead we just don’t keep long-running feature branches.</p>
<blockquote class="twitter-tweet" lang="en"><p>I like that I can get <a href="https://twitter.com/teamcity">@teamcity</a> to auto-build PRs from <a href="https://twitter.com/github">@github</a>. I really hate that looking at a PR on <a href="https://twitter.com/github">@github</a> causes another build.</p>— Brad Wilson (@bradwilson) <a href="https://twitter.com/bradwilson/status/574702084370509824">March 8, 2015</a></blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Note that if you set up a VCS Trigger to initiate your builds when someone has pushed code to GitHub, then you can also specify the same branch syntax in the trigger branch filter field.</p>
<h2 id="triggering-new-builds-when-someone-pushed-code">Triggering new builds when someone pushes code</h2>
<p>By default TeamCity can be configured with a VCS trigger that polls the git repository looking for changes. The catch, of course, is that after you push your code you have to wait for the next poll interval before anything happens.</p>
<p>If your TeamCity server can be reached on the open internet then you can ask GitHub.com to tell TeamCity that changes have been made the instant someone pushes code to GitHub. To do this, go to the Settings of your repository, then add the TeamCity service from the WebHooks and Services panel. It may require a username and password unless you have guest access enabled.</p>
<h1 id="using-a-mac-os-teamcity-agent-with-a-windows-teamcity-server">Using a Mac OS TeamCity agent with a Windows TeamCity Server</h1>
<p>This is more of a warning around a very specific set of circumstances. If the following is true:</p>
<ul>
<li>You have a Windows TeamCity server</li>
<li>You set up a Mac OS X TeamCity agent (e.g. to run iOS builds)</li>
<li>Your repo has symlinks in it (like in cucumber / calabash tests)</li>
</ul>
<p>The JGit client used in TeamCity can be a royal pain-in-the-ass sometimes. In the above scenario it will turn those symlinks into useless empty files that freak your build out. You will need to change the VCS settings to “checkout on agent” instead of “checkout on server”, so that the Windows server is no longer trying to send the file changes to a Mac OS X agent and failing horribly.</p>
<h1 id="links">Links</h1>
<ul>
<li><a href="https://confluence.jetbrains.com/display/TCD8/Git+%28JetBrains%29">Git in TeamCity documentation</a></li>
<li><a href="https://www.youtube.com/watch?v=_FzdCC9imDs">Creating a TeamCity project from a Git Url</a></li>
<li><a href="https://help.github.com/articles/setting-your-username-in-git/">Setting up your username in git</a></li>
<li><a href="https://github.com/jonnyzzz/TeamCity.GitHub">TeamCity plugin for GitHub Status display</a></li>
<li><a href="https://confluence.jetbrains.com/display/TCD8/Working+with+Feature+Branches#WorkingwithFeatureBranches-Configuringbranches">Working with feature branches in TeamCity</a></li>
<li><a href="http://blog.jetbrains.com/teamcity/2013/02/automatically-building-pull-requests-from-github-with-teamcity/">Building Pull Requests with TeamCity</a></li>
</ul>
<p><a href="https://blog.petegoo.com/2015/03/14/teamcity-github/">Tips and tricks for integrating GitHub + TeamCity</a> was originally published by Peter Goodman at <a href="https://blog.petegoo.com">PeteGoo</a> on March 14, 2015.</p>https://blog.petegoo.com/2014/07/13/teamcity-slack-build-notifier2014-07-13T00:00:00+12:002014-07-13T00:00:00+12:00Peter Goodmanhttps://blog.petegoo.comblog@petegoo.com<p>We have become immense fans of <a href="http://slack.com">Slack</a> in our office, as have a lot of people I know in the software development industry. If you haven’t heard of Slack, well it’s basically a chat system for businesses, much like HipChat or Campfire. The difference is that Slack seems to bring a healthy dose of cool to everything they do; they are iterating incredibly fast right now and seem to be hitting all the right notes.</p>
<p>Ok, so once you have Slack up and running, you turn to integrations with other systems that are going to maximise the true <a href="https://www.youtube.com/watch?v=NST3u-GjjFw">ChatOps</a> experience. There are already integrations with JIRA, New Relic, GitHub, Twitter and just about every other thing you can think of.</p>
<p>I started to look for a way to have TeamCity notify build results directly into Slack channels (rooms) and found that there were a few options I could have gone with. I could of course use my own chat bot project <a href="https://github.com/mmbot">mmbot</a> to do the notifications for me, but I would either need to poll TeamCity continuously or use a webhooks plugin. There is a very good <a href="https://netwolfuk.wordpress.com/category/teamcity/tcplugins/tcwebhooks/">webhooks plugin available already</a> for TeamCity; the only thing is it doesn’t support commit messages / users, and it would bring in a chain of communication that is not strictly necessary. Nope, I wanted a plugin for TeamCity that would report directly to Slack.</p>
<p>There is a <a href="https://github.com/Tapadoo/TCSlackNotifierPlugin">Slack plugin already for TeamCity</a> but I wasn’t too keen on the way the notifications looked or the reliance on XML configuration to set it up for each project.</p>
<p>So I decided to take the plunge, learn some Java and setup a new plugin to report build status directly from TeamCity into a Slack room. Initially I started by taking a lot of inspiration from the <a href="https://netwolfuk.wordpress.com/category/teamcity/tcplugins/tcwebhooks/">tcWebHooks plugin</a> I mentioned above. I really liked the configuration experience for this plugin and wanted that experience for my users.</p>
<p>I ended up using <a href="http://www.jetbrains.com/idea/">IntelliJ IDEA from JetBrains</a>, this was by far the easiest IDE to setup although for a n00b it was still really painful in java-land. I’m not sure how much of this was pre-conception vs freaky hard configuration in the JDK etc. The build system is Maven and everything else is largely simple stuff.</p>
<p><img src="/images/2014/07/2014-07-13 21_58_13-build-status_pass.png" alt="Pass" />
<img src="/images/2014/07/2014-07-13 21_58_13-build-status_fail.png" alt="Fail" />
<img src="/images/2014/07/2014-07-13 21_58_13-build-slack-config.png" alt="Configuration" /></p>
<p>As usual the <a href="https://github.com/PeteGoo/tcSlackBuildNotifier">code is on GitHub at PeteGoo/tcSlackBuildNotifier</a>.</p>
<p><a href="https://blog.petegoo.com/2014/07/13/teamcity-slack-build-notifier/">TeamCity Slack Build Notifier</a> was originally published by Peter Goodman at <a href="https://blog.petegoo.com">PeteGoo</a> on July 13, 2014.</p>https://blog.petegoo.com/2014/07/05/project-dependency-viewer2014-07-05T00:00:00+12:002014-07-05T00:00:00+12:00Peter Goodmanhttps://blog.petegoo.comblog@petegoo.com<p>Recently I faced an issue trying to get my head around a large codebase consisting of multiple solutions and many, many projects. The difficulty was in trying to understand the interdependencies between these projects, especially the ones that are in different solutions. There are tools to do this. NDepend has some neat stuff, Visual Studio Ultimate Edition can do some things and there are others. For my simple scenario, though, I couldn’t justify the licensing cost.</p>
<p>Luckily I knew that Visual Studio supports the <a href="http://en.wikipedia.org/wiki/DGML">DGML file format</a> in all editions. DGML is essentially a file format where you specify a number of nodes and then links between them as below.</p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp"><?xml version='1.0' encoding='utf-8'?></span>
<span class="nt"><DirectedGraph</span> <span class="na">xmlns=</span><span class="s">"http://schemas.microsoft.com/vs/2009/dgml"</span><span class="nt">></span>
<span class="nt"><Nodes></span>
<span class="nt">&lt;Node</span> <span class="na">Id=</span><span class="s">"1"</span> <span class="na">Label=</span><span class="s">"MyCompany.Core"</span> <span class="nt">/&gt;</span>
<span class="nt">&lt;Node</span> <span class="na">Id=</span><span class="s">"2"</span> <span class="na">Label=</span><span class="s">"MyCompany.Area1.Service"</span> <span class="nt">/&gt;</span>
<span class="nt">&lt;Node</span> <span class="na">Id=</span><span class="s">"3"</span> <span class="na">Label=</span><span class="s">"MyCompany.Area2.Service"</span> <span class="nt">/&gt;</span>
<span class="nt">&lt;/Nodes&gt;</span>
<span class="nt">&lt;Links&gt;</span>
<span class="nt">&lt;Link</span> <span class="na">Source=</span><span class="s">"2"</span> <span class="na">Target=</span><span class="s">"1"</span> <span class="nt">/&gt;</span>
<span class="nt">&lt;Link</span> <span class="na">Source=</span><span class="s">"3"</span> <span class="na">Target=</span><span class="s">"1"</span> <span class="nt">/&gt;</span>
<span class="nt"></Links></span>
<span class="nt"></DirectedGraph></span>
</code></pre></div></div>
<p>Open this file in Visual Studio and you get a nice designer view where you can arrange and change things as you require.</p>
<p><img src="/images/2014/07/2014-07-05 12_31_39-test.dgml.png" alt="dgml file in Visual Studio" /></p>
<p>So I created a simple tool that will generate a diagram like this for you when you give it a folder. It will search the child folders for any csproj, vcxproj and vbproj files, calculate their references and give you the relevant DGML file for you to analyse your dependencies. Simple stuff really.</p>
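For illustration, the core of such a tool can be sketched in a few lines of Python — this is not the actual tool (which is a .Net app), the function and file names are hypothetical, and only csproj scanning is shown:

```python
import glob
import os
import tempfile
import xml.etree.ElementTree as ET

MSBUILD_NS = "{http://schemas.microsoft.com/developer/msbuild/2003}"

def build_dgml(root_folder):
    """Scan for .csproj files and emit a DGML dependency graph as a string."""
    projects = sorted(glob.glob(os.path.join(root_folder, "**", "*.csproj"),
                                recursive=True))
    ids = {path: str(i + 1) for i, path in enumerate(projects)}
    graph = ET.Element("DirectedGraph",
                       xmlns="http://schemas.microsoft.com/vs/2009/dgml")
    nodes = ET.SubElement(graph, "Nodes")
    links = ET.SubElement(graph, "Links")
    for path, node_id in ids.items():
        label = os.path.splitext(os.path.basename(path))[0]
        ET.SubElement(nodes, "Node", Id=node_id, Label=label)
    for path, node_id in ids.items():
        for ref in ET.parse(path).iter(MSBUILD_NS + "ProjectReference"):
            # ProjectReference paths are relative (and Windows-style) in csproj.
            rel = ref.get("Include", "").replace("\\", os.sep)
            target = os.path.normpath(os.path.join(os.path.dirname(path), rel))
            if target in ids:
                ET.SubElement(links, "Link", Source=node_id, Target=ids[target])
    return ET.tostring(graph, encoding="unicode")

# Tiny demo: App references Core, so the graph should contain one link.
demo = tempfile.mkdtemp()
with open(os.path.join(demo, "Core.csproj"), "w") as f:
    f.write('<Project xmlns="http://schemas.microsoft.com/developer/msbuild/2003" />')
with open(os.path.join(demo, "App.csproj"), "w") as f:
    f.write('<Project xmlns="http://schemas.microsoft.com/developer/msbuild/2003">'
            '<ItemGroup><ProjectReference Include="Core.csproj" /></ItemGroup>'
            '</Project>')
dgml = build_dgml(demo)
print(dgml)
```

Save the printed XML with a .dgml extension and Visual Studio will open it in the same designer view shown above.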
<p>The repo and binaries are on github at <a href="https://github.com/PeteGoo/ProjectDependencyVisualiser/">PeteGoo/ProjectDependencyVisualiser</a>. Enjoy.</p>
<p><a href="https://blog.petegoo.com/2014/07/05/project-dependency-viewer/">A simple project dependency viewer</a> was originally published by Peter Goodman at <a href="https://blog.petegoo.com">PeteGoo</a> on July 05, 2014.</p>https://blog.petegoo.com/2014/04/27/moved-blog-to-jekyll2014-04-27T00:00:00+12:002014-04-27T00:00:00+12:00Peter Goodmanhttps://blog.petegoo.comblog@petegoo.com<p><img src="/images/github.pages.jpg" alt="GitHub Pages" /></p>
<p>Like most of the bloggers on the internet these days I’ve moved my blog off wordpress and onto GitHub pages using Jekyll.</p>
<p>There was a fairly large amount of <a href="http://www.hanselman.com/blog/YakShavingDefinedIllGetThatDoneAsSoonAsIShaveThisYak.aspx">Yak Shaving</a> involved in this process. I’m not going to do a tutorial on how to move from Wordpress to Pages/Jekyll, you can find plenty of info on how to do that on the links below. I will however point out some of the things that threw me.</p>
<ul>
<li><a href="http://hadihariri.com/2013/12/24/migrating-from-wordpress-to-jekyll/">This is a good tutorial</a> by <a href="https://twitter.com/hhariri">Hadi Hariri</a></li>
<li>Follow the <a href="http://jekyllrb.com/docs/migrations/">Jekyll migrations site</a> for instructions on exporting the Wordpress XML. (The Wordpress plugin didn’t work for me)</li>
<li>Learn that Windows is a second class citizen in this toolset and you are going to have to shave some Yaks.</li>
<li>Make sure you <a href="https://help.github.com/articles/using-jekyll-with-pages">install Ruby, Jekyll, Bundler</a></li>
<li>Set up your GitHub pages repo.</li>
<li>Choose a theme and pull it into your repo. Decide whether you are going to fork it or just pull it into your existing repo.</li>
<li>When importing, ignore <a href="http://import.jekyllrb.com/docs/wordpressdotcom/">the Jekyll import site bash script</a> and just use <code class="language-plaintext highlighter-rouge">jekyll import wordpressdotcom --source wordpress.xml</code> instead</li>
<li>Make sure to <a href="https://github.com/PeteGoo/petegoo.github.io/commit/2f52eb963ad0ddc76242586c677bcaf300e72fa1">add the correct encoding</a> for your site if you are on Windows.</li>
<li>Watch out for the problem with “{{” characters in your xaml. <a href="http://jekyllrb.com/docs/troubleshooting/">Broken by Liquid 2.0</a>. <a href="https://github.com/PeteGoo/petegoo.github.io/commit/36a553e74edbc18196a2d5989f00c594fe6bd010">Fixable by escaping</a>.</li>
<li>Remember to keep changing the url in the _config.yml to suit your current deployment or strange things might happen.</li>
<li>DO NOT waste time trying to get <a href="http://jekyllrb.com/docs/configuration/">FrontMatter defaults</a> working. I couldn’t and gave up. Wasted sooooo much time on this. Instead I just changed the template to always switch on Disqus comments on posts.</li>
<li>Disqus support admitted that they currently have a problem with their import, hence my old comments are not there yet.</li>
<li>Learn to get permalinks right and use the <a href="https://github.com/jekyll/jekyll-redirect-from">jekyll-redirect-from plugin</a> which GitHub pages supports.</li>
<li>Worst of all is the RSS feed. My previous blog’s feed was at <code class="language-plaintext highlighter-rouge">/index.php/feed/</code> while the default templates for Jekyll put it at <code class="language-plaintext highlighter-rouge">/feed.xml</code>. This wouldn’t be too hard, but the redirect plugin for Jekyll uses html-based redirects rather than proper 301s, and Feedly only likes 301s. Add to that the fact that GitHub Pages doesn’t do <code class="language-plaintext highlighter-rouge">.htaccess</code> files and you have a recipe for disaster. Luckily there is a hack: create an <em>index.php</em> folder containing a <em>feed</em> folder, and inside that put a copy of the <em>feed.xml</em> file renamed to <em>index.html</em>. Although the content-type of the feed response is now <em>text/html</em>, it seems to work none-the-less.</li>
</ul>
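The feed workaround in that last bullet can be sketched as follows — a hedged illustration against a throwaway folder standing in for the site root; in practice you would commit the copied file into the Jekyll repo itself:

```python
import os
import shutil
import tempfile

site = tempfile.mkdtemp()  # stand-in for the site root

# The Jekyll template writes the feed to /feed.xml ...
with open(os.path.join(site, "feed.xml"), "w") as f:
    f.write('<rss version="2.0"><channel><title>PeteGoo</title></channel></rss>')

# ... so mirror it at /index.php/feed/index.html, the path the old
# Wordpress feed URL pointed to.
legacy_dir = os.path.join(site, "index.php", "feed")
os.makedirs(legacy_dir, exist_ok=True)
shutil.copyfile(os.path.join(site, "feed.xml"),
                os.path.join(legacy_dir, "index.html"))

mirrored = open(os.path.join(legacy_dir, "index.html")).read()
print(mirrored)
```

Old subscribers keep fetching /index.php/feed/ and get the same XML, just served as text/html.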
<p>Good Luck!</p>
<p><a href="https://blog.petegoo.com/2014/04/27/moved-blog-to-jekyll/">Moved blog to GitHub Pages and Jekyll</a> was originally published by Peter Goodman at <a href="https://blog.petegoo.com">PeteGoo</a> on April 27, 2014.</p>https://blog.petegoo.com/2013/10/13/introducing-mmbot-a-c-hubot-port2013-10-13T00:00:00+13:002013-10-13T00:00:00+13:00Peter Goodmanhttps://blog.petegoo.comblog@petegoo.com<p>I’ve been playing recently with GitHub’s <a href="http://github.com/github/hubot" target="_blank">Hubot</a> chat bot, written in CoffeeScript on Node. I wanted to connect it to our <a href="http://about.jabbr.net" target="_blank">jabbr</a> instance with the least amount of friction and to provide our team with an office jukebox, build automation and various meme-oriented distractions. There is a <a href="https://github.com/smoak/hubot-jabbr" target="_blank">pretty good adapter</a> for Hubot to do this, but after having a few issues I decided it might be easier to do a port to C# so I could more easily write scripts and connect to jabbr (which is written in C# and has some ready made APIs).</p>
<p>The result is <a href="http://github.com/petegoo/mmbot" target="_blank">mmbot</a>, a hubot port that follows the same basic architecture of Hubot with minimal changes. The basic goals of mmbot are as follows:</p>
<ol>
<li><strong>Provide a chat bot written in C# with all the functionality of Hubot but with a script environment more familiar to .Net devs.</strong></li>
<li><strong>Hubot scripts should be easy to convert into mmbot scripts.</strong><br />This may mean that there are some weird overloads in the API for writing scripts that look like hubot scripts, but it should still be very usable, customizable and familiar to .Net devs.</li>
<li><strong>ScriptCS style scripts should be automatically picked up and run from a scripts folder.</strong><br />There are some blockers here currently in the NuGet package resolution and <a href="https://github.com/scriptcs/scriptcs/issues/243">dynamic loading of scripts</a>.</li>
</ol>
<p>Currently there are 2 adapters for mmbot - jabbr and HipChat. The jabbr adapter is the most used at the moment but the HipChat one should also be working.</p>
<p>Scripts can be written in code by implementing IMMBotScript or by dropping a scriptcs csx file into a scripts folder beside the executable. The pre-compiled approach gives you the power of async/await and the speed of compiled code, while scriptcs means you don’t need to create a dll. The experience of porting Hubot scripts has so far been pretty painless as the API was designed to make this process incredibly easy.</p>
<p>Here is the hubot math script in scriptcs form</p>
<p><script src="https://gist.github.com/PeteGoo/6956172.js"></script></p>
<p>And as an IMMBot script</p>
<p><script src="https://gist.github.com/PeteGoo/6956182.js"></script></p>
<p>Notice that there is an Http fluent style helper for creating requests and processing the responses using HttpClient and Json.Net.</p>
<h2>Starting mmbot</h2>
<p>Starting mmbot is easy. You can choose to configure him by environment variables or by passing in config parameters in code</p>
<p><script src="https://gist.github.com/PeteGoo/6956204.js"></script></p>
<h2>What scripts does it have?</h2>
<p>mmbot currently has the following scripts</p>
<ul>
<li>Searching for images, animated gifs, cats, pugs, maps, youtube videos</li>
<li>Urban Dictionary definitions</li>
<li>Ascii art generator</li>
<li>Mustache me (place a mustache on someone’s face)</li>
<li>Xkcd comics</li>
<li>Spotify player /office jukebox with playlist, query, album, track playing and queuing and volume control.</li>
<li>Jetbrains TeamCity build server – querying build status, starting builds etc.</li>
</ul>
<p>Thanks to <a href="http://dkdevelopment.net" target="_blank">Damian Karzon</a> for contributing some of the scripts. Check out the current catalog at <a href="http://github.com/petegoo/mmbot">http://github.com/petegoo/mmbot</a></p>
<h2>What's next?</h2>
<p>Next for mmbot is to try to implement a few cooler features like loading and saving scripts from gists and starting from scriptcs (some issues here currently), and to get the script catalog expanded.</p>
<p><a href="http://github.com/petegoo/mmbot" target="_blank">Go check out the mmbot code on github</a>.</p>
<p><a href="https://blog.petegoo.com/2013/10/13/introducing-mmbot-a-c-hubot-port/">Introducing mmbot, a C# Hubot port</a> was originally published by Peter Goodman at <a href="https://blog.petegoo.com">PeteGoo</a> on October 13, 2013.</p>https://blog.petegoo.com/2013/08/06/making-the-systray-transparent-on-windows-phone2013-08-06T00:00:00+12:002013-08-06T00:00:00+12:00Peter Goodmanhttps://blog.petegoo.comblog@petegoo.com<p>I’ve used a semi-transparent AppBar on Windows Phone a number of times to get the visual effect I wanted and to stop the dreaded jumping frame when transitioning pages, but I didn’t think it would be possible with the SysTray. Turns out that the SysTray in Windows Phone can be made transparent.</p>
<p>This is pretty useful if you want to keep an image background that spans the entire height of the screen. Simply add the attributes below to your root PhoneApplicationPage element.</p>
<pre class="brush:xml">
shell:SystemTray.IsVisible="True"
shell:SystemTray.ForegroundColor="Yellow"
shell:SystemTray.Opacity="0"
</pre>
<p>and the result should look like this:</p>
<p><img class="alignnone size-medium wp-image-424" alt="systray" src="/images/2013/08/systray1-300x94.png" width="300" height="94" /></p>
<p><a href="https://blog.petegoo.com/2013/08/06/making-the-systray-transparent-on-windows-phone/">Making the SysTray transparent on Windows Phone</a> was originally published by Peter Goodman at <a href="https://blog.petegoo.com">PeteGoo</a> on August 06, 2013.</p>