Teaching is the best teacher

I just finished a fun-and-fact-filled 2-day workshop on Riding the Architect Elevator. At the beginning of the workshop I highlight to attendees that I do these workshops because it’s a great learning opportunity for me as well – having to explain things is the best way to really understand them. I also take away a lot from the many exercises and discussions because it allows me to harvest examples from the attendees’ diverse contexts. Lastly, it sharpens my arguments and storylines – nothing tests your logic better than a dozen smart architects asking questions.

Teaching done right is therefore a definite two-way street. If you attend a training course, you’ll notice quickly whether your instructor is there to teach only or also to learn. I expect you’ll find the latter much more valuable.

Show the pirate ship

One of the workshop exercises is based on the chapter “Show kids the pirate ship” in my book. This advice is a reference to a typical Lego box: the cover doesn’t show all the tiny pieces inside the box, but it shows the pirate ship (or whichever thing you can build from the pieces). Seeing the final product excites the kids and shows the actual purpose of all the pieces on the inside.

Sadly, IT tends to do the exact opposite: we love to show all the little pieces in excruciating detail but forget to show the pirate ship that comes out of them. This is one of the many reasons that, to the business, IT remains a mysterious “black box” where a lot of money goes in and little comes out.

Monitoring architecture

In our workshops we have the participating architects draw out a system structure. Because the purpose of the exercise is to see different ways of illustrating an architecture, I wanted to pick a system that’s well understood by almost all attendees. I therefore use an abstract monitoring system, which checks the health of applications and alerts if something is amiss. The nice thing about a monitoring system is that almost every engineer should have interacted with one.

As part of the exercise I hand a stack of little cards to the teams of architects. Their task is to make a “good” architecture diagram that incorporates all those pieces. We do this in small teams so we can compare and critique the results.

Monitoring architecture team exercise

The cards contain components like Black-box monitoring, White-box monitoring, Log aggregator, Time-series database, Triggers, Alerts – all well-understood pieces of a monitoring system.

Teams tend to have quite a bit of fun with the exercise. After about 10 minutes of discussing, sorting cards, and drawing on flip charts, they generally come up with a diagram like this:

Drawing a monitoring architecture

The drawings generally have a clean data and/or control flow from the application through the sensors, logs, log aggregation, the time-series database, alerts, down to the operator. The diagrams are generally well structured and have a visual language that expresses the semantics of the underlying system.

Considering the purpose

After presenting and discussing the sketches, I generally ask an innocent-sounding question: “What’s the purpose of this system?” Initially, most attendees consider the purpose of the system to be detecting anomalies or outages and alerting someone. After a bit of prodding, the architects start to “zoom out” and identify the real purpose to be maximizing system availability: if you don’t care about system availability, you don’t need any monitoring. The system maximizes availability by identifying system downtime and enabling speedy recovery.

Closing the loop

Next, we venture to augment the picture to “show the pirate ship”. In our case, the pirate ship is to maximize system availability by minimizing downtime. We literally close the loop by drawing a connection from the operator to resolving the issue that triggered an alert.

Showing the pirate ship

Once we have the loop from System Under Test to Alert and Resolution in the picture, we can highlight the purpose of the system visually: minimize the time from the issue taking place to resolving it. This is the system’s Mean Time To Recovery (MTTR). We draw this as a bold statement in the middle of the loop.

Making better decisions

Once the purpose and the complete system are clear, we can see that having the “full picture”, so to speak, helps us make better decisions. It’s now apparent that the MTTR is made up of two halves: how long does it take to detect an outage and how long does it take to resolve it?

Once this aspect is clear, one can reason about whether the company should invest in a better monitoring system. For example, investing in a monitoring system that reduces the time to detect outages from half an hour to a few minutes thanks to better sensors and smarter analytics may seem like a good idea. Once you consider, though, that resolving an outage takes several hours, the picture changes. Investing, let’s say, half a million dollars to reduce the MTTR from 4.5 hours to 4.1 hours doesn’t look that great anymore. Instead, you’d be looking to reduce the time spent resolving outages, e.g. through better transparency across systems or higher levels of automation that can quickly roll back the deployed software to an earlier, stable version. Drawing a better picture has helped us make better decisions.
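To make the arithmetic concrete, here is a minimal sketch of the comparison. It treats MTTR as the sum of detection time and resolution time, as described above. The 0.5 h and 4.0 h split follows from the 4.5 h and 4.1 h figures in the text; the `Outage` class, the function name, and the "faster rollback" numbers are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outage:
    """An average outage, split into the two halves of the loop (in hours)."""
    detect_hours: float
    resolve_hours: float

    @property
    def mttr_hours(self) -> float:
        # MTTR = time to detect the outage + time to resolve it
        return self.detect_hours + self.resolve_hours


def relative_improvement(before: Outage, after: Outage) -> float:
    """Fraction by which MTTR shrinks when going from `before` to `after`."""
    return (before.mttr_hours - after.mttr_hours) / before.mttr_hours


# Figures implied by the text: 4.5 h MTTR, of which half an hour is detection.
baseline = Outage(detect_hours=0.5, resolve_hours=4.0)

# Option A: better monitoring cuts detection to a few minutes (0.1 h).
better_monitoring = Outage(detect_hours=0.1, resolve_hours=4.0)

# Option B (hypothetical figure): automated rollback halves resolution time.
faster_rollback = Outage(detect_hours=0.5, resolve_hours=2.0)

for name, option in [
    ("better monitoring", better_monitoring),
    ("faster rollback", faster_rollback),
]:
    print(
        f"{name}: MTTR {option.mttr_hours:.1f} h, "
        f"{relative_improvement(baseline, option):.0%} shorter"
    )
```

Under these assumed numbers, the monitoring investment shortens MTTR by roughly 9%, whereas halving the resolution time would shorten it by roughly 44%. The exact figures matter less than the shape of the comparison: the larger half of the loop is where an improvement pays off most.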

Teaching the teacher

Admittedly, I trick the participants a little bit by handing them only cards describing the “monitoring” side of the system. At the same time, detecting missing pieces and “zooming out” to see the bigger picture are essential capabilities of an architect.

The most valuable part for me is that the exercise didn’t start out this way. Originally it was just a way to draw a few architectures and compare them. Through the dialog with attendees it evolved into combining it with the Pirate Ship and decision making, drawing multiple elements of the class into a single exercise. Teaching really is the best way to learn.