Goodhart's Story Points
Are you using story points wrong?
When was the last time you opened up the Scrum Guide[1]?
Teams who work within the Scrum framework often find themselves using some distorted version of Scrum as taught by a consultant, manager, scrum-master or even just how ‘everybody knows it works.’ Lots of the things that Scrum teams treat as gospel are, at best, from the apocrypha. We’re talking looking up the USS Enterprise-D on Memory Alpha rather than just watching The Next Generation, or reading a review of the academic literature rather than going to the original article. All of these things have value, but sometimes it’s worth going back to the primary source. And in Scrum, that’s the Scrum Guide.
One of those things that ‘everybody knows’ (at least in some teams) is story points. Story points tend to have the following characteristics:
This feels like the point where I’d jump on my proverbial soapbox and scream that ‘everything you believe is wrong!’ But - it’s not. These are all helpful ways of thinking about story points, and can be useful abstractions in forming a productive Scrum team.
I would, however, suggest that there is one far more insidious element to story points that can render them a risk to an effective Scrum team: they are numbers. They can be accounted, they can be metric’d, they can be statistic’d. Ultimately, this regularly leads to story points being misused in ways that compromise their value to the team.
So - I promised to rely on the original source. What does the Scrum Guide have to say about story points?
Absolutely nothing.
The earliest reference to what seems like proto-story points comes from a 2002 paper on Planning Poker by Grenning[5]. Grenning does define them as being equal to ‘days of work.’ The approach seems to have been popularised as a method of Agile estimation by Cohn in 2004[6]. The Scrum Guide provides fairly rigid requirements about what a Scrum team needs to do (like time-boxed stand-ups), but very loose requirements about how a Scrum team goes about those. It’s expected that a team will use refinement to improve and adjust their processes over time. As such, story points can fit right in. But where?
Per the Scrum Guide:
I would suggest, therefore, that the purpose of story points within a Scrum framework is primarily:
Anything that disrupts any of these three functions might be something that needs some further consideration.
To see why they may need further consideration, we’re going to have a look at Goodhart’s law[9]. To paraphrase, Goodhart’s law states that ‘when a measure becomes a target, it ceases to be a good measure.’
Story points, because (as noted) they can be counted, can lead to some potentially useful measures. For instance:
Again - these aren’t bad things in and of themselves. But - how can these be used? Some organisations measure the performance of teams by the number of points that they complete. Some measure individuals by the number of points that they complete. Some set targets for the number of points to complete. Heck - some teams measure themselves and their internal improvement by how many points they complete. The problem with all of these uses of story points is that they link measuring the performance of a team to an internal tool that is intended to assist the team in making decisions.
And hence, we find ourselves meeting Charles Goodhart and his eponymous law. Once some of those measures become targets, they can very quickly be distorted by the participants in the process, or distort the process itself. And the insidious secret befouling this distortion is that whilst some teams may ‘game the system’, the system is often bent completely unintentionally. Sometimes, the consequences can go beyond the use of story points to affecting the quality of work itself.
Let’s look at each of these three good measures, and some of the dangers that they may introduce when they become targets. This is a good point to note that my examples are going to be from the software development space, as it’s what I’m most familiar with - although I suspect that these lessons are general in nature, and it won’t take much imagination to find equivalent consequences in any knowledge-based field.
Velocity
Ahh, Velocity. Grandfather of Agile Metrics, progenitor of all. I’m specifically using the definition as ‘sum of story points completed during a sprint’.
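In code, that definition is nothing more than a sum over the sprint’s completed cards. A minimal sketch - the `Card` class and all the numbers here are invented purely for illustration:

```python
from dataclasses import dataclass

@dataclass
class Card:
    title: str
    points: int
    done: bool

def velocity(sprint_cards: list[Card]) -> int:
    """Velocity as used here: sum of story points of cards completed in the sprint."""
    return sum(card.points for card in sprint_cards if card.done)

sprint = [
    Card("Login form", 5, True),
    Card("Password reset", 3, True),
    Card("Audit logging", 8, False),  # rolled over, so it doesn't count
]
print(velocity(sprint))  # 8
```

Note that incomplete cards contribute nothing - a card is either done or it isn’t, which is itself a source of the end-of-sprint pressure discussed below.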
Inflation
The biggest risk I’ve directly witnessed with Velocity has been Story Point Inflation[11] (Inflation). For those familiar with the economic term, it’s exactly what it sounds like: equivalent cards receive higher story point estimates over time. Some cards will legitimately receive higher estimates based on what the team learned last time (just as an agile team should do) - but some cards should be receiving lower estimates based on past experience too. If the Velocity is slowly creeping up sprint-by-sprint, it’s tough to figure out whether the team is actually doing more work, or whether their estimates are changing due to inflation.
Inflation can be a risk even where story points are being used legitimately, as it can make it tough to know how much work a team should bring into a sprint, and tough to tell whether changes in process (such as those made in retrospectives) are having a positive effect. When Velocity becomes a target, Inflation turns from merely a factor to be cognisant of into a gaping maw with a perverse incentive to shove your arm right in. Suddenly, it’s in the team’s best interest to overestimate every card and break out testing/deployment into separate cards (still with those Fibonacci estimates) - after all, the more points they complete, the better they’ll be seen to be. This can obviously lead to knock-on effects where Product Owners[12] are disempowered from being able to figure out what they can expect a team to deliver. Ironically, this means that the Target Velocity, which was supposed to bring more certainty to the business, instead brings less.
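One rough way to watch for inflation - my own toy heuristic, not anything from the Scrum Guide - is to track the mean points per completed card across sprints. If the kind of work hasn’t changed but the mean keeps climbing, the estimates are probably drifting rather than the work growing. All data below is invented:

```python
def avg_points_per_card(sprint_history: list[list[int]]) -> list[float]:
    """For each sprint, the mean story points per completed card.
    A steady climb without a change in the kind of work hints at inflation."""
    return [sum(points) / len(points) for points in sprint_history]

# Points of the completed cards in three consecutive sprints (invented data).
history = [
    [2, 3, 3, 5],
    [3, 5, 5, 5],
    [5, 5, 8, 8],
]
print(avg_points_per_card(history))  # [3.25, 4.5, 6.5]
```

It’s a prompt for a retrospective conversation, not a verdict: the trend could equally mean the backlog genuinely got harder.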
Over-committing
Targeting a constant increase of Velocity can - if accomplished merely by pushing harder - cause short-term increases to Velocity at the expense of the long-term productivity of a team, in a couple of ways. It can encourage teams to over-commit to work, which can have the side effect of increasing burn-out in the team[13]. Over-committing reduces the ‘slack’ time which team members should use for making the team more efficient by, for instance:
Also - from what I’ve witnessed, a team that is over-committed and targeting velocity tends to push itself too hard. A team that is running as hard as it possibly can during the sprint tends to tire itself out: many programmers (or indeed people generally) can’t handle that level of stress for an extended period of time.
Quality Slippage
All teams working in a Scrum framework should have a ‘Definition of Done’[14] - but targeting velocity can mean that what a team is willing to accept may fall short of that definition.
Our goal with Scrum is to complete the work that has been forecast and planned - or, to look at The Scrum Guide itself, “The Sprint Goal is the single objective for the sprint”[15]. In practice, the desire to ‘complete’ the sprint can lead to the Definition of Done being compromised (even in the absence of a velocity target) as developers prioritise completion over correctness. A Scrum team is supposed to be self-managing[16], and part of that is that it needs to be disciplined. Velocity targets tend to come from external sources (such as managers[17]). They can therefore exacerbate the existing risk by putting external pressure on the team to compromise on its Definition of Done.
The Definition of Done is how processes like code review, manual testing and automated testing[18] - things that are good to do - become codified into the expectations around the team. They’re also what tends to get compromised when a team needs to ‘just deliver the damned feature’.
Internal competition
Some organisations - formally or informally - compare teams against one another using metrics (stack ranking), and may reward or punish teams based on those metrics[19]. I’m only going to address this very briefly, as a detailed examination of stack ranking is outside the scope of what I’m going to discuss here, and the effect of stack ranking on Agile teams has already been noted[20] and expanded upon.
Velocity is compromised as a measure when it needs to be consistent with the Velocity measurements of other teams, and becomes further compromised when teams are incentivised to ‘game’ their points to appear to have completed more work than other teams. To put it succinctly, forcing teams to compete on velocity will ultimately succeed in turning Velocity into a useless measure by encouraging the other risks to Velocity presented here.
But…
Do you want to know the worst thing about Velocity as a target? When Velocity is a measure, it is the culmination of story points being used in The Right Way™. Velocity can be immensely useful for a team to plan how much work they will bring into the sprint.[21]
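That legitimate use can be sketched very simply. Assuming - hypothetically - a team that plans the next sprint around the average of its last few sprints:

```python
def planned_capacity(recent_velocities: list[int], window: int = 3) -> float:
    """One common heuristic among many: plan the next sprint around the
    average velocity of the last few sprints. The team still sanity-checks
    the result rather than treating it as a quota."""
    recent = recent_velocities[-window:]
    return sum(recent) / len(recent)

# Velocities from the last five sprints (invented numbers).
print(planned_capacity([21, 18, 24, 19, 23]))  # 22.0
```

Used this way the number informs a decision the team still owns; it only becomes dangerous once someone starts demanding it go up.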
The Cursed Velocity
The Cursed Velocity is my own definition for velocity, as measured by the number of cards completed. I’m calling it The Cursed Velocity, because it should never ever be done, and there is no excuse for using it.
The amount of work in cards varies wildly: there is no equivalence between a one-line spelling change and delivering part of a major refactor. To measure a team by the number of cards they complete entirely ignores the amount of work involved in what they are doing. It is also easily corrupted as a measure, by splitting cards into non-deliverable chunks of work rather than individually useful or releasable cards.
Honestly, even ignoring Goodhart’s Law, I doubt its efficacy as a measure even when it’s not being used as a target.
Individual Velocity
Individual Velocity is the velocity per individual team member. For the sake of this article, I’m using the definition of Individual Velocity being the sum of the story points of the cards that a developer developed. This definition may change in some workplaces to (for instance) incorporate code review, which would offset at least some of the issues presented here.
Risk aversion
Risk aversion, in itself, is not necessarily a bad thing in a software development team, for either individuals or the team as a whole. In fact, I would suggest that risk aversion can lead to one particularly good outcome where permitted to do so: high-quality refinement[22] of cards. A team, or individuals, that are given sufficient time to do so will be well prepared during refinement:
Each of those things actively leads to story points being more useful for their original purpose: estimating how much of the backlog to bring into the forthcoming sprint.
Risk aversion, particularly when placed upon an individual’s shoulders, can also have a deleterious effect on the outcomes of a team. In my experience, higher-performing Scrum teams pick up the hardest or biggest pieces of work early in the sprint to maximise the chance of them being completed. Team members who are averse to risk will be drawn first to the least risky long pieces of work (a big chunk of story points completed with some certainty), then to the least risky short pieces of work (a small chunk of story points with some certainty), and only then to the riskier cards. After all, those riskier cards have a higher chance of ‘blowing out’, and resulting in the individual completing fewer story points’ worth of work.
When those issues are hit earlier in the sprint, there is often a chance for the team to ‘come together’ and get the card over the line before the sprint completes. When those issues are hit on day 8 of a 10-day sprint, that card is probably getting rolled over into the next sprint. Remember - according to The Scrum Guide - the purpose of a sprint is to complete the Sprint Goal[23]. Cards getting rolled over is obviously contrary to achieving the Sprint Goal, or indeed to completing what the team forecast they would be able to.
Quality controls and ‘axe sharpening’
Quality controls and ‘axe sharpening’ are bound together here, as both are work that is essential for the running of a good software development team, and neither is (usually) reflected in the Individual Velocity. The very brief issue here is that targeting Individual Velocity discourages any work that is not directly reflected in the number of story points an individual achieves during a given sprint.
So - what kind of quality controls am I talking about? In short, the things other than coding that fall under the team’s Definition of Done[24]. This will include[25] at the least some process requiring code review, and some level of testing. Unless these things are somehow incorporated into your Individual Velocity metric, they may be neglected. Code review can be an onerous and time-heavy task - particularly when it involves a large refactor, or requires significant mentorship between a senior and a junior developer. Any time spent on code review will naturally cut into the time that a developer is able to spend on their own development. Yet code review is immensely valuable to both the project and the developers, for senior and junior developers alike[26]. Testing by a cross-functional team is also a necessity, whether by dedicated testers or by cross-functional developers. Again, not reflecting that testing in the Individual Velocity will discourage developers from spending as much time on testing as may be necessary.
I’m using the term ‘axe sharpening’ from a quote attributed to Abe Lincoln: “[G]ive me six hours to chop down a tree and I will spend the first four sharpening the axe.”[27] Axe sharpening, in a software context, is doing things like:
These are all things that will make an individual team more productive and achieve more things in the future, but won’t be reflected in the points they achieve in the present sprint. An individual who is hyper-focussed on achieving the most points?right now?is going to be (metaphorically) blunting their axe and potentially inhibiting their future growth.
Internal competition
Like at the team level, some organisations - again, formally or informally - compare individual team members against one another using metrics, and may reward or punish individuals based on those metrics. The same risks that apply to a team using Velocity as a metric can apply to intra-team dynamics when Individual Velocity is being used to compare members of a team against one another. The same risks of ‘gaming the system’ apply (such as encouraging developers to ‘play up’ the amount of work involved when estimating) - particularly if there are cards that one team member is particularly suited to developing.[28]
Further, if the members of a team are being compared against each other, they have a perverse incentive to compete rather than work as a team. That doesn’t strictly fall within my critique of the use of story points, but it was worth taking note of.
Estimated cost
Here’s some logic for you. A software developer costs $X/hour and can do Y hours per week. They’ve been good little mostly consistent software engineers, and we know that they deliver Z story points per week, usually. So, I can totally figure out how much each story point costs me! I can know exactly how much any feature (or bug, I guess, if I must) is going to cost, once they’ve estimated it!
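That seductive arithmetic looks roughly like this - every figure below is invented purely for illustration:

```python
def cost_per_point(hourly_rate: float, hours_per_week: float,
                   points_per_week: float) -> float:
    """Weekly cost divided by 'typical' weekly points. This is the fragile
    step: it assumes a story point is a stable, comparable unit of work."""
    return (hourly_rate * hours_per_week) / points_per_week

def feature_cost(estimate_points: int, rate: float = 100.0,
                 hours: float = 40.0, weekly_points: float = 10.0) -> float:
    # Hypothetical defaults: $100/hour, 40 hours/week, 10 points/week.
    return estimate_points * cost_per_point(rate, hours, weekly_points)

print(feature_cost(8))  # 3200.0
```

The arithmetic itself is trivially correct; the problem, as the rest of this section argues, is what happens to the inputs once everyone knows it’s being run.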
I think, for some of you, that might sound just a little familiar. Maybe you’ve seen it influence your work (and estimates), or maybe you’ve been tempted by the Dark Side of the Metrics yourself. And like so many of the potential abuses I’ve spoken about, this one may not be entirely bad - being able to account for the time that your team is going to spend building something is actually really helpful. Being able to say ‘Great! We can do that. It’s probably going to cost you about $X. Do you still want to do it?’ to somebody up your leadership tree who is prone to pivoting to their latest idea can even support Scrum working the way it is supposed to, by giving scrum masters tools to protect the sprint, and product owners tools to protect the backlog.
However, we must remember that the whole point of story points is to assist teams in figuring out how much stuff from the backlog they can bring into a new sprint. When the development team is exposed to points being costed like this, their primary purpose can again be hindered.
Pointing Pressure is (for this article’s sake) where developers are implicitly or explicitly pressured to change their point estimates for reasons beyond having a better understanding of the work involved. This might look like:
Most of these scenarios could potentially happen with or without using story points for estimated cost. Heck - I’ve probably been guilty of the fourth scenario myself. But estimating costs with story points adds extra pressure to ‘game the points’ and make them more agreeable to stakeholders. And, if you’re being pressured to turn that 8 into a 13, or that 5 into a 1, you’re going to really struggle to figure out how many of those 13-point cards (that are actually 8s) and 1-point cards (that are actually 5s) your team is going to be able to complete within the sprint. Plus, hey, if despite this article you are still using Velocity as a target, you’re going to wreak havoc with your metrics.
Solutions
Here’s the single, most basic TL;DR takeaway for this whole article:
Stop using story points as targets.
But that might be too simple, and further, it may be out of your control. There are more suggestions we can draw from taking a critical look at how teams use story points, though.
T-Shirt sizing
Right from the start, I suggested that an issue with story points is that they are numbers, and can be treated in all the ways that a number can. But there is no requirement that work be sized using story points, or numbers at all - merely that you can forecast what work a team is able to achieve. One potential way of addressing this is T-Shirt sizing. T-Shirt sizing is where, instead of using numbers to represent size, you use sizes (like on a t-shirt) - normally Small, Medium, Large, and as many levels of Extra Large as a team member needs to occasionally protest and point out that ‘this is a really bad idea’. Like story points, a team will usually have a rough idea of how much work is involved in each size. So, a team may know that, for them, a small card is likely to be less than a day.
Just like story points, they end up being somewhat abstract - a measure of complexity. They’re not really that different. Except, by not being numbers, they don’t lend themselves to being used and abused like numbers can be. For the team themselves, they feel slightly more imprecise, which can prevent some of those negative feelings associated with not getting an estimate exactly right. They give enough information to assist a team in planning, but require the team to actively engage with deciding which cards can come in, as they’re no longer able to just take the average velocity and shove that many points in. I would suggest that the inability to use t-shirt sizes for almost anything except estimating the work that can be brought into a sprint is not a bug, but a feature.
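A tiny sketch of why this matters in practice: if sizes are an enum rather than numbers, there is simply no `+` to abuse, so planning has to mean looking at the actual mix of cards. The card names and sizes below are invented:

```python
from enum import Enum
from collections import Counter

class Size(Enum):
    S = "small"
    M = "medium"
    L = "large"
    XL = "extra-large"

backlog = [
    ("Fix typo on login page", Size.S),
    ("New reporting page", Size.L),
    ("Tweak logging output", Size.S),
]

# Size.S + Size.L raises a TypeError: enum members have no arithmetic,
# so there is no 'velocity' to sum, target, or chart.
# Planning means engaging with the mix instead:
mix = Counter(size for _, size in backlog)
print(mix)  # e.g. two Smalls and one Large
```

The design choice is the absence of arithmetic: the team can still reason (‘we usually manage about two Larges a sprint’) without producing a single number for someone else to turn into a target.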
It shouldn’t feel so radical and controversial to suggest dropping story points entirely (especially given they’re not in the Scrum Guide), but I’m now sitting at my laptop and checking over my shoulder, nervous that I won’t get away with the suggestion (whilst wondering if I need a red revolutionary t-shirt of ‘Pragmatic’ Dave Thomas with a crossed keyboard and mouse symbol).
Professionals at work
As software engineers, we are knowledge workers, not assembly line workers. Scrum is not meant to commodify the work that we do, but instead to empower self-organising teams to develop more effectively - in essence, to treat software engineers like the professionals that they are. This also means that software engineers and scrum masters should[29] be in a position where they can be advocates for better software and Scrum processes. It means that scrum masters and software engineers should respectfully and professionally push back when they see story points being used in a way that jeopardises the team’s ability to function effectively.
There are multiple reasons why this isn’t actually easy (beyond a general desire to avoid conflict). One of those reasons is that lots of companies that engage software engineers, even those that claim to be agile and do Scrum, don’t really understand the Scrum process.
Software engineers and scrum masters should, where able, educate their organisation and stakeholders about how the Scrum process works. This will sometimes mean engaging with managers, or other more senior staff. The intent is that teams will be able to stick to the actual rules of Scrum by ensuring that stakeholders actually understand the process and, ideally, witness it working. This will assist teams that need to argue against violating Scrum principles by, for instance, measuring the effectiveness of a team by the story points it completes.
Effective professional developers must also build trust with their stakeholders and management. I suspect that a large part of the reason that organisations or individuals gravitate towards abusing metrics with software teams is that they fundamentally don’t trust their software developers to do their jobs, and think they require some level of micromanagement to get the most out of them. A team that can consistently deliver the things that their stakeholders need, and are transparent about the issues that they are facing, will be more trusted by their stakeholders. Teams that are trusted and listened to are not teams that are measured exclusively by their numbers.
Management
Velocity and Individual Velocity might be useful tools for a manager to gain some insight into the effectiveness of their team, but they are only numbers - and, as I’ve demonstrated here, easily manipulated numbers. For managers, I would suggest the following:
You can actually look at the velocity of a team, or of its members. It may reveal whether the team is improving, overly reliant on one person, chronically underestimating, etc. Where it becomes an issue is where you start judging the effectiveness of your team based on their velocity, or targeting specific numbers (for instance, in places like KPIs). I would further suggest that you seriously consider not raising those metrics with the team themselves - as, intended or not, it will likely have some effect on how they distribute story points.
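If you do look, look for shape rather than score. For instance, a sketch (with invented names and numbers) of each member’s share of the team’s completed points, which can flag over-reliance on one person as a prompt for a conversation - never as a rating:

```python
def points_share(points_by_member: dict[str, int]) -> dict[str, float]:
    """Fraction of the team's completed points attributable to each member.
    A lopsided share may hint at over-reliance on one person - a prompt for
    a conversation, never a performance score."""
    total = sum(points_by_member.values())
    return {name: pts / total for name, pts in points_by_member.items()}

print(points_share({"Ana": 30, "Ben": 6, "Cam": 4}))
# e.g. Ana carrying three-quarters of the sprint's points
```

As the surrounding text notes, the lopsided number alone doesn’t tell you why: Ben may be doing most of the code review, or Cam may be stuck on a legacy system.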
Which also means that you can’t rely on story points accomplished as a metric when judging the performance of your team. Assessing the effectiveness of software engineers is a topic that warrants separate thought - and is outside the scope of what I can address here. But, if you were relying on story points for formal performance metrics, you must instead better understand the work that your individual team members do. A developer may accomplish fewer points due to taking a higher burden of code review than the remainder of the team. A senior engineer arguably should sometimes be accomplishing fewer story points than some of their junior colleagues, as they should be spending some of their time mentoring, training, teaching, reviewing and planning.
A team may be suffering from a lower velocity not due to inability or lack of effort, but due to requirements to develop on a legacy system. Seeing the lower velocity is useful - it shows that there is an issue that may be addressable (for instance, by refactoring) - but when it’s used as a target, you may instead find a software team that tries to avoid working in that area again.
It’s a good thing to know what your team is up to - but you need to understand what they are doing at a deeper level than the amount of story points that they are completing.
Conclusion
Hopefully, you’ve been able to draw from all this that story points can actually be very helpful. However, by incorporating them into targets, you potentially render them ineffective at their primary purpose of allowing a team to effectively forecast work. It’s always worth going back to the Scrum Guide itself to verify the principles that are actually prescribed, because sometimes it’s tough to distinguish between Scrum and all the things we’ve added onto it. And many of those things we’ve added on can, ultimately, make a Scrum team less effective.
Remember friends: be agile. Don’t just do Agile.
Footnotes
This article has been written in my personal capacity, and does not necessarily reflect the opinion of my employer. Original artwork is by Lachlan Kingsford, and is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.