In The value of your work, I talked about how, in a job, your work helps your individual career only if your manager deems it valuable, even though that is not always the same thing as what is valuable for your team or your organization. In fact, it is rather difficult to come up with a good, unqualified definition of what “valuable” even means. Now I want to look at the same phenomenon from the opposite perspective: what does this mean for organizations and managers?

If you are a department head, manager, or team lead, you should be extremely mindful of the bottom line: whatever you incentivize, knowingly or otherwise, is generally what will happen, and whatever you disincentivize generally will not. Know also that you will unavoidably create incentives and disincentives without even being conscious of doing so.

The problem lies in how to measure value. It is simply not possible for you to examine everything that each of your reports does in full detail, even though that is the only way you could even hope to get a decent measure of their worth and value. (Note that even this would be no guarantee, due to your own inescapable perceptions and biases.) Therefore, you have to rely on proxies: measurements that, hopefully, correlate well enough with the actual value produced to be useful. You need a metric. Every such metric, however, implicitly creates incentives.

For example, you could try measuring the number of lines of code that each of your software developers produces. More code does more stuff, and more stuff makes more money, yes? Well… maybe, sometimes, but not all code is created equal1. Even putting aside whether the metric is a good one, consider the incentive you have created: to write more lines of code. Your developers will quickly learn that what they should do to get raises and promotions is to write more lines of code, and to do just barely good enough work otherwise to avoid suspicion. Oh, you will get more lines of code, that is for sure, and it will look amazing plotted on a graph at a board meeting… but it will almost certainly not mean that more features were added or more bugs avoided and fixed; likely the contrary, since the focus of your reports is now to maximize lines of code whether it makes sense or not, which will definitely be detrimental to the real value created (whatever that mystical beast might even be).

You can play this cat-and-mouse game for a while, but nature and people will always find a way. For example, suppose that you catch on to one of the tricks used to maximize the number of lines:

now
each
word
is
in
its
own
line.
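
The same trick works just as well in actual code: the compiler does not care how many lines a statement spans, so even a one-line declaration like the int foo = 0; in the examples below can be stretched out token by token:

int
foo
=
0
;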

Oh no! Okay, so you change the metric to count characters instead of lines, and now your codebase looks like this:

// defines a new variable called "foo" of type integer (whole signed number) and assigns it an initial value of 0 (zero, nothing)
int foo = 0;

// Adds the value of "bar" to that of "foo" numerically, for example if "bar" is 4, then "foo" will now be 4 as well because "foo" was initially 0 (see the previous line).
// Note that functionally, we could also just use a simple assignment in this case, or, you know, just initialize "foo" to be "bar" to begin with, but This Is More Maintainable (TM).
foo += bar;

Oh dear. It’s okay though: the solution is simple, just exclude comments from your counting tool. But wait, what’s this? Why did the above code turn into:

#if CHARACTERS_ARE_BEING_COUNTED
This is not a comment!
Defines a new variable called "foo" of type integer (whole signed number) and assigns it an initial value of 0 (zero, nothing).
#endif
int foo = 0;

#if CHARACTERS_ARE_BEING_COUNTED
Adds the value of "bar" to that of "foo" numerically, for example if "bar" is 4, then "foo" will now be 4 as well because "foo" was initially 0 (see the previous line).
Note that functionally, we could also just use a simple assignment in this case, or, you know, just initialize "foo" to be "bar" to begin with, but This Is More Maintainable (TM).
#endif
foo += bar;

CHARACTERS_ARE_BEING_COUNTED is never defined anywhere, so all of that text is essentially a comment anyway, but now your counting tool needs to understand the C++ preprocessor to defeat this particular gaming of your metric. Man! It’s annoying, but you get through this.
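
What might “getting through this” look like? Here is a minimal sketch of a character counter that knows about this one specific trick only; the file handling is just one possible way to do it, and a real tool would probably have to run the actual preprocessor instead:

#include <fstream>
#include <iostream>
#include <string>

int main(int argc, char** argv) {
    if (argc < 2) { std::cerr << "usage: count <file>\n"; return 1; }
    std::ifstream in(argv[1]);
    std::string line;
    std::size_t chars = 0;
    int dead = 0;  // nesting depth inside the known-dead #if branch
    while (std::getline(in, line)) {
        if (dead == 0 && line.rfind("#if", 0) == 0 &&
            line.find("CHARACTERS_ARE_BEING_COUNTED") != std::string::npos) {
            dead = 1;                               // entering the dead branch
            continue;
        }
        if (dead > 0) {
            if (line.rfind("#if", 0) == 0)          ++dead;  // nested #if
            else if (line.rfind("#endif", 0) == 0)  --dead;
            continue;                               // nothing in here counts
        }
        chars += line.size();                       // everything else does
    }
    std::cout << chars << " characters\n";
}

Run it over the repository, add the numbers up, and the graph for the board meeting is safe again… for now.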

Now Bob is complaining that he is working on complex multi-threaded computational code that is easily twice as hard to write as Joe’s code. That is a fair point indeed, so you change your metric to count any multi-threaded code double. You see where this is going, right? You guessed it: now all of the code is multi-threaded, for absolutely no reason, and your production servers catch fire because Joe’s button-coloring code now uses 16 threads.
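
To picture the absurdity, here is a caricature of what Joe’s newly multi-threaded button code might look like; the Button type and the colorButton function are made up for illustration:

#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for whatever UI toolkit Joe actually uses.
struct Button { std::string color; };

void colorButton(Button& button) {
    std::mutex guard;
    std::vector<std::thread> workers;
    for (int i = 0; i < 16; ++i) {
        // Sixteen threads politely taking turns to write the same value:
        // double the metric score, zero benefit.
        workers.emplace_back([&] {
            std::lock_guard<std::mutex> lock(guard);
            button.color = "blue";
        });
    }
    for (auto& worker : workers) worker.join();
}

int main() {
    Button ok;
    colorButton(ok);
}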

I could go on (and I can’t say I’m not tempted, this is fun) but I consider the point made: what gets produced is dictated by the metric you use, or, simply put: you will get what you measure, to the exclusion of all else. This is known as Goodhart’s law: when a measure becomes a target, it ceases to be a good measure.

Of course, you, dear reader, know better than this and would never introduce such a silly metric as a measure of productivity. Okay, but what do you use then? You could try measuring the number of tickets resolved: but wait, what used to be 63 tickets is now over 18 000 and growing, because some guy figured out how to automate creating and resolving tickets.
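
The automation does not even need to be clever. A sketch of the idea, where createTicket and resolveTicket are purely hypothetical placeholders for whatever the real tracker exposes:

#include <iostream>
#include <string>

// Hypothetical placeholders for the real issue tracker's API.
static int nextTicketId = 0;

int createTicket(const std::string& title) {
    std::cout << "opened: " << title << '\n';
    return ++nextTicketId;
}

void resolveTicket(int id) {
    std::cout << "resolved ticket " << id << '\n';
}

int main() {
    // Every pass through the loop bumps "tickets resolved" by one.
    for (int i = 0; i < 18000; ++i) {
        resolveTicket(createTicket("Fix typo #" + std::to_string(i)));
    }
}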

It is possible to create decent metrics for some jobs, but the more involved the job, the harder this gets, and when it comes to software development it becomes nigh impossible. This results in invisible, interpersonal metrics: who stands out to you or to the other managers as someone who solves problems and pushes out feature after feature like a machine. Realize, though, that with this you have created a different incentive.

Consider two of your reports: Joe and Bob. Joe is super social, everybody likes him, and he always seems to find an opportunity to mention the latest feature he added or bug he fixed. It is really impressive stuff! Just yesterday he spent the entire afternoon adding a new button to a form, which was apparently really difficult because of something to do with layouts and flexbox and margins and viewports and mobile devices. You are really glad to have somebody like Joe on your team who can fix complicated problems like that! Then there is Bob: nobody is super sure what he does, as he mostly keeps to himself. Yesterday he muttered something about optimizing the calculation engine so that each calculation now takes 20 milliseconds instead of 3 seconds, but you are not super sure what a millisecond is or why this is useful, and Bob did not seem too inclined to explain. 20 seems to be a bigger number than 3, so that seems worse, but it is hard to tell.

The incentive you have now created is for developers to make their work well-known and visible. They will actively avoid tasks that are less visible and therefore earn less credit. You are now selecting for developers with good social and communication skills first, and programming skill only second, at best. Highly complex technical work, such as the networking code responsible for keeping your system functioning reliably even under heavy load, is now avoided: it is immensely difficult to do, and then challenging to explain to a non-technical manager, unless you can show a graph of the number of incidents dropping dramatically since you rolled out your new version or something.

In real life, most companies end up using some combination of concrete metrics (such as tickets closed) and personal impressions gathered during daily or weekly meetings, one-on-one meetings, and casual chats by the coffee machine. Because concrete metrics are easy to game, as we have seen, the latter tend to carry more weight; hence, good salespeople tend to be rewarded more than good professionals.

There is value generated not just by implementing new features, but also by fixing bugs and issues. I mentioned the Preparedness Paradox in the previous article, and this is as textbook a case as it gets. Preventing disasters before they occur is not a good career strategy, because it is difficult to get credit for: there is no emotional weight attached to an event that did not happen, even if it would have been Very Bad. However, coming in at 4 AM on a Saturday to fix a critical production issue whereby all servers caught fire, and working on it all day so that the system is back up by Sunday, tends to be valued immensely, and the person gets hailed as a hero. The same extends to fixing regular, run-of-the-mill bugs in the code: a bug that never made it into production is less visible than a bug that did, and was then gloriously defeated in single combat, documented by a ticket and all.

Every leader and manager is human, and subject to all the usual human biases, which then affect all of the people working around them, and can even define the experience of their reports. Every measure of performance and productivity is bad; some are just less bad than others. There is no silver bullet: you can only try to actively and consciously seek as much data as possible, from as many sources as possible, before rendering judgement, and you can be careful and thoughtful in assigning performance targets and metrics. But at the end of the day, you have to live with the fact that your judgement will always be unfair: there will be people you prize who have done little to earn it, and there will be others whom you have fired despite them quietly being the most valuable members of your team. Such is the role of a manager.

P.S.: I find it interesting how similar all of this is to Specification gaming, which is a major problem in AI safety and alignment research. See also: Good and evil genies: the AI alignment problem


  1. Consider that some of the most complex algorithms in existence span only a few hundred lines of code, if even that, and can take weeks or months to create; meanwhile, completely routine boilerplate code, such as what makes up most websites (e.g. routine CRUD websites) and most tests, takes pretty much only as long to create as it takes to type.

    Some coding styles also naturally lend themselves to more lines of code; for instance, when writing if (condition) { do-stuff }, some styles prefer to put the { on its own line, while others prefer to keep it at the end of the line with if (condition).
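
    For illustration, here is the same trivial check written both ways; the behavior is identical, only the line count differs:

        #include <iostream>

        int main() {
            bool condition = true;

            // Brace at the end of the line: the check spans three lines.
            if (condition) {
                std::cout << "stuff\n";
            }

            // Brace on its own line: the same check now spans four lines.
            if (condition)
            {
                std::cout << "stuff\n";
            }
        }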

    It’s even worse if you try to compare code length in different languages: C++ is extremely verbose compared to Python, for example, and the exact same functionality (if not performance) may take 2-3 times as much code to achieve in C++ as in Python. This is one of the reasons why C++ is less popular than Python. ↩︎