Numbers are not reliable for developers’ performance reviews

Performance reviews are done in many companies around the world. They matter because improving employees’ skills and behavior helps the company make more profit. If a boss or a president finds employees whose performance is far below the company average, they might start thinking about firing them or cutting their salary.

However, do you, as a software developer, feel that the performance review system is good enough for you? Don’t you have any complaints? Haven’t you ever thought, “Setting goals for the year is hard, and I don’t know which number can be used as a standard for the review”?

I struggle with setting goals whenever I need to write them on the sheet provided by the company. I recently had a performance review session at my company, and I couldn’t get an answer to a question I’ve had for a long time: which numbers can be used to review software developers? So in this post I’ll write down the issues and consider some solutions.

I’ve worked for a Japanese company since my career started. Even though I’m currently in Germany at one of its group companies, my performance review is still done under the Japanese company’s system.
Therefore, I don’t know how it’s done in other Japanese companies or in companies outside Japan.

Why is performance review needed in the first place

First, let’s consider why a performance review is necessary at all.

Determine how to distribute the money

I guess it’s normal in Japan for the result of the performance review to determine whether the salary increases or decreases. It also affects the bonus. A company needs to decide how to divide up the money.

The performance review result is the key to doing this without complaints from the employees. If every employee has a score, the money can easily be divided up. If the split depends on how much they contributed to the company, they probably won’t complain about it.

Improve skill set

How we feel about our own tasks is different from how others feel about them. What we personally want to achieve differs from what the project wants to achieve. Our personal goals are different from the company’s or the project’s goals.

These gaps need to be filled at some point, and the performance review is one of the chances to do it. Both good and bad things come up in the review. It shows us which skills we should strengthen and which habits or mindsets we should change. This is the well-known PDCA (Plan-Do-Check-Act) cycle. By repeating this process, we can adjust the direction of our careers and skill sets.

How often do you have the performance review per year

It depends on the company, but I guess it’s once or twice a year in many companies. Do you remember what you did 3 or 6 months ago? Perhaps you do, but the memory is probably not fresh. Do you remember what your team members did? You might remember only fragments.

In this situation, it’s not easy to give good feedback. If the feedback is abstract and covers a long period, the receiver might not be able to act on it.

I mentioned in the last section that we can improve our direction by repeating the PDCA cycle. However, if the performance review is done only once or twice a year, the cycle turns just as rarely. Improvement is not impossible, but it moves at a turtle’s pace.

We set our goals at the beginning of the year or of the third quarter. Are the goals still valid 3 months later? No one knows what will happen in the future. We create schedules at the beginning of our projects, but they will probably change because business requirements move quickly today. We developers often receive sudden new requirements that can break the original schedule. Given this pace, a performance review once or twice a year doesn’t seem to fit.

Numbers that might be used for the evaluation

We set goals with numbers or standards so that we can check later whether we reached them. So which numbers can be used for the evaluation? I give some examples below. Let’s consider them one by one.

Number of bugs

The fewer bugs developers make, the better the review score they get. Is that good? I don’t think so. Bugs fall into different severity levels. If a bug causes the application to crash, it’s a fatal error. If a bug occurs only when a user gives an irregular input that the application doesn’t expect, it can probably be left alone for a while. Small bugs that don’t break the main functionality are normally more numerous than fatal errors. But while a fatal error exists in an application, the small bugs can’t be found in the first place because the application doesn’t work as expected.

Even if the review points are weighted by bug severity, the score can’t be calculated correctly because the small bugs hidden behind the fatal error might not have been found yet.
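
As a minimal sketch of why a severity-weighted score is still unreliable, assume some made-up weights (the names and numbers below are hypothetical, not taken from any real review system). The same code gets a different score depending on whether the hidden bugs have surfaced yet:

```python
# Hypothetical severity weights, invented for this illustration only.
SEVERITY_WEIGHTS = {"fatal": 10, "major": 3, "minor": 1}


def weighted_bug_score(bugs):
    """Sum the weights of the bugs found so far (lower looks 'better')."""
    return sum(SEVERITY_WEIGHTS[severity] for severity in bugs)


# At review time only the crash is visible.
found_at_review = ["fatal"]
# Once the crash is fixed, the bugs it was hiding show up.
found_after_fix = ["fatal", "minor", "minor", "minor", "major"]

print(weighted_bug_score(found_at_review))  # 10
print(weighted_bug_score(found_after_fix))  # 16
```

The developer’s work did not change between the two calls. Only the timing of the discovery did, and that timing is what the score ends up measuring.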

It’s also difficult to determine who made the bugs. Why? Let’s consider this case.

Developer A implemented a function without any bug
Developer B added a feature to the same function
A bug is found here

Who made the bug? At first glance, without looking at the code, we might want to say “Developer B”. Is that correct? The code that Developer A wrote might be awful, and Developer A might not have written unit tests. If code is difficult to read and has no unit tests, other developers struggle with it and are likely to introduce bugs. Again, who made the bug?

In my opinion, handling bugs is the responsibility of the team, not the individual. Did everyone on the team have the same picture? Did a reviewer check the code well? Was the communication within the team good enough?

Developers who work with legacy code may produce more bugs than developers who work with clean code. In addition, the bigger the application is, the more bugs lurk in it. Conversely, if a developer writes only 100 lines, there might be zero bugs. That can happen when the developer has to do many other things, such as writing specifications, management, and user support.

If the number of bugs is used as a standard for the review, it’s impossible to evaluate properly.

Number of code lines

The number of lines is even harder to use than the number of bugs. The code we write will be amended the next time we need to add a feature or change the behavior. Even if we add 100 lines to a function, that can shrink to 30 after refactoring.

If we are familiar with design patterns, the code tends to be better structured and might have fewer lines. Design patterns don’t necessarily reduce the line count, but if we pay attention to DRY (Don’t Repeat Yourself), the number of lines decreases.

If we know lots of external packages and tools, we don’t have to implement the functionality ourselves in the first place, which saves a lot of lines.

For these reasons, the number of lines can’t be used for the evaluation.

Number of PRs

I think this can be used for evaluation. If we code faster than other developers, we open more PRs than they do. But if this number is used for the evaluation, we need to be aware of other factors.

If the tasks we take on are easy, they get done in a short time, and then we open a PR. If we take only easy tasks throughout the year, we have more PRs than others.

How about small but difficult tasks? Such a task takes a while to complete, so we end up with fewer PRs. If a task is large, we might be able to break it into small tasks, in which case we open more PRs.

Let’s consider the case where we are reviewers of PRs.

Reviewing many PRs sounds nice. Those PRs will include small, big, easy, and difficult tasks, so the number seems usable for the evaluation. However, if the reviewer doesn’t spend time on each review, the quality might be low. If we evaluate this number, we also need to take into account how other developers rate the reviews.

Reputation of the code reviews

Let’s consider the case where we are reviewees.

It’s not strictly necessary, but if possible, we should tell the reviewer what to review and which part we want them to focus on. Then it’s clear to the reviewer what to do.

When it comes to code readability, cyclomatic complexity could be used. If no such analysis tool is introduced in the project, it’s challenging to evaluate the code itself because the perception of readability depends on the developer. We might all share the same big picture of readable code, but in the details everyone has their own preferences. Team size matters too: with only 3 team members, the review score is more easily biased than in a team of 6.
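
To give a feeling for what a cyclomatic complexity number is, here is a rough sketch that approximates it for Python source using only the standard `ast` module. The function name and the sample code are my own, and the counting rule (1 plus the number of branching constructs) is a simplification. A real project would more likely use a dedicated analysis tool.

```python
import ast

# Node types counted as one decision point each (a simplified rule).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.IfExp, ast.ExceptHandler)


def approx_cyclomatic_complexity(source: str) -> int:
    """Return 1 + the number of decision points found in the source."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds two extra paths.
            complexity += len(node.values) - 1
    return complexity


SAMPLE = '''
def classify(n):
    if n < 0:
        return "negative"
    elif n == 0:
        return "zero"
    for d in range(2, n):
        if n % d == 0 and d > 1:
            return "composite"
    return "other"
'''

print(approx_cyclomatic_complexity(SAMPLE))  # 6
```

A number like this is objective, but it only says how many paths the code has. It does not say whether a teammate finds the code pleasant to read.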

How about being a reviewer?

If we review only one PR and the reviewee gives us a good rating, that rating is not reliable. Even if we reviewed many PRs, it doesn’t help much if they were all opened by one person. Both the number of PRs and the number of people need to be taken into account to get rid of the bias.

Some might think of this case: if there is only one expert and 5 beginners on the team, the expert might get a better score than the others. But this reputation is not decided only by the code. It’s rather about communication skills. If the expert can’t explain the solutions well, the 5 beginners can’t understand them and will give a low score. And if beginners don’t understand something but ask about it, that mindset deserves recognition.

Mapping an impression to a review score is challenging. If it’s possible to create a list of criteria, that’s wonderful, but if not, let’s leave it to the boss.
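
A small sketch of that idea, assuming ratings arrive as hypothetical `(reviewee, score)` pairs and using made-up thresholds: the reputation is only reported when enough PRs and enough distinct people are behind it, and each person counts once no matter how many PRs they opened.

```python
from collections import defaultdict
from statistics import mean


def reviewer_reputation(ratings, min_prs=5, min_people=3):
    """Average the ratings a reviewer received, or return None if the
    sample is too small or comes from too few people to be trusted.

    The thresholds are arbitrary values chosen for this illustration.
    """
    by_person = defaultdict(list)
    for reviewee, score in ratings:
        by_person[reviewee].append(score)

    if len(ratings) < min_prs or len(by_person) < min_people:
        return None

    # Average per person first so one prolific reviewee can't dominate.
    return mean(mean(scores) for scores in by_person.values())


one_fan = [("alice", 5)] * 6
mixed = [("alice", 5), ("alice", 5), ("alice", 5), ("bob", 3), ("carol", 4)]

print(reviewer_reputation(one_fan))  # None
print(reviewer_reputation(mixed))    # 4
```

In the second case a plain average over all five ratings would be 4.4, pulled up by the three ratings from the same person.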

This is still not perfect but it seems to be better than the other standards.

Code performance improvement

Code performance? It’s important, but evaluating its improvement is complex. Of course, it’s excellent to be able to find hidden issues, their root causes, and the solutions. There is no doubt about that. However, how can we compare these two developers?

One who knows Big O notation and scalability and keeps them in mind from the beginning
One who just implements without thinking about how the scale will grow

The former might not get a better score than the latter, because the issue never comes up and they have no chance to improve it. Yet the former developer is definitely much better than the latter. If the original code is written well enough, performance doesn’t become a topic.

Even with absolute evaluation, a good developer won’t be happy when others earn a good score for fixing performance problems while they never had the chance to, precisely because their code was good from the start.

Scheduling tasks to achieve the goals

Scheduling is necessary. We developers might have a schedule for the year. If we make progress according to the schedule, that’s excellent. However, I’d say it’s impossible to stick to the original schedule because business moves fast and the requirements change quickly.

We developers don’t have the right to decide the schedule. The boss is at the wheel, and all we can do is say, “Let’s do this first, for this reason”. We don’t know if the boss will agree. In the end, the decision is made by the boss.

If there are lots of interruptions in the year, it’s impossible to judge whether the tasks are done according to the original schedule.

A recent bad event has a big impact on the review score

We’ve filled out the sheet and it’s time for the review. What if we make a mistake that causes an incident right before it? What will our boss think? I’ve read a book that said, “A negative thing is 3 times stronger than a positive thing”. If an incident happens shortly before the performance review, it will probably affect the score because negative things are powerful.

What if the boss is not in a good mood for some reason? That might also affect the score. If these things happen, we’re simply unlucky. We want to avoid this as much as possible.

What we can do for a better performance review

Of course, it’s difficult to evaluate properly without receiving any complaints, but let’s consider better ways.

Evaluate as a team rather than as individuals

Looking at the big picture is easier than looking at the details. A work of art can fascinate us at first sight even though we have no knowledge about it. I think this applies to reviews too. Each member is a detail, and it’s not possible to evaluate every single skill of an individual.

A product is made by several teams. A piece of functionality is made by a team. So we should evaluate the team instead of individuals. A team can be seen as a work of art: it’s easier to compare teams than to compare individuals. If the evaluation is done for the whole team, members have a reason to communicate actively with each other to improve the score, to share ideas for improving the development cycle, and to help each other.

Set goals and give feedback frequently

I mentioned that the score might be affected by who our team members are and by the boss’s mood. With a low sampling rate, this is very likely to happen. With a high sampling rate, it is less likely.

So let’s set one or two goals every 2 or 4 weeks and take time for a review at the end of each period. The quality of the feedback becomes much better than when it’s done once a year. If we have many review results for the year, we can average them and reduce the bias.

Two or four weeks is not the distant future, so it’s easier to set goals and focus on them during the period. After receiving feedback, we can easily take action because the feedback is of better quality and we are likely still on the same track as in the last period.

If the team already practices agile development, this is easy to do.
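
The effect of the sampling rate can be shown with a tiny calculation. The scores below are invented: 26 biweekly reviews on a 1 to 5 scale, where only the last one falls right after an incident.

```python
from statistics import mean

# Invented data: 25 ordinary periods and one bad period at the very end.
biweekly_scores = [4] * 25 + [1]

# A single yearly review held right after the incident sees only this.
annual_only = biweekly_scores[-1]

# With frequent reviews the bad period is one sample among many.
averaged = mean(biweekly_scores)

print(annual_only)         # 1
print(round(averaged, 2))  # 3.88
```

The bad period still counts in the average, but it no longer decides the whole year.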

Try 360-degree feedback

If we all work as developers, we know how good the other team members are. We know which skills they excel in and which they don’t. If we have enough reviews from expert developers, the score is more reliable. The feedback can be detailed since we work together every day.

The problem is that it’s difficult to do if the team is small, for example when there are only 2 or 3 members.

Some might think, “An intermediate programmer will likely get a better score than experts on other teams if their own team has only beginner programmers”. That could be true. A checklist should be created so that reviewers know which points of view to check from. It’s still not perfect even with the checklist, but it probably reduces the bias.

Conclusion

The performance review is hard work for everyone. I always struggle with setting my goals. We don’t have to introduce all 3 proposals above at once. One option is to adopt a single proposal first and add another next time.

No single way fits all organizations. Each organization needs to consider the framework and identify the method that fits it.