A version control system like Git maintains a record of code changes in the form of commits. Each commit contains changes to source code (and possibly other artefacts) and a message that describes the changes. This allows collaborators to understand the context of the change and its impact on the project. For long-lived projects, commit messages might be the only source of information left for future developers who wish to understand what changes were made and why.
This is also basically the tl;dr: your commit messages should communicate what changes are made and why. Of course it’s a bit more nuanced than that, so keep on reading if you want to learn more.
In practice the quality of commit messages varies wildly. A previous study found that about 14% of commit messages in 23,000 open-source projects were completely empty and as many as 75% only contained a few words. A mere 10% of commits had messages with “normal” English sentences!
The authors manually classified 1,597 commits from five major open-source Java projects into four types: 1) why and what; 2) what, but no why; 3) why, but no what; and 4) neither why nor what. Although the first type is the most common, the latter three types still make up about 44% of all commit messages.
The why is most often left out of messages, presumably because developers find it more challenging to describe the rationale behind their changes.
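The four types follow from two yes/no judgements per message. Here is a minimal sketch in Python, assuming a human reader has already made both judgements. The names `MessageType` and `classify` are hypothetical and do not come from the study.

```python
from enum import Enum


class MessageType(Enum):
    """The four message types, numbered as in the text above."""
    WHY_AND_WHAT = 1
    WHAT_ONLY = 2
    WHY_ONLY = 3
    NEITHER = 4


def classify(has_why: bool, has_what: bool) -> MessageType:
    """Map two manual judgements about a message onto one of the four types."""
    if has_why and has_what:
        return MessageType.WHY_AND_WHAT
    if has_what:
        return MessageType.WHAT_ONLY
    if has_why:
        return MessageType.WHY_ONLY
    return MessageType.NEITHER
```

A message such as "Rename helper" would probably be judged as `classify(False, True)`, which falls into the second type.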
A very small portion of commit messages does not contain any useful information. These can be grouped into five categories:
- Single-word messages, like “merge”, “polish” or a file name;
- Submit-centred messages that simply express the fact that the commit “changes” something;
- Scope-centred messages which primarily convey the size of the change, e.g. “minor change”;
- Redundant messages that repeat information that’s already in the diff;
- Irrelevant messages.
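Three of these categories can be roughly spotted from the message text alone. The sketch below is a hypothetical heuristic, not the method used in the study. The word lists and the function name `flag_uninformative` are assumptions chosen for illustration. Redundant and irrelevant messages are not detected, because that would require looking at the diff.

```python
import re
from typing import Optional

# Assumed word lists for illustration only; tune them for your own project.
SUBMIT_CENTRED = re.compile(
    r"^(changes?|updates?|commit|modified)( (code|files?|stuff))?$",
    re.IGNORECASE,
)
SCOPE_CENTRED = re.compile(
    r"^(minor|small|tiny|big|major|large) (changes?|updates?|fix(es)?|tweaks?)$",
    re.IGNORECASE,
)


def flag_uninformative(message: str) -> Optional[str]:
    """Return a category name if the message looks uninformative, else None."""
    text = message.strip().rstrip(".!")
    if not text:
        # Not one of the five categories, but clearly carries no information.
        return "empty"
    if SCOPE_CENTRED.match(text):
        return "scope-centred"
    if SUBMIT_CENTRED.match(text):
        return "submit-centred"
    if len(text.split()) == 1:
        return "single-word"
    return None
```

A check like this could run in a commit hook as a gentle reminder. It only catches the most obvious cases and says nothing about whether a longer message actually explains the why and what.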
The why and what should be clear for each commit, but that doesn’t mean that they need to be expressed explicitly. Both the why and what can be omitted when the reason for a change is common sense or can be explained by the change itself.
The authors identified five types of “why” expression categories:
Describe issue: Commits in this category directly describe the motivation of a code change. This can be done by describing an error scenario, citing errors or warnings from quality assurance tools, or describing shortcomings or weaknesses in the current implementation.
Illustrate requirement: A message can also describe the requirements that led to the commit, e.g. user needs, obsolescence of features, or a change in the runtime or the environment.
Describe objective: Some commit messages are more forward-looking and describe the purpose of the change, e.g. to fix a defect or improve the code in some way.
Imply necessity: Commit messages can describe the need for changes in an indirect way, for instance by mentioning conventions or standards, how it relates to a previous commit or a bigger change, or the benefits that a change might bring.
Missing why: In some cases the rationale is common sense or can be easily inferred, e.g. when adding test cases, fixing typos, updating text, annotations or version numbers, or refactoring code.
They also found four types of “what” expression categories:
Summarise code object change: Commit messages can summarise the changes in a commit. This can be done by highlighting characteristics of the change, summarising the change, describing the “before” and “after” states of the code, or simply by listing the changes.
Describe implementation principle: A commit message can describe the technical principles that underpin the changes. This type of description isn’t seen very often.
Illustrate function: Messages in this category explain code changes from a functional or behavioural perspective. This is one of the more common categories.
Missing what: Changes that are small and simple, like the correction of typographic errors, do not require a specification of what has changed.
These nine expression categories are not evenly distributed over the different types of maintenance activity. The table below shows how often each expression category occurs with each major type of maintenance activity. This can be useful for those who are not sure what to write in their commit message. First determine the type of change you’re making, then make sure that your message contains at least the two most common “why” and “what” categories for that type!
| Category | Expression category | Corrective, N=116 (%) | Adaptive, N=63 (%) | Perfective, N=73 (%) |
|---|---|---|---|---|
| How to express “Why” | Describe issue | 45.7 | 12.7 | 6.9 |
| | Describe issue & Describe objective | 0.8 | 0.0 | 0.0 |
| | Describe issue & Imply necessity | 2.6 | 0.0 | 0.0 |
| | Illustrate requirement & Imply necessity | 0.8 | 1.6 | 0.0 |
| How to express “What” | Summarise code object change | 58.6 | 60.3 | 76.7 |
| | Describe implementation principle | 4.3 | 1.6 | 0.0 |
| | Summarise code object change & Illustrate function | 8.6 | 7.9 | 1.4 |
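The lookup described above can be sketched as a small Python helper. The percentages are copied from the rows shown in the table, and rows that do not appear there are not included, so the ranking is only as complete as the table itself. The names `TABLE` and `most_common` are hypothetical.

```python
# Percentages per maintenance activity, copied from the rows shown in the table.
TABLE = {
    "why": {
        "Describe issue": {"corrective": 45.7, "adaptive": 12.7, "perfective": 6.9},
        "Describe issue & Describe objective": {"corrective": 0.8, "adaptive": 0.0, "perfective": 0.0},
        "Describe issue & Imply necessity": {"corrective": 2.6, "adaptive": 0.0, "perfective": 0.0},
        "Illustrate requirement & Imply necessity": {"corrective": 0.8, "adaptive": 1.6, "perfective": 0.0},
    },
    "what": {
        "Summarise code object change": {"corrective": 58.6, "adaptive": 60.3, "perfective": 76.7},
        "Describe implementation principle": {"corrective": 4.3, "adaptive": 1.6, "perfective": 0.0},
        "Summarise code object change & Illustrate function": {"corrective": 8.6, "adaptive": 7.9, "perfective": 1.4},
    },
}


def most_common(kind: str, activity: str, n: int = 2) -> list[str]:
    """Return the n most frequent expression categories for a kind and activity."""
    rows = TABLE[kind]
    # Sort category names by their percentage for this activity, highest first.
    ranked = sorted(rows, key=lambda name: rows[name][activity], reverse=True)
    return ranked[:n]
```

For a bug fix, `most_common("why", "corrective", n=1)` points at "Describe issue", which suggests describing the error scenario that motivated the change.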
Once you know all this, it’s very tempting to build a classification model that can automatically determine the quality of commit messages. This just so happens to be the final contribution of this study. The authors used several techniques and found that models based on Bi-LSTM perform best at classifying whether a commit message describes the why and what of a change. The accuracy is reportedly somewhere between 75.9 and 91.0 percent, but sadly there doesn’t appear to be a way to use these models yourself.
- Commit messages should communicate the why and what of a change;
- Both the why and the what can be implied under certain circumstances.