Belief & evidence in empirical software engineering (2016)
Programmers can hold very strong beliefs about certain topics. Devanbu, Zimmermann, and Bird conducted a case study on such beliefs at Microsoft, and found that they are mostly based on personal experience rather than empirical evidence, and don’t necessarily correspond with what happens in reality.
Why it matters
Programmers – like other humans – believe all sorts of things, e.g. that shorter functions are more maintainable than longer ones.
These beliefs can change over time, as we learn from new discoveries. However, we’re less likely to accept new discoveries if they strongly contradict what we already believe, unless the evidence is really, really convincing. (There are two major views of statistical analysis. The frequentist view is entirely observation-based, whereas Bayesians also take prior beliefs into account: given two contradicting experimental results with identical p-values, the less surprising result is the more convincing one.)
The consequences of ignoring empirical evidence can be severe: imagine how you’d feel if you were diagnosed with acute appendicitis and your doctor used only homeopathic treatments simply because she “believes” they work best!
Fortunately this doesn’t really happen in medicine; organisations like Cochrane and the American College of Physicians have made it easy for practitioners to access and keep up with the latest empirical findings.
Software engineering isn’t quite there yet, as this study shows.
How the study was conducted
The study consists of two parts.
In the first part, a list of empirically falsifiable claims (you can find them in the table below) was compiled and presented in a survey among 564 programmers at Microsoft. Respondents were asked to rate how much they agreed with each claim on a 5-point Likert scale.
In the second part, after analysing the survey results, the authors selected a single claim that turned out to be highly controversial and conducted a case study within Microsoft to determine whether the claim is true or false.
What discoveries were made
Results for the survey and the case study are presented separately.
The table below shows the programmers’ opinions about each claim. Scores are on a scale of 1 (strongly disagree) to 5 (strongly agree), and variance is a measure of disagreement between respondents:
|Claim||Score||Variance|
|Code quality (defect occurrence) depends on which programming language is used||3.17||1.16|
|Fixing defects is riskier (more likely to cause future defects) than adding new features||2.63||1.08|
|Geographically distributed teams produce code whose quality (defect occurrence) is just as good as teams that are not geographically distributed||2.86||1.07|
|When it comes to producing code with fewer defects specific experience in the project matters more than overall general experience in programming||3.5||1.06|
|Well commented code has fewer defects||3.4||1.05|
|Code written in a language with static typing (e.g., C#) tends to have fewer bugs than code written in a language with dynamic typing (e.g., Python)||3.75||1.02|
|Stronger code ownership (i.e., fewer people owning a module or file) leads to better software quality||3.75||1.02|
|Merge commits are buggier than other commits.||3.4||0.97|
|Components with more unit tests have fewer customer-found defects||3.85||0.95|
|More experienced programmers produce code with fewer defects||3.86||0.94|
|More defects are found in more complex code||4.0||0.93|
|Factors affecting code quality (defect occurrence) vary from project to project||3.8||0.92|
|Using asserts improves code quality (reduces defect occurrence)||3.78||0.89|
|The use of static analysis tools improves end user quality (fewer defects are found by users)||3.77||0.87|
|Coding standards help improve software quality||4.18||0.79|
|Code reviews improve software quality (reduces defect occurrence)||4.48||0.64|
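The two numeric columns above can be reproduced from raw responses with a few lines of code. The following is a minimal sketch, assuming the score column is the mean response and using population variance; the source does not say which variance formula the authors used, and the response lists here are invented for illustration, not taken from the survey.

```python
from statistics import mean, pvariance


def summarize_likert(responses):
    """Return (mean score, variance) for a list of 1-5 Likert responses.

    Population variance is an assumption; sample variance would be
    an equally plausible choice.
    """
    if not responses:
        raise ValueError("no responses")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("Likert responses must be integers from 1 to 5")
    return mean(responses), pvariance(responses)


# Hypothetical responses, invented for illustration (not the survey data).
consensus = [4, 4, 5, 4, 5, 4]        # respondents mostly agree with each other
controversial = [1, 5, 2, 5, 1, 4]    # respondents are split

print(summarize_likert(consensus))
print(summarize_likert(controversial))
```

A claim whose responses cluster around one value gets a low variance, while a claim that splits respondents into opposing camps gets a high one, which is why variance works as a rough measure of controversy.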
Respondents who strongly agreed or disagreed with a claim were asked which sources they based their opinion on and which sources they valued most. The overall ranking is as follows:
- Personal experience
- Peer opinion
- Trade journal
- Research paper
Strength of social connection appears to have a greater influence on programmers’ opinions than credibility of empirical evidence.
This is consistent with earlier findings within general populations, but it’s not exactly how a professional should form opinions.
The case study
The authors examined the highly controversial claim “Geographically distributed teams produce code whose quality, viz., defect occurrence, is just as good as teams that are not geographically distributed” in more detail, using data from two large projects within Microsoft: an operating system and a Web service.
Both projects have a similar number of programmers and are developed by teams that are distributed around the world.
Curiously enough, survey respondents who were part of the operating system project overwhelmingly disagreed with the claim, while respondents from the Web service project largely agreed with it.
Is there really a difference in quality between the two projects?
The authors counted the number of bug fixes that were applied to each of the projects’ files and then determined, for each file, whether at least 75% of its commits were made within the same building, city, region, or country.
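One way to picture this classification step is the sketch below. It assumes that each commit carries a location label for every geographic level, that labels are globally unique, and that a file is assigned to the narrowest level at which a single location accounts for at least 75% of its commits. The function name, the data layout, and the "narrowest level" rule are hypothetical choices for illustration, not the authors' implementation.

```python
from collections import Counter

# Geographic levels from narrowest to widest, as listed in the text.
LEVELS = ("building", "city", "region", "country")


def distribution_level(commit_locations, threshold=0.75):
    """Return the narrowest level at which one location holds at least
    `threshold` of a file's commits, or None if no level qualifies.

    commit_locations: one dict per commit, mapping each level name
    to a location label (hypothetical data layout).
    """
    if not commit_locations:
        raise ValueError("file has no commits")
    total = len(commit_locations)
    for level in LEVELS:
        counts = Counter(loc[level] for loc in commit_locations)
        _, top_count = counts.most_common(1)[0]
        if top_count / total >= threshold:
            return level
    return None


# Hypothetical commits, invented for illustration.
home = {"building": "b1", "city": "c1", "region": "r1", "country": "k1"}
next_door = {"building": "b2", "city": "c1", "region": "r1", "country": "k1"}
abroad = {"building": "b9", "city": "c9", "region": "r9", "country": "k9"}

print(distribution_level([home, home, home, abroad]))       # 3 of 4 commits share a building
print(distribution_level([home, home, next_door, next_door]))
print(distribution_level([home, home, abroad, abroad]))
```

In this sketch a file that fails the threshold at every level returns `None`, which would correspond to a file developed across countries.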
Some control variables were introduced to make sure that the analysis really measured the effect of geographic distribution:
- Average size in LOC: the larger a file is, the more likely it is to contain defects;
- Total number of commits: the more a file changes, the more likely it is that defects will be introduced;
- Number of distinct contributors: the number of programmers involved in a file is known to influence its quality;
- Percentage of commits made by most frequent committer: strong file ownership also tends to influence quality.
After running these variables through linear regression models, the authors found that the effect of distributed development is minimal, which is consistent with prior studies on this topic.
For both projects, it’s very slightly better to be in the same building, while for the Web service project it’s also possibly very slightly better to be within the same city. In all other cases, the quality of files that have commits within the same geographic area is very slightly worse.
This shows that the programmers of the operating system project have formed beliefs that turn out to be false.
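The regression setup described above can be sketched roughly as follows. This is plain ordinary least squares solved via the normal equations, which may differ from the authors' exact model. For brevity it uses only two of the four control variables plus a 0/1 indicator for distributed development. All names and numbers are invented for illustration; the outcome is generated from known coefficients so that the fit can be checked.

```python
def ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    rows: one list of predictor values per file; an intercept column
    is added here. Returns [intercept, b1, b2, ...].
    """
    X = [[1.0] + [float(v) for v in r] for r in rows]
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * t for x, t in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        if abs(A[p][c]) < 1e-12:
            raise ValueError("predictors are collinear")
        A[c], A[p] = A[p], A[c]
        pivot = A[c][c]
        A[c] = [v / pivot for v in A[c]]
        for r in range(k):
            if r != c:
                f = A[r][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] for i in range(k)]


# Synthetic files: (size in LOC, number of commits, distributed? 0/1).
files = [(100, 2, 0), (200, 5, 1), (150, 3, 0),
         (400, 9, 1), (250, 4, 1), (300, 8, 0)]
# Synthetic outcome built from known coefficients, so this is NOT real data.
bug_fixes = [1 + 0.01 * loc + 0.5 * commits + 0.1 * dist
             for loc, commits, dist in files]

intercept, b_loc, b_commits, b_distributed = ols(files, bug_fixes)
print(round(b_distributed, 6))
```

The coefficient on the distribution indicator estimates how many extra bug fixes a distributed file receives once size and churn are held constant. A value close to zero relative to the other coefficients is what "minimal effect" means in this context.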