Chuniversiteit.nl
The Toilet Paper

Do Java developers write better Python? Studying off-language code quality on GitHub

What happens when you let Java and C++ developers write Python code?

[Illustration: two artists try to paint the Python logo. One result looks like the Java logo, the other like the C++ logo. Caption: Each person has their own, unique style.]

Things can often be coded in different ways. For instance, you can use different algorithms, use fewer or more lines of code, implement functionality using different libraries or frameworks, or use a certain code style.

Why it matters

Most programming language communities have coding conventions. These conventions ensure that code written by different people looks similar. This can make code more readable, less prone to errors, and more maintainable.

Spend enough time with a language, and you will eventually be able to apply all of a language’s conventions effortlessly.

So what happens when you switch to a different language? You might write code that’s less maintainable or more prone to errors. Or maybe you’re actually able to write better code, because your new language has fewer (or worse) conventions.

Well, let’s find out what happens!

How the study was conducted

A very large part of today’s open source development happens on GitHub. GitHub provides an API that can be used to retrieve data about its platform, but there is (or was) also the GHTorrent project, which mirrored the public parts of GitHub’s repositories, user profiles, commits, issues, and other artifacts.

The researchers used the latter to look for developers who have made a large number of contributions in their primary language, and a much smaller number in some secondary language. We can treat these developers as the experimental group. We also need a control group; that one consists of users that only contributed using one programming language.
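
The grouping described above can be sketched roughly as follows. This is only an illustration: the thresholds `MIN_PRIMARY_COMMITS` and `MAX_SECONDARY_SHARE`, the function name `classify_developer`, and the input format are all hypothetical, and the study's actual cut-offs are not stated here.

```python
# Hypothetical thresholds, chosen only for illustration.
MIN_PRIMARY_COMMITS = 100    # "a large number" of contributions
MAX_SECONDARY_SHARE = 0.2    # secondary language is "much smaller"


def classify_developer(commits_per_language):
    """Classify a developer from a {language: commit count} mapping.

    Returns ("control", primary, None) for single-language developers,
    ("experimental", primary, secondary) for developers with a dominant
    primary language and a small secondary one, and None otherwise.
    """
    counts = {lang: n for lang, n in commits_per_language.items() if n > 0}
    if not counts:
        return None

    # Rank languages by number of contributions, most used first.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    primary, primary_count = ranked[0]
    if primary_count < MIN_PRIMARY_COMMITS:
        return None

    if len(ranked) == 1:
        return ("control", primary, None)

    secondary, secondary_count = ranked[1]
    if secondary_count <= MAX_SECONDARY_SHARE * primary_count:
        return ("experimental", primary, secondary)
    return None


print(classify_developer({"Java": 500, "Python": 30}))
print(classify_developer({"Python": 800}))
```

For simplicity, the sketch only looks at a developer's two most used languages.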

Then, the researchers mined the dataset for projects that were edited by developers using their secondary language.

For this study, they looked at Python projects that were edited by Java and C++ developers. These are compared to Python projects that were only edited by Python developers.

To study the effect of language switching, all projects were analysed using Pylint, which can find various types of issues in Python code:

  • fatal errors that result in code that doesn’t work at all;
  • errors that cause runtime errors when the code is executed;
  • warnings for code that is error prone or has severe style issues;
  • refactoring hints for complex or messy code; and
  • violations of coding conventions.
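
Pylint encodes these five categories in the first letter of each message ID (F, E, W, R, and C), so a list of reported IDs can be tallied per category. The snippet below is a minimal sketch of that idea; the helper `count_by_category` and the sample list are made up for illustration and are not taken from the study.

```python
from collections import Counter

# First letter of a Pylint message ID -> issue category.
PYLINT_CATEGORIES = {
    "F": "fatal",
    "E": "error",
    "W": "warning",
    "R": "refactor",
    "C": "convention",
}


def count_by_category(message_ids):
    """Count Pylint message IDs per category, based on their first letter."""
    return Counter(
        PYLINT_CATEGORIES.get(msg_id[0], "other") for msg_id in message_ids
    )


# Made-up sample of reported messages:
# C0301 line-too-long, C0103 invalid-name, W0611 unused-import,
# W0612 unused-variable, E0602 undefined-variable, R1705 no-else-return.
sample = ["C0301", "C0103", "W0611", "W0612", "E0602", "R1705", "C0301"]
counts = count_by_category(sample)
print(counts)
```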

What discoveries were made

The analysis ended up including data for 84 Java developers, 91 C++ developers, and 100 Python developers.

The table below shows the differences in code quality per issue type (lower is better):

Code quality issue           Java group   C++ group
Line too long                3.59         1.44
Invalid name                 1.43         1.52
Wrong import order           -            1.83
Ungrouped imports            0.16         0.14
Bad whitespace               0.38 (group unclear)
Unnecessary semicolon        4.42         20.62
Redefining built-in names    0.57         -
Bad indentation              3.39         3.28
Redefining outer name        1.68         2.21
Undefined loop variable      -            3.28
Unused import                0.63         0.81
Unused variable              1.56         2.25
Complex method/function      0.84         1.48
Too many public methods      0.26         0.46
Too few public methods       0.34         0.58
No else return               -            1.52
Undefined variable           -            1.55
Assignment from no return    28.27        -

What might be surprising is that Java/C++ developers sometimes write better code than Python developers. The researchers provide the following explanations for each individual result:

  • Line too long: Python lines should not be longer than 80 characters. C++ and Java developers tend to write lines that are longer than that.

  • Invalid name: Class names in Python should be CamelCased, while method and field names should be snake_cased. Programmers from the other two languages regularly violate these naming conventions.

  • Wrong import order: Module imports should be ordered such that standard libraries are imported first, followed by third-party libraries, and finally local imports. C++ developers violate this convention a lot more often, but Java developers seem to do the same thing as Python developers.

  • Ungrouped imports: Multiple imports from the same package should be grouped together. Java and C++ developers stick to this convention far more often than Python developers.

  • Bad whitespace: C++ and Java developers are less likely to miss or add too much whitespace around operators, brackets, and blocks than Python developers.

  • Unnecessary semicolon: Python doesn’t need semicolons at the end of lines, but (especially) C++ and Java developers tend to add them anyway.

  • Redefining built-in names: Developers may accidentally choose variable names that are already taken by built-ins (e.g. input and str). This may cause unexpected or confusing errors. Java developers do this less often than Python developers, despite being less familiar with the language. This is probably because they use IDEs (which would point out such mistakes) rather than simple text editors.

  • Bad indentation: Whitespace is important in Python, so it helps if tabs and spaces are used consistently. Java and C++ developers aren’t as good at this as Python developers.

  • Redefining outer name: Shadowing names from outer scopes is discouraged in Python, but both Java and C++ developers do this more often than Python developers.

  • Undefined loop variable: Using loop variables outside the loop can be useful in some situations, but only when the loop was actually executed. C++ developers are 3 times more likely to write code with potentially undefined variables.

  • Unused import: Both Java and C++ developers are less likely to have unused imports in their files.

  • Unused variable: On the other hand, Java and C++ developers are more likely to forget about previously defined variables.

  • Complex method/function: C++ developers are more likely to write methods or functions with a cyclomatic complexity above 10.

  • Too many public methods: Java and C++ developers tend to make smaller classes and thus don’t run into this issue as often.

  • Too few public methods: The opposite, where classes are merely used as glorified data structures without any behaviour of their own, also occurs less often with Java and C++ developers.

  • No else return: Having an else branch after an if block that already ends with a return is considered bad style, because the else is unnecessary. C++ developers use this more often than Python developers.

  • Undefined variable: Code that uses undefined variables is often not reachable right now, but might become reachable when the code is modified in the future and thus cause errors later. C++ developers are more likely to write code with undefined variables.

  • Assignment from no return: Java developers are more likely to use “void” functions in assignments or as expressions, possibly because these would have been checked in Java during compilation – but not in Python.
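
To make a few of these issue types more concrete, the snippet below shows the kind of code that Pylint reports as no-else-return, assignment-from-no-return, and undefined-loop-variable. All function names are invented for illustration. The code is a sketch of the patterns, not code from the studied projects.

```python
def sign_with_else(x):
    # no-else-return: the else is redundant, because the if branch
    # already returns.
    if x < 0:
        return "negative"
    else:
        return "non-negative"


def sign_without_else(x):
    # Same behaviour, written without the unnecessary else.
    if x < 0:
        return "negative"
    return "non-negative"


def log_message(text):
    # This function has no return statement, so a call evaluates to None.
    print(text)


def last_item(items):
    # undefined-loop-variable: `item` only exists if the loop body
    # was executed at least once.
    for item in items:
        pass
    return item


# assignment-from-no-return: Python happily runs this, but `result`
# is simply None. A "void" method in Java would be rejected at compile time.
result = log_message("hello")

print(sign_with_else(-1), sign_without_else(-1))
print(result)
print(last_item([1, 2, 3]))

try:
    last_item([])
except NameError as error:
    # An empty list means the loop variable was never assigned.
    print("failed:", type(error).__name__)
```

Note that none of these issues stop the program from starting. They only show up at runtime (or never), which is why a linter is needed to find them.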

Summary

  1. Java and C++ developers are less familiar with Python conventions, but might (inadvertently) write cleaner Python code anyway.
