Can Java and C++ devs write better Python code than Python devs?
Each person has their own, unique style
I read and summarise software engineering papers for fun, and today we’re having a look at Do Java developers write better Python? Studying off-language code quality on GitHub (2018) by Horschig, Mattis, and Hirschfeld.
Things can often be coded in different ways. For instance, you can use different algorithms, use fewer or more lines of code, implement functionality using different libraries or frameworks, or use a certain code style.
Most programming language communities have coding conventions. These conventions ensure that code written by different people looks similar. This can make code more readable, less prone to errors, and more maintainable.
Spend enough time with a language, and you will eventually be able to apply all of a language’s conventions effortlessly.
However, each language has its own coding conventions (*). So what happens when you switch to a different language? You might write code that’s less maintainable or more prone to errors. Or maybe you’re actually able to write better code, because your new language has fewer (or worse) conventions.
(*) And in some cases there are actually multiple sets of conventions!
Well, let’s find out what happens!
A very large part of today’s open source development happens on GitHub. GitHub provides an API that can be used to retrieve data about its platform, but there is (or was) also the GHTorrent project, which mirrored the public parts of GitHub: repositories, user profiles, commits, issues, and other artifacts.
The researchers used the latter to look for developers who have made a large number of contributions in their primary language, and a much smaller number in some secondary language. We can treat these developers as the experimental group. We also need a control group; that one consists of users that only contributed using one programming language.
Then, the researchers mined the dataset for projects that were edited by developers using their secondary language.
For this study, they looked at Python projects that were edited by Java and C++ developers. These are compared to Python projects that were only edited by Python developers.
To study the effect of language switching, all projects were analysed using Pylint, which can find various types of issues in Python code:
- fatal errors that result in code that doesn’t work at all;
- errors that cause runtime errors when the code is executed;
- warnings for code that is error prone or has severe style issues;
- refactoring hints for complex or messy code; and
- violations of coding conventions.
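These five categories are visible in Pylint's message IDs, where the first letter encodes the category (F, E, W, R, or C). The following is a minimal sketch of tallying messages per category. The sample list of IDs is hypothetical, and this is not necessarily how the researchers processed Pylint's output.

```python
from collections import Counter

# The first letter of a Pylint message ID encodes its category.
CATEGORY_BY_PREFIX = {
    "F": "fatal",
    "E": "error",
    "W": "warning",
    "R": "refactor",
    "C": "convention",
}


def count_by_category(message_ids):
    """Count Pylint messages per category, based on the ID's first letter."""
    counts = Counter()
    for message_id in message_ids:
        counts[CATEGORY_BY_PREFIX[message_id[0]]] += 1
    return counts


# Hypothetical sample of message IDs reported for one project:
# line-too-long, invalid-name, unnecessary-semicolon, unused-import,
# undefined-variable, no-else-return.
sample_ids = ["C0301", "C0103", "W0301", "W0611", "E0602", "R1705"]
summary = count_by_category(sample_ids)
print(dict(summary))
```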
The analysis ended up including data for 84 Java developers, 91 C++ developers, and 100 Python developers.
The table below shows how often each issue type occurs in the Java and C++ groups relative to the Python group (lower is better; values below 1 mean fewer issues than Python developers):
Code quality issue | Java group | C++ group |
---|---|---|
Line too long | 3.59 | 1.44 |
Invalid name | 1.43 | 1.52 |
Wrong import order | — | 1.83 |
Ungrouped imports | 0.16 | 0.14 |
Bad whitespace | — | 0.38 |
Unnecessary semicolon | 4.42 | 20.62 |
Redefining built-in names | 0.57 | — |
Bad indentation | 3.39 | 3.28 |
Redefining outer name | 1.68 | 2.21 |
Undefined loop variable | — | 3.28 |
Unused import | 0.63 | 0.81 |
Unused variable | 1.56 | 2.25 |
Complex method/function | 0.84 | 1.48 |
Too many public methods | 0.26 | 0.46 |
Too few public methods | 0.34 | 0.58 |
No else return | — | 1.52 |
Undefined variable | — | 1.55 |
Assignment from no return | 28.27 | — |
What might be surprising is that Java/C++ developers sometimes write better code than Python developers. The researchers provide the following explanations for each individual result:
Line too long: Python lines should not be longer than 80 characters. C++ and Java developers tend to write lines that are longer than that.
Invalid name: Class names in Python should be CamelCased, while method and field names should be snake_cased. Programmers from the other two languages regularly violate these naming conventions.
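As a rough illustration of these naming conventions, the sketch below checks names against two simplified patterns. The regular expressions and helper names are assumptions made for this example and are not the patterns Pylint itself uses.

```python
import re

# Simplified, assumed patterns; Pylint's real naming rules are more detailed.
SNAKE_CASE = re.compile(r"^[a-z_][a-z0-9_]*$")
CAMEL_CASE = re.compile(r"^_?[A-Z][a-zA-Z0-9]*$")


def is_valid_function_name(name):
    """Return True if the name looks snake_cased."""
    return bool(SNAKE_CASE.match(name))


def is_valid_class_name(name):
    """Return True if the name looks CamelCased."""
    return bool(CAMEL_CASE.match(name))


# A Java-style method name fails the snake_case check.
print(is_valid_function_name("getUserName"))
print(is_valid_function_name("get_user_name"))
print(is_valid_class_name("UserAccount"))
```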
Wrong import order: Module imports should be ordered such that standard libraries are imported first, followed by third-party libraries, and finally local imports. C++ developers violate this convention a lot more often, but Java developers seem to do the same thing as Python developers.
Ungrouped imports: Multiple imports from the same package should be grouped together. Java and C++ developers do this way more often than Python developers.
Bad whitespace: C++ and Java developers are less likely to miss or add too much whitespace around operators, brackets, and blocks than Python developers.
Unnecessary semicolon: Python doesn’t need semicolons at the end of lines, but Java and (especially) C++ developers tend to add them anyway.
Redefining built-in names: Developers may accidentally use variable names that are already taken by built-in names (e.g. `input` and `str`). This may cause unexpected or confusing errors. Java developers do this less often than Python developers, despite being less familiar with the language. This is probably because they use IDEs (which would point out such mistakes) rather than simple text editors.
Bad indentation: Whitespace is important in Python, so it helps if tabs and spaces are used consistently. Java and C++ developers aren’t as good at this as Python developers.
Redefining outer name: Shadowing names from outer scopes is discouraged in Python, but both Java and C++ developers do this more often than Python developers.
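Both kinds of shadowing mentioned above (built-in names and outer names) can be shown in a few lines. This is a small sketch with made-up function names, meant only to show why shadowing leads to confusing behaviour.

```python
def describe(value):
    # Redefining a built-in: `str` now refers to a string, not the type.
    str = "value: "
    try:
        return str + str(value)  # fails: a string is not callable
    except TypeError as error:
        return f"TypeError: {error}"


threshold = 10


def is_large(value, threshold=100):
    # Redefining an outer name: the parameter hides the module-level
    # `threshold`, so the comparison uses 100 unless the caller overrides it.
    return value > threshold


print(describe(42))
print(is_large(50))
```

A reader who only sees the module-level `threshold = 10` might expect `is_large(50)` to be true, which is the kind of confusion this convention tries to prevent.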
Undefined loop variable: Using loop variables outside the loop can be useful in some situations, but only when the loop was actually executed. C++ developers are 3 times more likely to write code with potentially undefined variables.
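The problem with loop variables is easiest to see with an empty sequence. In the sketch below (function names are invented for this example), the first function only works if the loop body ran at least once, while the second one gives the variable a value beforehand.

```python
def last_item(items):
    for item in items:
        pass
    # `item` is only bound if the loop ran at least once.
    return item


def last_item_safe(items, default=None):
    # Binding the name before the loop makes the empty case well defined.
    item = default
    for item in items:
        pass
    return item


print(last_item([1, 2, 3]))
print(last_item_safe([]))
```

Calling `last_item([])` raises an `UnboundLocalError`, because the loop body never ran and `item` was never assigned.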
Unused import: Both Java and C++ developers are less likely to have unused imports in their files.
Unused variable: On the other hand, Java and C++ developers are more likely to forget about previously defined variables.
Complex method/function: C++ developers are more likely to write methods or functions with a cyclomatic complexity above 10.
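Cyclomatic complexity is roughly the number of independent paths through a piece of code. The sketch below approximates it as one plus the number of decision points found with Python's `ast` module. It is a simplified counter written for this example, not Pylint's own implementation, and the two may disagree on some constructs.

```python
import ast

# Node types that are counted as one decision point each (a simplification).
DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def approximate_complexity(source):
    """Approximate cyclomatic complexity: 1 + number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, DECISION_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds one extra path per additional operand.
            complexity += len(node.values) - 1
    return complexity


example = '''
def classify(n):
    if n < 0:
        return "negative"
    for d in range(2, n):
        if n % d == 0 and d > 1:
            return "composite"
    return "other"
'''

score = approximate_complexity(example)
print(score)
```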
Too many public methods: Java and C++ developers tend to make smaller classes and thus don’t run into this issue as often.
Too few public methods: The opposite, where classes are merely used as glorified data structures without any behaviour of their own, also occurs less often with Java and C++ developers.
No else return: Having an `else` branch after an `if` branch that already ends in a `return` is considered bad style, because the `else` is unnecessary. C++ developers do this more often than Python developers.
Undefined variable: Undefined variables are often not reachable right now, but might become reachable when the code is modified in the future and thus cause errors later. C++ developers are more likely to write code with undefined variables.
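For the "no else return" hint described above, a short sketch with invented function names shows the flagged form next to the preferred one. Both behave identically; the second simply drops the redundant branch.

```python
def sign_with_else(n):
    if n < 0:
        return "negative"
    else:  # unnecessary: the `if` branch has already returned
        return "non-negative"


def sign(n):
    if n < 0:
        return "negative"
    return "non-negative"


print(sign(-3), sign(7))
```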
Assignment from no return: Java developers are more likely to use “void” functions in assignments or as expressions, possibly because these would have been checked in Java during compilation – but not in Python.
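A function without a `return` statement is the closest Python gets to a Java `void` method: calling it yields `None`. The sketch below uses a hypothetical `log_message` helper to show what happens when such a call is used in an assignment.

```python
def log_message(message, sink):
    """Append a message to sink; has no return statement (like a void method)."""
    sink.append(message)


sink = []
# Assignment from a function that returns nothing: Python accepts this
# silently and `result` becomes None, where Java would refuse to compile.
result = log_message("started", sink)

print(result)
print(sink)
```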