Make the most of PyLint

Static code analysis is the process of detecting flaws in source code without executing it. Static analysis tools are useful for catching common coding mistakes; here are some benefits of using them:

  • Make the source code more readable and maintainable.
  • Prevent unexpected behavior at runtime.
  • Optimize the execution.
  • Make the code more secure.

In the Python world, PyLint is the most popular tool for detecting issues in a Python code base. There are several ways to explore the results of PyLint:

• Text format: PyLint can generate text files, which can be turned into a customized text report or consumed by another tool to explore the analysis results.

• HTML format: an HTML report is a convenient way to present the PyLint issues; it can be stored on a server and shared with the team.

• IDE plugins: many PyLint plugins exist to explore the issues from the IDE (VS Code, PyCharm, …).
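As an illustration of the text format being consumed by another tool, here is a minimal Python sketch that parses report lines into dictionaries. It assumes lines shaped like `path:line:column: MSGID: message (symbol)`; the exact layout depends on the PyLint version and on any `--msg-template` in use, and the sample report is written by hand for illustration.

```python
import re

# Assumed line shape of a PyLint text report:
#   path:line:column: MSGID: message (symbol)
# Paths containing ":" (for example Windows drive letters) are not handled.
LINE_RE = re.compile(
    r"^(?P<path>[^:]+):(?P<line>\d+):(?P<column>\d+): "
    r"(?P<msg_id>[A-Z]\d{4}): (?P<message>.*) \((?P<symbol>[a-z0-9-]+)\)$"
)


def parse_pylint_text(report: str) -> list[dict]:
    """Return one dict per issue line; header and summary lines are skipped."""
    issues = []
    for raw in report.splitlines():
        match = LINE_RE.match(raw.strip())
        if match:
            issue = match.groupdict()
            issue["line"] = int(issue["line"])
            issue["column"] = int(issue["column"])
            issues.append(issue)
    return issues


# Hypothetical sample report, not real tool output.
SAMPLE_REPORT = """\
************* Module demo
demo.py:1:0: C0114: Missing module docstring (missing-module-docstring)
demo.py:4:0: W0311: Bad indentation. Found 3 spaces, expected 4 (bad-indentation)
"""

parsed = parse_pylint_text(SAMPLE_REPORT)
for issue in parsed:
    print(issue["path"], issue["line"], issue["symbol"])
```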

Let’s discover another way to explore and use PyLint issues: the Scanyp tool, which is free for students and OSS contributors. To try it out, let’s analyze the TensorFlow library with Scanyp.

1) Query the issues with CQLinq

CQLinq lets us query issues as if they were stored in a database. For example, you can get all the PyLint issues:

Or get the most recurrent issues:

Bad indentation issues are the ones PyLint reports most often. However, a list of thousands of issues is not useful to developers. Sometimes it’s preferable to ignore low-priority issues such as bad indentation.
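The same idea can be sketched in plain Python, outside Scanyp. The snippet below is not CQLinq; it assumes the issues are available as dictionaries with a `symbol` key (the name PyLint's JSON output uses for the rule), and the sample records are invented.

```python
from collections import Counter

# Hypothetical issue records; only the "symbol" key is used here.
issues = [
    {"symbol": "bad-indentation", "path": "a.py"},
    {"symbol": "bad-indentation", "path": "a.py"},
    {"symbol": "bad-indentation", "path": "b.py"},
    {"symbol": "unused-import", "path": "a.py"},
    {"symbol": "unused-import", "path": "b.py"},
    {"symbol": "broad-except", "path": "b.py"},
]


def most_recurrent(issues, ignored=frozenset()):
    """Count issues per symbol, skipping the symbols listed in `ignored`."""
    counts = Counter(i["symbol"] for i in issues if i["symbol"] not in ignored)
    return counts.most_common()


all_ranked = most_recurrent(issues)
filtered = most_recurrent(issues, ignored={"bad-indentation"})
print(all_ranked)
print(filtered)
```

PyLint itself can also silence a rule at analysis time through its `--disable` option, which avoids collecting the low-priority issues in the first place.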

It’s also interesting to identify the classes with the most issues:

The previous query is interesting, but it does not tell us exactly which classes lack quality, because large classes naturally collect more issues. Another useful metric to take into account is NBLinesOfCode. We can modify the previous query to calculate the ratio between the issue count and NBLinesOfCode.
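A small Python sketch of that ratio, with invented figures, shows why it changes the ranking: a big class with many issues can still be cleaner per line than a small one. The function name and the sample data are hypothetical and are not part of Scanyp.

```python
def issue_density(classes):
    """Rank classes by issues per line of code, highest first."""
    ranked = [
        (name, issue_count / lines_of_code)
        for name, issue_count, lines_of_code in classes
        if lines_of_code > 0  # skip empty classes to avoid dividing by zero
    ]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical figures: (class name, issue count, lines of code).
sample_classes = [
    ("BigParser", 40, 2000),
    ("SmallHelper", 10, 50),
    ("Empty", 0, 0),
]

density = issue_density(sample_classes)
for name, ratio in density:
    print(f"{name}: {ratio:.3f} issues per line")
```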

We can also search for the most used methods that have issues. Bugs in such methods should be given high priority.

2) Generate Issues Trend

Having issues in a project is not unusual; any project can have many problems to resolve. What matters is the quality trend of the project: it’s a bad sign if the number of issues grows as the code changes and evolves. Scanyp provides the Trend Monitoring feature to create trend charts.

Trend charts are made of trend metric values logged over time, at analysis time. More than 50 trend metrics are available by default, and it is easy to create your own.

With this trend chart we can monitor the evolution of the PyLint issues:

3) Generate custom HTML report

Scanyp makes it possible to append extra sections to the HTML report that list the results of some CQLinq queries.
In the CQLinq Query Explorer panel, a CQLinq group that is included in the report is bordered with an orange rectangle.

And in the HTML report these added sections are accessible from the menu:

4) Integrate PyLint into the build process

A Quality Gate is a check on a code quality fact that must be enforced before releasing and, possibly, before committing to source control. A Quality Gate can be seen as a PASS/FAIL criterion for software quality.

Quality Gates can be used to fail the build when certain criteria are not met.

A Quality Gate is a LINQ query that can be easily created, edited, and customized. For example, if you wish to enforce a certain amount of code coverage through a Quality Gate, you can just write:

Scanyp proposes a dozen default Quality Gates related to measures such as the amount of technical debt, code coverage, or the number of issues with a particular severity.


During the build process, when a Quality Gate fails, the Scanyp.Console.exe process returns a non-zero exit code. This behavior can be used to break the build if a critical rule is violated.
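The PASS/FAIL idea can be sketched independently of Scanyp. The Python snippet below is only an illustration of a gate that maps issue counts to an exit code; the function names, severities and thresholds are hypothetical and do not reflect how Scanyp implements its gates.

```python
def quality_gate(issue_counts, limits):
    """Return the severities whose issue count exceeds the allowed limit."""
    return [
        severity
        for severity, limit in limits.items()
        if issue_counts.get(severity, 0) > limit
    ]


def exit_code_for(failures):
    """Map gate failures to a process exit code: 0 = pass, 1 = fail."""
    return 1 if failures else 0


# Hypothetical counts from an analysis run, and hypothetical thresholds.
counts = {"error": 2, "warning": 35}
limits = {"error": 0, "warning": 50}

failed = quality_gate(counts, limits)
code = exit_code_for(failed)
print("failed gates:", failed, "exit code:", code)
# A build script would typically finish with sys.exit(code),
# so the CI server stops the pipeline when the code is non-zero.
```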

Summary

Scanyp is open to other static analysis tools, and you can also easily plug in your own customized tool. This way you can use all the Scanyp features to better explore the results from well-known Python static analysis tools.

Clean Python code: NumPy case study

Every project has its own style guide. Some managers choose basic coding rules, others prefer very advanced ones, and many projects have no coding rules at all, so each developer uses their own style.

It is much easier to understand a large codebase when all the source code is in a consistent style.

Many resources discuss which coding rules to adopt. We can learn good coding rules from:

  • Books and magazines.
  • Websites.
  • Colleagues.
  • Training courses.

Another, more interesting approach is to study a well-known and mature open source project to discover how its developers implement their code. This case study explores the NumPy library and uncovers some facts about its source code.

NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.

Let’s begin with a small function from the NumPy code:

Here are some facts about this function:

  • The function has few local variables.
  • The function has few parameters.
  • The function exits as early as possible.
  • The variable naming is easy to understand.
  • The method is short.
  • No extra comments in the body.
  • The errors are managed by exceptions.
  • If statement conditions are easy to understand.
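To make these traits concrete, here is a small hypothetical Python function written for this article. It is not taken from NumPy; it only illustrates few parameters, early exits, errors reported through exceptions, and simple conditions.

```python
def to_positive_axis(axis, ndim):
    """Convert a possibly negative axis index into a non-negative one."""
    if not isinstance(axis, int):
        raise TypeError("axis must be an integer")
    if not -ndim <= axis < ndim:
        raise ValueError(f"axis {axis} is out of bounds for {ndim} dimensions")
    if axis >= 0:
        return axis
    return axis + ndim


print(to_positive_axis(-1, 3))
print(to_positive_axis(1, 3))
```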

But how many methods are clean like this one?

To answer this question, let’s use Scanyp, a static analysis tool, to dig deep into the NumPy source code.

Scanyp uses CQLinq to query the code like a database, which makes it easy to get interesting information about the code in order to improve it. Let’s search for complex methods using CQLinq:

Only 171 out of 10,972 methods could be considered complex, meaning that they are either large or have a high cyclomatic complexity. We can also discover the complex methods visually, using the treemap view.
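For readers who want to try a similar search without Scanyp, the following Python sketch uses the standard `ast` module to flag functions that are long or have many decision points. It is a rough approximation of cyclomatic complexity, not the metric Scanyp computes, and the thresholds are arbitrary assumptions.

```python
import ast

# Node types treated as decision points in this approximation.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(func_node):
    """Approximate complexity: 1 + number of decision points."""
    complexity = 1
    for node in ast.walk(func_node):
        if isinstance(node, BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two extra paths.
            complexity += len(node.values) - 1
    return complexity


def complex_functions(source, max_complexity=10, max_lines=30):
    """Return (name, complexity, line count) for functions over either limit."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            score = cyclomatic_complexity(node)
            lines = node.end_lineno - node.lineno + 1
            if score > max_complexity or lines > max_lines:
                flagged.append((node.name, score, lines))
    return flagged


# Hypothetical source to analyze.
SAMPLE = '''
def simple(x):
    return x + 1

def branchy(a, b, c):
    if a and b:
        return 1
    for item in c:
        if item:
            return 2
    return 3
'''

report = complex_functions(SAMPLE, max_complexity=3)
print(report)
```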

Treemapping is a method for displaying tree-structured data by using nested rectangles. Treemap rectangles represent code elements. The size of a unit rectangle is proportional to the number of Lines of Code and the color is proportional to its complexity.

In the case of NumPy, the red rectangles are the complex methods.

Functions with many parameters

A function with many parameters can be painful to call, and it becomes harder to understand and maintain. An alternative is to provide a structure dedicated to passing the arguments.
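One way to build such a structure in Python is a dataclass. The sketch below is a hypothetical example (the `PlotOptions` class and `describe_plot` function are invented for illustration) in which five settings travel as one object instead of five separate parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlotOptions:
    """Groups settings that would otherwise be separate parameters."""
    title: str = ""
    width: int = 640
    height: int = 480
    color: str = "blue"
    show_grid: bool = False


def describe_plot(data, options=PlotOptions()):
    """Return a one-line summary; stands in for a real rendering function."""
    grid = "grid" if options.show_grid else "no grid"
    return (
        f"{options.title or 'untitled'}: {len(data)} points, "
        f"{options.width}x{options.height}, {options.color}, {grid}"
    )


summary = describe_plot([1, 2, 3], PlotOptions(title="Demo", show_grid=True))
print(summary)
```

Because the dataclass is frozen, sharing one default instance between calls is safe, and callers only name the settings they want to change.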

Functions with many local variables

Methods where NbVariables is higher than 8 are hard to understand and maintain. Methods where NbVariables is higher than 15 are extremely complex and should be split into smaller methods (unless they are automatically generated by a tool).

Too many boolean expressions in if statement

When an if condition has many boolean expressions, the test becomes hard to understand and maintain. This issue is detected by PyLint. Indeed, besides the issues detected by Scanyp, we can import the issues from any other static analysis tool and benefit from the code query language to query these external issues.

Scanyp embeds PyLint out of the box and reports all its issues; here’s the result for the “Too many boolean expressions in if statement” rule.

Only 3 conditions have this issue, including this one:
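For readers without the source at hand, here is a hypothetical condition of the same kind, not taken from NumPy, together with one possible way to untangle it. The first version packs six boolean operands into a single test, which is above the five that PyLint's `max-bool-expr` option allows by default; the second names the intermediate conditions.

```python
def can_checkout_original(user, cart):
    # Six boolean operands in one condition: hard to read at a glance.
    if (user["active"] and not user["banned"] and cart["items"]
            and cart["total"] > 0
            and (user["verified"] or cart["total"] < 100)):
        return True
    return False


def can_checkout(user, cart):
    """Same decision, split into named intermediate conditions."""
    account_ok = user["active"] and not user["banned"]
    cart_ok = bool(cart["items"]) and cart["total"] > 0
    payment_ok = user["verified"] or cart["total"] < 100
    return account_ok and cart_ok and payment_ok


# Hypothetical sample data.
user = {"active": True, "banned": False, "verified": False}
small_cart = {"items": ["book"], "total": 50}
large_cart = {"items": ["laptop"], "total": 150}

print(can_checkout(user, small_cart), can_checkout(user, large_cart))
```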

Exception Handling

Exceptions provide a powerful way to separate the details of what to do when something out of the ordinary happens from the main logic of a program. However, using exceptions incorrectly can introduce bugs in your code.

NumPy satisfies almost all the exception handling best practices; only a few issues remain, specifically for the “Catching too general exception” rule.
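The following hypothetical Python example, unrelated to NumPy's code, shows what this rule is about. The broad handler swallows every error, including a programming mistake such as passing `None`, while the narrow handler only covers the failure that is actually expected.

```python
import json


def load_config_broad(text):
    """Flagged style: catches every Exception, including programming errors."""
    try:
        return json.loads(text)
    except Exception:  # too general: hides unrelated failures
        return {}


def load_config(text):
    """Catch only the failure expected from malformed JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return {}


print(load_config('{"debug": true}'))
print(load_config("not json"))
# load_config(None) raises TypeError, which exposes the caller's bug,
# whereas load_config_broad(None) silently returns {}.
```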

Statement implementation issues

To keep code clean, there are many best practices concerning how statements are written. PyLint provides many useful rules to avoid bad implementations.

In the case of NumPy, only a few issues remain, and they can be resolved quickly:

File formatting

A well-formatted file helps developers easily understand and maintain an existing code base.

In NumPy, only some indentation and spacing issues are present in the source files, but nothing serious.

Design implementation

Classes with too many methods

Classes with too many methods may be trying to do too much, or in any case may be more difficult to maintain. Twenty is a reasonable threshold for investigation.

In the case of NumPy, almost all the classes concerned by this issue are test classes, which is not a big problem.

Classes with too many fields

As with methods, having too many fields can indicate a maintenance issue. Twenty is a reasonable threshold for investigation.

And as with methods, almost all the classes concerned by this issue are test classes.

Inheritance: Too many ancestors.

High depth of inheritance indicates a more complex object hierarchy, and the more unique types a class references, the less stable it is, since any changes to any of these referenced types can break the class in question.

Only 8 classes are concerned by this issue.
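The number of ancestors of a class can be inspected directly in Python through its method resolution order. The sketch below uses a hypothetical class hierarchy and counts every class in `__mro__` except the class itself and `object`; PyLint's own counting may differ in detail.

```python
def ancestor_count(cls):
    """Number of classes `cls` inherits from, excluding itself and `object`."""
    return len([base for base in cls.__mro__[1:] if base is not object])


# Hypothetical hierarchy, four levels deep.
class Shape:
    pass


class Polygon(Shape):
    pass


class Rectangle(Polygon):
    pass


class Square(Rectangle):
    pass


print(ancestor_count(Square))
```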

Cohesion

The single responsibility principle states that a class should not have more than one reason to change. Such a class is said to be cohesive. A high LCOM value generally pinpoints a poorly cohesive class. There are several LCOM metrics. The LCOM takes its values in the range [0-1]. The LCOM HS (HS stands for Henderson-Sellers) takes its values in the range [0-2]. A LCOM HS value higher than 1 should be considered alarming. Here is how to compute the LCOM metrics:

LCOM = 1 - sum(MF) / (M*F)
LCOM HS = (M - sum(MF)/F) / (M - 1)

Where:

  • M is the number of methods in class (both static and instance methods are counted, it includes also constructors, properties getters/setters, events add/remove methods).
  • F is the number of instance fields in the class.
  • MF is the number of methods of the class accessing a particular instance field.
  • Sum(MF) is the sum of MF over all instance fields of the class.

The underlying idea behind these formulas can be stated as follows: a class is utterly cohesive if all its methods use all its instance fields, which means that sum(MF) = M*F, and then LCOM = 0 and LCOM HS = 0.
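As a worked example, the Python sketch below computes both metrics from M and the list of MF values, taking LCOM as 1 - sum(MF)/(M*F) and LCOM HS as (M - sum(MF)/F) divided by (M - 1), the usual Henderson-Sellers form. The function name and the sample figures are hypothetical.

```python
def lcom_metrics(method_count, field_access_counts):
    """Return (LCOM, LCOM HS).

    method_count: M, the number of methods in the class.
    field_access_counts: one MF value per instance field, i.e. how many
        methods access that field; its length is F.
    """
    m = method_count
    f = len(field_access_counts)
    if m < 2 or f == 0:
        raise ValueError("need at least two methods and one instance field")
    total = sum(field_access_counts)  # sum(MF)
    lcom = 1 - total / (m * f)
    lcom_hs = (m - total / f) / (m - 1)
    return lcom, lcom_hs


# 3 methods, 2 fields, every method uses every field: fully cohesive.
cohesive = lcom_metrics(3, [3, 3])
# 4 methods, 2 fields, each field used by a single method.
scattered = lcom_metrics(4, [1, 1])
print(cohesive, scattered)
```

In the second case sum(MF) = 2, so LCOM = 1 - 2/8 = 0.75 and LCOM HS = (4 - 1)/3 = 1, right at the alarming threshold.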

Only a few classes could be considered non-cohesive.

PEP8 conformance

PEP 8 is a document that provides guidelines and best practices on how to write Python code. It was written in 2001 by Guido van Rossum, Barry Warsaw, and Nick Coghlan. The primary focus of PEP 8 is to improve the readability and consistency of Python code.

Apart from the naming convention rule, where NumPy is not compatible with PEP 8, it satisfies all the other rules. Only a few issues remain concerning block comments.

Conclusion:

Exploring well-known open source projects is always a good way to improve your programming skills. There is no need to download and build the project; you can simply browse the code on GitHub. NumPy is one of the mature Python libraries with clean code, and I encourage any Python developer to take a look inside its source code.