Skip to content

HTS Bioinf - Development and review

This document describes process, routines, and setup for development work.

Requests for modifications and issue-tracking

Day to day requests for modifications to existing pipelines and new "small" feature requests are continuously gathered and stored as issues in the issue-tracking system by all bioinformaticians. External user requests (from lab and doctors) outside meetings should be sent to the mailing list diag-bioinf. The bioinformatics group coordinator has the final responsibility for issues being created from requests received by email. Issue tracking is described in a separate document.

Development schedule and releases

Issues are prioritized by the bioinformatics group coordinator or others having a "product owner" role, by ranking them in the relevant issue boards. Development is split into regular iterations where a subset of issues are selected for focused work.

The aim is to update pipelines and related code in regular releases. Each system/repo has a given release responsible who plan and perform a release. There is also an overall release-responsible bioinformatician (the release coordinator) who plans releases at a higher level, and releases affecting the pipeline in particular.

Programming languages, tools and settings

Bioinformaticians primarily standardise on the following programming languages:

  • Python, Nextflow and Bash for pipeline/scripts. R if need be.
  • JavaScript and TypeScript with the React framework for web development.

Python is used as the general purpose language, with Pytest used for unit tests, Black for formatting with --line-length=100, and ruff for linting (VSCode extension). Poetry is used to handle package dependencies and virtual environments. For Bash, we run ShellCheck, for which there is a VSCode extension. Setup for web development is more comprehensive and is described in detail elsewhere.

Code is made to run in a Linux environment, but with an attempt to also accommodate MacOS users who have Coreutils installed. Containers (Docker and Singularity) are heavily used for both development and production work. Typically, each repo will have a Dockerfile with instructions on how to generate containers suitable for the repo. Automation to generate the container and run tests are typically in a Makefile, and make can be triggered manually or from Gitlab's CI in .gitlab_ci.yml.

Larger projects typically have a devcontainer. In our setup we favor use of Visual Studio Code.

To generate documentation we use Markdown and diagrams (e.g. C4 architecture diagrams stored as code in the form of Mermaid.js or PlantUML code, or as diagrams.net files.

The template Project Skeleton has an educational walk-through of how we set up projects that use Python, Docker, devcontainers for VSCode, a simple FASTAPI/React app with BDD tests, and a command line interface (CLI). This project acts as a reference and should be studied by all developers.

Version control and branching

Both code and reference data are version-controlled using the git version control system. The code is stored in different repositories and Gitlab.com is used as a remote/central repository and code control system. Three Gitlab organisations are currently in use:

  • DPIPE for public open source pipelines and systems
  • ELLA for public ELLA and ELLA-related projects
  • OUSAMG for a small number of private projects

Access to Gitlab is controlled by the System administrator. Core branches (main/master, dev) are protected. See below for more in-depth information on access control, and the document "How to start a new Gitlab project repo" for more details on the configuration of projects.

All development is done in separate git branches, using a "Gitflow" branching model. When a developer thinks that a feature branch is ready and the branch has been rebased against an updated dev branch, a merge request against the latter is issued in Gitlab, and, following a passing code review by another developer, the dev branch is updated with changes from the feature branch. The dev branch should be "protected" and as a rule not updated directly, but only through merge requests. Changes to protected branches require an approval in a Gitlab merge request by another team member with permission to approve. This usually means that approval must be done by someone having the role of code owner. The approval serves as documentation of the review. The main/master branch will only be updated during new releases, by merges from the dev branch.

Git commit messages

Take the time to write proper commit messages! See this guide and write a concise summary line (max 80 chars). If the commit is relatively small and atomic, it should not prove too difficult to write a specific message. Feel free to add more info after a blank line.

The first commit message in a new branch should mention the issue number.

Merge requests (MRs)

As a rule, every MR should be associated to an issue. The problem should be described in the issue, the proposed solution should be described in the MR. The issue should be created first. The branch addressing it and its associated MR can be created directly from the issue in the Gitlab UI.

Before filing an MR, consider interactively rebasing your feature branch to adjust your commits and messages and make them easier to read for the prospective reviewer. You can then squash some commits, split commits, re-stage changes that naturally belong together, and add better messages.

MRs should be as small as possible and change "one coherent thing", i.e. they should not be a random lump of massive changes. This will make them much easier to review. It is allowed to create more than one MR for a single issue.

MR description templates should be used. The template files themselves can be found in the .gitlab directories of established projects. Suggestions for improvements are welcome.

MR reviews are often much faster and easier to do if the author sits down with the reviewer and explains the changes. Use the opportunity to learn from each other.

Production code and versioning

Code running in production should be always identical to tagged releases from the main/master branch, and typically generated from those using git archive. If critical modifications are needed outside of the planned releases, a maintenance branch is forked directly off of the tagged commit in main/master. As soon as the fix is complete, the maintenance branch should be merged into both main/master and dev, and main/master should be tagged with an updated version number. We follow semantic version naming (i.e. major.minor.patch).

Generally, all new features and changes shall be tested, documented and approved. Before deploying a release, an "endringskontroll" must be compiled with a description of the changes, how testing was done, which data were used and who approved the test results. Only after the "endringskontroll" has been approved, the deployment can proceed. Write enough info in the issue descriptions so that the "endringskontroll" are easier to write. If sensitive information is involved, create a directory on TSD/NSC (on a common, known location). Put date and a short description in the directory name, like 20210820-pipeline-release-3.1. The documentation should be specific enough to allow for easy re-testing.

Automated testing

Repositories are typically configured with automated tests that run on every git push or merge. To standardise test environments, these tests run in docker containers that are built as part of the tests together with the code artifacts. Tests can also be run manually via the Gitlab interface or via make commands.

(Regression tests of the pipelines with static control samples (i.e. the same data every time) should be run before deployment in production. Regression tests with re-sequenced control samples (i.e. new data) are run from time to time).

Access to projects

Public projects can be read by anyone, but any modification to code or issues requires membership in the group or project. Only users with Owner role can add members to groups, whereas Maintainers can add members directly to individual projects. Only developers at OUSAMG or trusted collaborators should have membership. This should normally be controlled at group level, with users given Developer roles. Alternatively, direct access to a single project can be granted if more relevant. Note that as an additional security measure, the main/master and dev branches are protected, so that MR approval by suitable approvers is required. Extra scrutiny is required when deciding if someone should be given approver status.


Code review procedure

When code is ready for review, the author should remove the draft status of a merge request and notify the assigned reviewer. The reviewer proceeds to:

  • Read the corresponding issue(s) and understand the purpose of code.
  • Try to understand the overall meaning of the code by reading it.
  • Use the checklist below as a guide.
  • Make comments in Gitlab in the "Discussion", or make line comments on the "Changes" tab if applicable. Note that it is the person starting a discussion who normally should resolve the discussion.
  • Consider rewriting parts of the code in new commits, which they then add to the same feature branch.
  • Verify that any applicable automatic tests pass.
  • Approve the merge when the work is deemed to be of sufficient quality and ready.

The author usually merges the feature branch when the MR is approved.

Code review checklist

Use this checklist as a guide when reviewing code.

Plan

  • Is the work described properly in an issue in the issue-tracking system?
  • Have relevant coworkers been involved/informed from the start and the consequences of the changes been communicated?

Branches and commits

  • Is the development done in its own feature branch?
  • Is the branch preferably named starting with issue number, e.g. "768-rename-inputfile"?
  • If applicable, has the merge request template been used and filled out appropriately?
  • Is there a reasonably low number of changes in each commit?
  • Do the commits have describing commit messages? The preferred style is described in here.

Clarity and structure

  • Can you read the code and understand what it does within reasonable time?
  • Are comments necessary and not too many? The comments should document the thought behind the code, why the code is written how it is and the reasoning behind the choices made. The comments should normally not state what the code does (since this should really be transparent from the code itself), but when the developer thinks it is appropriate, an explanation of what the code does is welcome.
  • Are input and output described well?
  • Are names of methods and variables descriptive?
  • Is the size of modules/functions acceptable?
  • Is the code modular/re-usable?
  • Is the code non-redundant?
  • Is the code trimmed, dead/unused code removed?
  • Is the code located in the right location in the codebase?

Documentation

  • Are modules (files) described?
  • Are directories described with e.g. a README.md?
  • Will any procedures need to be modified as a result of this change? Is that work done or registered in the issue-tracking system?

Tests

  • Is the code testable?
  • Are testdata available? Have exisiting test data been used as far as possible?
  • Are there (unit) tests where applicable?
  • Are the tests run automatically if applicable, or is it easy to run them manually?
  • Has input with errors/edge-cases been tested?

Completeness and duplication

  • Does the code externalise configuration? Does it use the standard configuration files?
  • Are the remaining markers of incompleteness (TODO, FIXME, OPT) acceptable?
  • Is there any unnecessary duplication?

Other relevant docs: