Skip to content

HTS Bioinf - Software development and review

This document describes process, routines, and setup for development work.

Requests for modifications and issue-tracking

Day to day requests for modifications to existing pipelines and new "small" feature requests are continuously gathered and stored as issues in the issue-tracking system by all bioinformaticians. External user requests (from lab and doctors) outside meetings should be sent to the mailing list bioinf-help and should follow a template if they require a project. The bioinformatician on helpdesk duty has the final responsibility for issues being created from requests received by email. Issue tracking is described in a separate document.

Development schedule and releases

Issues are prioritized by the bioinformatics group coordinator or others having a "product owner" role, by ranking them in the relevant issue boards. Some projects follow a Kanban style of work, whereas development on other projects can be split into regular iterations where a subset of issues are selected for focused work.

Each system/repo has a designated release responsible who plans and performs a release. The system owner usually has the release responsibility as well.

Programming languages and settings

Bioinformaticians primarily standardise on the following programming languages:

  • Python, Nextflow and Bash for apps/pipeline/scripts.
  • JavaScript and TypeScript with the React framework for web development.

Python is used as the general purpose language. We use Pytest for unit testing. Ruff is used for (flake8) linting and (Black) formatting with Ruff.lint.args specified to --line-length=100 ( VSCode extension). Poetry is used to handle Python package dependencies and virtual environments. For Bash, we run ShellCheck, for which there is a VSCode extension. Setup for web development is more comprehensive and is described in detail elsewhere.

Code is made to run in a Linux environment, but with an attempt to also accommodate MacOS users who have Coreutils installed. VSCode is the recommended code editor.

Containerization

Containers (Docker, Podman, and Singularity), are heavily used for both development and production work to create isolated environments with all dependencies. To simplify development, repositories are usually set up with a devcontainer.

  • Docker is used in all projects for containerizing all project dependencies
  • docker compose should be used for orchestrating multi-container apps (even single-container apps with volumes or ports).
  • VSCode devcontainer should accompany every project for a standardized dev environment

Every project should have a:

  • docker-compose.yml-file specifying ports/volumes/environments for each service
  • Dockerfile in the root of the project if it is a single-container project
  • Dockerfile for each service (unless using a pre-built image) if it is a multi-container project
  • devcontainer/.devcontainer.json-file specifying the devcontainer setup
  • .env-file for default values used in docker-compose.yml (e.g. local ports, local bind mounts etc)

Docs and diagrams

To generate documentation we use Markdown. Diagrams, e.g. C4 architecture diagrams, are stored as code in the form of Mermaid.js or PlantUML code, or as diagrams.net files.

Templates and help

The template Project Skeleton has an educational walk-through of how we set up projects that use Python, Docker, devcontainers for VSCode, a simple FASTAPI/React app with BDD tests, and a command line interface (CLI). This project acts as a reference and should be studied by all developers.

Github Co-Pilot is a highly recommended tool to assist the development process. To use it, you need to purchase an individual license.

Version control and branching

Both code and reference data are version-controlled using the git version control system. The code is stored in different repositories and Gitlab.com is used as a remote/central repository and code control system. Three Gitlab organisations are currently in use:

  • DPIPE for public open source pipelines and systems
  • ELLA for public ELLA and ELLA-related projects
  • OUSAMG for a small number of private or legacy projects

Access to Gitlab is controlled by the System administrator or the group coordinator. Core branches (main/master, and dev) are protected. See below for more in-depth information on access control, and the document "How to start a new Gitlab project repo" for more details on the configuration of projects.

All development is done in separate git branches, using a "Gitflow" branching model. When a developer thinks that a feature branch is ready, and the branch has been (interactively) rebased against an updated dev branch, the merge request against dev is marked as ready in Gitlab, and, following a passing code review by another developer, the dev branch is updated by merging in changes from the feature branch with a merge commit. Protected branches and main in particular should as a rule not be updated directly, but only through merge requests from dev, release or hotfix branches. Changes to protected branches require an approval in a Gitlab merge request by another team member with permission to approve. Approval must be done by someone having the role of code owner. The approval serves as documentation of the review.

Git commit messages

Take the time to write proper commit messages! See this guide and write a concise summary line (max 80 chars). If the commit is relatively small and atomic, it should not prove too difficult to write a specific message. Feel free to add more info after a blank line.

The first commit message in a new branch may mention the issue number. Then it is easy to see in the git history which commits belong together from the merge commit and to what issue they belong (git log --oneline --graph --decorate)

Merge requests (MRs)

As a rule, every MR should be associated to an issue. The problem should be described in the issue, the proposed solution should be described in the MR. The issue should be created first. The branch addressing it and its associated MR can be created directly from the issue in the Gitlab UI.

Before filing an MR, consider interactively rebasing your feature branch to adjust your commits and messages and make them easier to read for the prospective reviewer. You can then squash some commits, split commits, re-stage changes that naturally belong together, and add better messages. Irrelevant commits (fix typo etc) should be squashed.

MRs should be as small as possible and change "one coherent thing", i.e. they should not be a random lump of massive changes. This will make them much easier to review. And the reviewer can then go commit by commit and follow the logic of the changes. It is allowed to create more than one MR for a single issue.

MR description templates should be used. The template files themselves can be found in the .gitlab directories of established projects. Suggestions for improvements are welcome.

MR reviews are often much faster and easier to do if the author sits down with the reviewer and explains the changes. Use the opportunity to learn from each other.

Production code and versioning

Code running in production should be always identical to tagged releases from the main/master branch, and typically generated from those using git archive. If critical modifications are needed outside of the planned releases, a maintenance branch is forked directly off of the tagged commit in main/master. As soon as the fix is complete, the maintenance branch should be merged into both main/master and dev, and main/master should be tagged with an updated version number. We follow semantic version naming (i.e. major.minor.patch).

Generally, all new features and changes shall be tested, documented and approved. Before deploying a release, an "endringskontroll" must be compiled with a description of the changes, how testing was done, which data were used and who approved the test results. Only after the "endringskontroll" has been approved, the deployment can proceed. Write enough info in the issue descriptions so that the "endringskontroll" are easier to write. If sensitive information is involved, create a directory on TSD/NSC (on a common, known location). Put date and a short description in the directory name, like 20210820-pipeline-release-3.1. The documentation should be specific enough to allow for easy re-testing.

Automated testing

Repositories are typically configured with automated tests that run on every git push or merge. To standardise test environments, these tests run in docker containers that are built as part of the tests together with the code artifacts. Tests can also be run manually via the Gitlab interface or via make commands.

(Regression tests of the pipelines with static control samples (i.e. the same data every time) should be run before deployment in production. Regression tests with re-sequenced control samples (i.e. new data) are run from time to time).

Access to projects

Public projects can be read by anyone, but any modification to code or issues requires membership in the group or project. Only users with Owner role can add members to groups, whereas Maintainers can add members directly to individual projects. Only developers at OUSAMG or trusted collaborators should have membership. This should normally be controlled at group level, with users given Developer roles. Alternatively, direct access to a single project can be granted if more relevant. Note that as an additional security measure, the main/master and dev branches are protected, so that MR approval by suitable approvers is required. Extra scrutiny is required when deciding if someone should be given approver status.


Code review procedure

When code is ready for review, the author should remove the draft status of a merge request and notify the assigned reviewer. The reviewer proceeds to:

  • Read the corresponding issue(s) and understand the purpose of code.
  • Try to understand the overall meaning of the code by reading it.
  • Use the checklist below as a guide.
  • Make comments in Gitlab in the "Discussion", or make line comments on the "Changes" tab if applicable. Note that it is the person starting a discussion who normally should resolve the discussion.
  • Consider rewriting parts of the code in new commits, which they then add to the same feature branch.
  • Verify that any applicable automatic tests pass.
  • Approve the merge when the work is deemed to be of sufficient quality and ready.

The author usually merges the feature branch when the MR is approved.

Code review checklist

Use this checklist as a guide when reviewing code.

Plan

  • Is the work described properly in an issue in the issue-tracking system?
  • Have relevant coworkers been involved/informed from the start and the consequences of the changes been communicated?

Branches and commits

  • Is the development done in its own feature branch?
  • Is the branch preferably named starting with issue number, e.g. "768-rename-inputfile"?
  • If applicable, has the merge request template been used and filled out appropriately?
  • Is there a reasonably low number of changes in each commit?
  • Do the commits have describing commit messages? The preferred style is described in here.

Clarity and structure

  • Can you read the code and understand what it does within reasonable time?
  • Are comments necessary and not too many? The comments should document the thought behind the code, why the code is written how it is and the reasoning behind the choices made. The comments should normally not state what the code does (since this should really be transparent from the code itself), but when the developer thinks it is appropriate, an explanation of what the code does is welcome.
  • Are input and output described well?
  • Are names of methods and variables descriptive?
  • Is the size of modules/functions acceptable?
  • Is the code modular/re-usable?
  • Is the code non-redundant?
  • Is the code trimmed, dead/unused code removed?
  • Is the code located in the right location in the codebase?

Documentation

  • Are modules (files) described?
  • Are directories described with e.g. a README.md?
  • Will any procedures need to be modified as a result of this change? Is that work done or registered in the issue-tracking system?

Tests

  • Is the code testable?
  • Are testdata available? Have exisiting test data been used as far as possible?
  • Are there (unit) tests where applicable?
  • Are the tests run automatically if applicable, or is it easy to run them manually?
  • Has input with errors/edge-cases been tested?

Completeness and duplication

  • Does the code externalise configuration? Does it use the standard configuration files?
  • Are the remaining markers of incompleteness (TODO, FIXME, OPT) acceptable?
  • Is there any unnecessary duplication?

Other relevant docs: