
HTS Bioinf - Infrastructure

This document describes the different IT infrastructures used by bioinformaticians, who administers them, how they are managed, and briefly what each infrastructure is used for.

The work of diagnostic bioinformaticians (bioinf) is conducted on three different infrastructures, accessed from personal computers:

  • Norwegian Sequencing Center (NSC/NorSeq) - the joint infrastructure for research and diagnostics in building 25
  • Tjenester for Sensitive Data (TSD) - the compute cluster for sensitive data at the University of Oslo (UiO)
  • Development servers - both physical and cloud-based servers

NSC machines and network

NSC manages the server machines where sequencing data are initially stored and pre-processed after sequencing.

The servers used by diagnostic bioinformaticians are beta, sleipnir, diag-executor, diag-webui and diag-db. NSC also has other servers used for pre-processing (de-multiplexing and FastQC) of sequencing data, but these are not used directly by diagnostic bioinformaticians.

  • beta is used for exporting data from the LIMS (Clarity) and for the lab to get access to sleipnir.
  • sleipnir is used to transfer data (sequencing and other) from the NSC network to TSD and from TSD to NSC.
  • The VMs diag-executor, diag-webui and diag-db are used for running pipelines on NSC.

The system administrator of all NSC machines is Pål Marius Bjørnstad, with Arvind Sundaram as backup administrator. The NSC system administrators are responsible for keeping the NSC machines running and updated, for data backup, user management, etc. There is a simple data processor agreement between NSC and HTS diagnostics.

The NSC infrastructure also includes the servers hosting the Clarity LIMS. Diagnostic bioinformaticians do not operate on these machines directly but communicate with them from beta via APIs.

Diagnostic sequencing data are pre-processed and quality assured in automatic pipelines before being delivered to a restricted file area accessible only to diagnostic bioinformaticians (and system administrators). When connecting a personal computer to the NSC network by cable, all other network connections (wifi and extra physical network ports, if any) must be turned off.

TSD

Tjenester for Sensitive Data (TSD), a secure HPC environment for sensitive data at UiO, is the main IT infrastructure used for storage, processing, and analysis of sequencing data. Diagnostic data are processed exclusively in project p22. The following virtual machines (VMs) and servers in project p22 at TSD are used by AMG diagnostic personnel:

Virtual machines

  • p22-rhel8-01-pool: Standard Linux login server
  • p22-anno-01: Runs annotation services
  • p22-app-01: Runs other applications, e.g. MegaQC
  • p22-cluster-sync: Runs filelock-exporter, which transfers sequencing data from the file lock (on /tsd/p22/data/durable) to /cluster
  • p22-durable-sync: Transfers sequencing and result data from /cluster to /tsd/p22/data/durable
  • p22-dbpg-prod01 (p22-dbpg02.tsd.usit.no): Runs database server for ELLA
  • p22-dbpg-prod02 (p22-dbpg03.tsd.usit.no): Runs database server for the pipeline executor and others
  • p22-ella-01: Runs ELLA production
  • p22-ella-dev: Runs ELLA development/validation versions
  • p22-ella-stage: Runs ELLA staging (release testing)
  • p22-submit, p22-submit2: Run pipeline executor (which in turn starts the Nextflow processes responsible for submitting jobs to the computing cluster) and pipeline monitor WebUI
  • p22-submit-dev: Backup for p22-submit, p22-submit2 and p22-cluster-sync; also used for pipeline testing
  • p22-win01: Windows login, failover
  • p22-win02: Windows login used by lab engineers from OUS, mostly EKG
  • p22-win03: Windows login used by lab engineers from OUS, mostly EGG, ELL

The Postgres VMs are managed as a service by USIT and only USIT can access them.

Database naming

The databases in TSD run on the database VMs listed above. Note that the VM names can be misleading: a VM with "prod" in its name also hosts staging, test, and validation databases, and there can be several databases per VM.

  • DBs for ELLA: vardb{, _staging, _test, _validation} run on p22-dbpg-prod01 VM.
  • DBs for MegaQC: p22_megaqc{, _validation} run on p22-dbpg-prod02 VM.
  • DBs for vcpipe: vcpipe{, _staging, _test} run on p22-dbpg-prod02 VM.
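The brace notation in the list above expands to one database name per suffix. As a minimal sketch of that mapping (not an existing tool; the helper names expand and vm_for_database are hypothetical and used only for illustration), the database-to-VM lookup could be written as:

```python
# Illustrative lookup of which database VM hosts which database,
# built only from the naming list above. The helper names are hypothetical.

def expand(base: str, suffixes: list[str]) -> list[str]:
    """Expand a base name with each suffix, e.g. vardb + _test -> vardb_test."""
    return [base + suffix for suffix in suffixes]


DATABASES_BY_VM = {
    # ELLA databases
    "p22-dbpg-prod01": expand("vardb", ["", "_staging", "_test", "_validation"]),
    # MegaQC and vcpipe databases share one VM
    "p22-dbpg-prod02": (
        expand("p22_megaqc", ["", "_validation"])
        + expand("vcpipe", ["", "_staging", "_test"])
    ),
}


def vm_for_database(name: str) -> str:
    """Return the VM hosting the given database, or raise KeyError if unknown."""
    for vm, databases in DATABASES_BY_VM.items():
        if name in databases:
            return vm
    raise KeyError(f"unknown database: {name}")


if __name__ == "__main__":
    for vm, databases in DATABASES_BY_VM.items():
        print(vm, "->", ", ".join(databases))
```

Keeping the mapping in one place makes it explicit that, for example, vardb_test runs on a VM named "prod".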

Administration and access

All machines are administered by the TSD system administrators at USIT. Any requests should go to TSD Drift (tsd-drift@usit.uio.no). In case of unexpected downtime with significant impact (e.g. long duration), the bioinformatic coordinator should contact the leader of TSD directly by phone or email to expedite ticket handling.

The bioinformatic coordinator is the TSD administrator of the p22 TSD project and is responsible for user management. New users outside the diagnostic bioinformatics group must be approved by the unit managers at AMG. The unit managers should also report when user accounts should be terminated. Additionally, the bioinformatic coordinator should periodically go through the list of users with the unit managers and remove any inactive user accounts.

Development servers

Bioinformaticians generally work on personal computers but rely on development servers for heavier computations, such as running and testing variant calling pipelines, and for storing large datasets. The following servers and cloud services are used for development:

  • focus.uio.no : Development server. Users can log in to develop and test.
  • tomato.uio.no : CI/testing server. Runs tests/jobs triggered by GitLab. Only admins have access, for maintenance.
  • gitlab.com: Git server.
  • digitalocean.com: Mainly storage of large files and test runners.

The servers focus and tomato are located in the server room in the basement of building 25. The main system administrator of these machines is Erik Severinsen, who is also responsible for GitLab. DigitalOcean is used for storing various test and reference data, such as the reference genome, and is administered by the bioinformatic coordinator.

The issue-tracking system Jira runs in the cloud, and its infrastructure is administered by Atlassian; the Jira projects belonging to AMG, however, are administered by the bioinformatic coordinator.
