HTS Bioinf - Infrastructure
This document describes the different IT infrastructures used by bioinformaticians, who administers them, how they are managed, and briefly what each infrastructure is used for.
The work of diagnostic bioinformaticians (bioinf) is conducted on three different infrastructures, accessed from personal computers:
- Norwegian Sequencing Center (NSC/NorSeq) - the joint infrastructure for research and diagnostics in building 25
- Tjeneste for sensitive data (TSD) - the compute cluster for sensitive data at the University of Oslo (UiO)
- Development servers - both physical and cloud-based servers
NSC machines and network
NSC manages the server machines where sequencing data are initially stored and pre-processed after sequencing.
The servers used by diagnostic bioinformaticians are gdx-login, sleipnir, gdx-executor, gdx-webui and gdx-db.
NSC also has other servers used for preprocessing (demultiplexing and quality control [FastQC]) of sequencing data, but these are not directly used by diagnostic bioinformaticians.
- gdx-login is used for exporting data from the LIMS (Clarity) and for the lab to get access to sleipnir.
- sleipnir is used to transfer data (sequencing and other) from the NSC network to TSD and from TSD to NSC.
- The VMs gdx-executor, gdx-webui and gdx-db are used for running pipelines on NSC.
The system administrator of all NSC machines is Pål Marius Bjørnstad; Arvind Sundaram is the backup administrator. NSC system administrators are responsible for keeping NSC machines running and updated, for data backup, user management, etc. There is a simple data processor agreement between NSC and HTS diagnostics.
The NSC infrastructure also includes the servers hosting the Clarity LIMS. Diagnostic bioinformaticians do not operate on these machines directly but communicate with them from gdx-login via APIs.
Diagnostic sequencing data are pre-processed and quality assured in automatic pipelines before being delivered to a restricted file area only accessible by diagnostic bioinformaticians (and system administrators). When connecting to the NSC network from a personal computer via cable, any other network connections (Wi-Fi and extra physical network ports, if any) must be turned off.
TSD
Tjeneste for sensitive data (TSD), a secure HPC environment for sensitive data at UiO, is the main IT infrastructure used for storage, processing, and analysis of sequencing data. Diagnostic data are processed exclusively in project p22. The following virtual machines (VMs) and servers in project p22 at TSD are used by AMG diagnostic personnel:
Virtual machines
- p22-rhel8-01-pool: Standard Linux login server. Not to be used for expensive computations.
- p22-hpc-01, and p22-hpc-02 as backup: Run the pipeline executor (which in turn starts the Nextflow processes responsible for submitting jobs to the computing cluster).
- p22-hpc-02: Runs filelock-exporter, which transfers sequencing data from the file lock, and the pipeline monitor webUI.
- p22-hpc-03: Used for testing and development. Also backup for p22-hpc-01 and p22-hpc-02.
- p22-anno-01: Runs annotation services.
- p22-app-01: General purpose VM for running services with low requirements to CPU/RAM, and services that are not deemed essential for production. Can also be used for running one-off tasks/chores using Singularity.
- p22-podman-01: General purpose VM for running anything Podman related. Useful as a testing ground for Podman work, and for running one-off tasks/chores using Podman.
- p22-podman-02: Reserved for running ELLA production using Podman/docker compose.
- p22-podman-03: Reserved for running ELLA staging using Podman/docker compose.
- p22-ella-01: Runs ELLA production.
- p22-ella-fo-01: Failover VM for ELLA production.
- p22-ella-dev: Runs ELLA development/validation versions.
- p22-ella-stage: Runs ELLA staging (release testing).
- p22-dbpg-prod01 (p22-dbpg02.tsd.usit.no): Runs the database server for ELLA.
- p22-dbpg-prod02 (p22-dbpg03.tsd.usit.no): Runs the database server for the pipeline executor and others.
- p22-win01: Windows login, failover.
- p22-win02: Windows login used by lab engineers from OUS, mostly EKG.
- p22-win03: Windows login used by lab engineers from OUS, mostly EGG, ELL.
The Postgres VMs are managed as a service by USIT and only USIT can access them.
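The backup relationships among the p22-hpc VMs in the list above imply a preference order for where the pipeline executor runs. The following is only an illustrative sketch of that order: the function name pick_executor_host and the idea of passing in a set of reachable hosts are assumptions made for the example, and nothing here describes an existing automated failover mechanism.

```python
# Preference order read from the VM list: p22-hpc-01 is primary,
# p22-hpc-02 is its backup, and p22-hpc-03 is backup for both.
EXECUTOR_HOSTS = ("p22-hpc-01", "p22-hpc-02", "p22-hpc-03")


def pick_executor_host(available):
    """Return the most preferred executor VM found in `available`.

    `available` is any collection of host names considered reachable;
    how reachability is determined is outside this sketch (hypothetical).
    """
    for host in EXECUTOR_HOSTS:
        if host in available:
            return host
    raise RuntimeError("no executor VM available")
```

For example, if p22-hpc-01 is down, calling pick_executor_host({"p22-hpc-02", "p22-hpc-03"}) selects p22-hpc-02.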
http://p22-app-01:8080 has an overview of all webservices.
If you are using the VMware Horizon client and get a blank/white screen after opening a session, you may need to disable VMware Blast. In the main window, go to File -> Configure VMware Blast. Uncheck anything that is checked and try reconnecting.
Database naming
The databases in TSD run on the database VMs listed above. Note that their names can be misleading. There can be several databases per VM.
- DBs for ELLA: vardb{, _staging, _test, _validation} run on the p22-dbpg-prod01 VM.
- DBs for MegaQC: p22_megaqc{, _validation} run on the p22-dbpg-prod02 VM.
- DBs for vcpipe: vcpipe{, _staging, _test} run on the p22-dbpg-prod02 VM.
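The brace notation in the list above is shorthand for a base name plus optional suffixes. As a minimal sketch, the full database names and the VM each one runs on can be spelled out as follows. The table and the helper names databases_by_vm and vm_for are hypothetical and written only for this illustration; they are not part of any existing tool.

```python
# Base names and suffixes copied from the list above (illustrative table only).
DB_SUFFIXES = {
    "vardb": ("", "_staging", "_test", "_validation"),
    "p22_megaqc": ("", "_validation"),
    "vcpipe": ("", "_staging", "_test"),
}

# Database VM hosting each family of databases, as stated in the list above.
DB_HOST = {
    "vardb": "p22-dbpg-prod01",
    "p22_megaqc": "p22-dbpg-prod02",
    "vcpipe": "p22-dbpg-prod02",
}


def databases_by_vm():
    """Expand the base names and group the full database names per VM."""
    result = {}
    for base, suffixes in DB_SUFFIXES.items():
        vm = DB_HOST[base]
        result.setdefault(vm, []).extend(base + suffix for suffix in suffixes)
    return result


def vm_for(database):
    """Return the VM that hosts the given full database name."""
    for vm, names in databases_by_vm().items():
        if database in names:
            return vm
    raise KeyError(database)
```

Here vm_for("vardb_staging") gives p22-dbpg-prod01, which shows that a VM with "prod" in its name also hosts staging and test databases.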
Administration and access
All machines are administered by the system administrators of TSD at USIT. Any requests should go to TSD Drift (tsd-drift@usit.uio.no). In case of unexpected downtime with significant impact (long duration, etc.), the bioinformatic coordinator should contact the leader of TSD directly by phone or email to expedite the ticket handling.
The bioinformatic group coordinator is the TSD administrator of the p22 TSD project, and is responsible for user management. New users outside the diagnostic bioinformatics group must be approved by the unit managers at AMG. The unit managers should also report when user accounts should be terminated. Additionally, the bioinformatic coordinator should periodically go through the list of users with the unit managers and remove any inactive user accounts.
Development servers
Bioinformaticians generally use personal computers but rely on development servers for heavier computations, such as running and testing variant calling pipelines, and for storing large data sets. The following servers and cloud services are used for development:
- zed: Development and CI/testing server, a Hetzner Robot machine. Runs tests/jobs triggered by GitLab. Dedicated team members have admin access.
- gitlab.com: Git server.
- digitalocean.com: Mainly storage of large files and test runners.
DigitalOcean is used for storing various test and reference data, such as the reference genome, and is administered by the bioinformatic coordinator.
See also
- Storage and security of sensitive data - for how to handle sensitive data and backup routines
- Software administration - for who is responsible for keeping server software updated
- HTS - Overordnet beskrivelse av arbeidsflyt - for an overview of the processes and the dataflow from NSC to TSD