HTS Bioinf - Storage and security of sensitive data

This document describes security policies for the patient sensitive data from the HTS sequencers and the bioinformatic pipeline. It also describes the location of the data and how long they are stored.

Sensitive data

The sensitive data relevant for this procedure are:

  • the raw sequencing data of patient DNA (FASTQ files)
  • the processed sequencing data (BAM files)
  • the variants that the pipelines identify (VCF files)
  • NIPT raw sequencing data, analysis results and relevant lab data
  • the sample number itself (also known as sample ID), even in cases when no other information is connected to it

Data storage

Sensitive data are stored at the following locations:

  • NSC servers (diagnostic area)
  • TSD infrastructure at USIT, project p22
  • OUS network drive connected to hospital PCs under K:\Systemdata\MedGen\LAB\HTS\Tolkning HTS-data

Data security

In general, sensitive data must not be brought to a computer or hard drive outside the department. Storage and handling of sensitive data is restricted to the NSC servers and the TSD infrastructure, neither of which is connected to any outside network such as the Internet. The data are also stored on servers controlled by Sykehuspartner and connected to hospital PCs. Dedicated portable hard drives that store data in an encrypted, password-protected form may be used to transfer sensitive data between these servers and a local computer. The local computer must be offline (no active WiFi, Ethernet, or Bluetooth) for as long as the hard drive is connected.

Data permissions

In TSD, only p22 members can access data. Files and directories are given the following permissions:

  • Owners of files and directories have read, write and execute (rwx) permissions
  • Members of the group p22-diag-ous-bioinfo-group have the same (rwx) permissions
  • Others have no write permission on files or directories, only read and execute (rx)
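
The permission scheme above corresponds to mode 775 (rwxrwxr-x). The following is a minimal sketch of how a mode could be checked against it; the function name `permission_problems` is hypothetical and not part of any deployed tool, and the sketch does not verify that the file's group actually is p22-diag-ous-bioinfo-group.

```python
from __future__ import annotations

import stat


def permission_problems(mode: int) -> list[str]:
    """Return a list of deviations from the documented scheme for an st_mode value."""
    perms = stat.S_IMODE(mode)  # keep only the permission bits
    problems = []
    if perms & stat.S_IRWXU != stat.S_IRWXU:
        problems.append("owner lacks rwx")
    if perms & stat.S_IRWXG != stat.S_IRWXG:
        problems.append("group lacks rwx")
    if perms & stat.S_IWOTH:
        problems.append("others can write")
    return problems
```

A possible use is `permission_problems(os.stat(path).st_mode)`, where an empty list means no deviation was found.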

Data locations

Sequencing data in the form of FASTQ files are delivered by NSC bioinformaticians to a specific location in the NSC infrastructure delivery area, where HTS bioinformaticians have read access only. Diagnostic bioinformaticians transfer the files to the TSD infrastructure, project p22, according to the procedure "HTS Bioinf - Execution and monitoring of pipeline".

Paths on TSD

CLUSTERPROD=/cluster/projects/p22/production
DURABLE=/tsd/p22/data/durable

| Short-name | Path | Description |
|---|---|---|
| deploys | | |
| vcpipe | $CLUSTERPROD/sw/vcpipe/vcpipe/ | Deploy vcpipe code |
| vcpipe-refdata | $CLUSTERPROD/sw/vcpipe/vcpipe-refdata/ | Deploy vcpipe-refdata |
| vcpipe-testdata | $CLUSTERPROD/sw/vcpipe/vcpipe-testdata/ | Deploy vcpipe-testdata |
| amg | $PATHTOAMGREPO | Deploy custom code |
| tsd-import | $CLUSTERPROD/sw/tsd-import/ | Deploy tsd-import code |
| deployed sensitive-db | $CLUSTERPROD/sw/vcpipe/sensitive-db/ | Deploy sensitive-db data |
| tar.gz releases | $CLUSTERPROD/sw/archive/ | Releases are first copied here and deployed from here |
| source code | | |
| source sensitive-db | /cluster/projects/p22/dev/sw/sensitive-db/ | Version controlled (git) source data |
| production | | |
| samples | $CLUSTERPROD/samples | Production samples (input) |
| analyses | $CLUSTERPROD/analyses | Production analyses (output) |
| interpretations | $DURABLE/production/interpretations/ | Production interpretations (output) |
| EllA | $DURABLE/production/ella-prod/data/analyses | Production EllA (output) |
| archives (updated by TSD Cron job) | | |
| samples archive | /ess/p22/archive/production/samples/ | Samples archives |
| analyses archive | ${DURABLE}{2,3,4,5}/production/analyses/ | Analyses archives |
| software archive | ${DURABLE}{2,3,4,5}/production/sw/ | Software releases archives |
| cluster archive | ${DURABLE}2/production/cluster_backup/ | Cluster archives (all but samples, analyses, software) |
| NIPT archive | ${DURABLE}{2,3,4,5}/production/nipt/ | NIPT sequencing backup |
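
Several archive rows above use the suffix list `{2,3,4,5}` after the durable path. Assuming this is shell-style brace expansion (that is, it stands for the four directories durable2 to durable5), the concrete paths can be listed with a small sketch like the one below. The helper name `archive_dirs` is hypothetical.

```python
from __future__ import annotations

# Base path as defined for DURABLE in this document.
DURABLE = "/tsd/p22/data/durable"
# Assumption: "{2,3,4,5}" means one directory per listed suffix.
ARCHIVE_SUFFIXES = ("2", "3", "4", "5")


def archive_dirs(subdir: str) -> list[str]:
    """Expand the archive path for a production subdirectory into its concrete directories."""
    return [f"{DURABLE}{n}/production/{subdir}/" for n in ARCHIVE_SUFFIXES]
```

For example, `archive_dirs("analyses")` yields four paths, from /tsd/p22/data/durable2/production/analyses/ to /tsd/p22/data/durable5/production/analyses/.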

Duration of file storage

After delivery from NSC, the original files (binary base call files, BCL format) are kept on the NSC infrastructure for one month before being deleted. All other files are deleted from NSC as soon as they have been successfully transferred to TSD.

The following files are kept indefinitely on TSD: the original FASTQ files, the final BAM files, the variant files, the log files, and the NIPT sequencing and analysis data.

Backup policy

The network drive at OUS is backed up nightly by Sykehuspartner. NSC is responsible for internal backups of data at NSC. TSD makes tape backups every night and, in addition, keeps regular snapshots covering the last 3 days. Note that only the following directories are backed up at TSD:

  • p22/data/durable{2,3,4,5}
  • p22/home

The cluster (also known as colossus) at /cluster/projects/p22 is not backed up by TSD. We have therefore set up a Cron job that copies data from this area to durable. The script run by the Cron job is at /tsd/p22/data/durable/production/utilities/rsync_cluster.sh. The logs and error logs for that sync are in the rsynclogs subdirectory.
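
As an illustration of the backup rule, the sketch below tests whether a given path lies inside one of the backed-up directories. It assumes that the listed directories are relative to /tsd (consistent with DURABLE=/tsd/p22/data/durable); the function name `is_backed_up` is hypothetical.

```python
from pathlib import PurePosixPath

# Assumption: "p22/data/durable{2,3,4,5}" and "p22/home" are located under /tsd.
BACKED_UP_ROOTS = [
    PurePosixPath(f"/tsd/p22/data/durable{n}") for n in (2, 3, 4, 5)
] + [PurePosixPath("/tsd/p22/home")]


def is_backed_up(path: str) -> bool:
    """Return True if path is one of the backed-up directories or lies below one."""
    p = PurePosixPath(path)
    return any(p == root or root in p.parents for root in BACKED_UP_ROOTS)
```

With these assumptions, a path under /cluster/projects/p22 returns False, which is why the copy to durable is needed.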

Data compression (do NOT use the following steps until vcpipe-utilities v1.1.0 is released and deployed on TSD/NSC)

To save storage space, the BAM files from the initial mapping of all exome and genome samples (under preprocessed/singles/SAMPLE_FOLDER/data/mapping) should be compressed using the steps below. Only the compressed files (CRAM files) are kept; a BAM file can be deleted once its CRAM file is ready.

All the scripts mentioned below are located on TSD at /cluster/projects/p22/production/sw/vcpipe-utilities/src/compression.

In the following text:

  • PROJECT_NAME: the project whose mapping BAM files will be compressed (e.g. wgs90).
  • DESTINATION_PATH: optional, default is /cluster/projects/p22/production/compression/no-backup. A new directory named PROJECT_NAME will be generated under this path.

Make BAM files available on /cluster and run the compression process:

    python3 BAM2cram.py --project PROJECT_NAME

For example: python3 BAM2cram.py --project wgs. The script will copy mapping BAM files to each individual directory under DESTINATION_PATH/PROJECT_NAME and submit compression SLURM jobs.

Check whether the compression is completed:

    python3 checkCompression.py --path DESTINATION_PATH/PROJECT_NAME

For example: python3 checkCompression.py --path /cluster/projects/p22/production/compression/no-backup/wgs. After compression, the script will check that all the required files (bam.stats, cram.md5, crumble.cram, crumble.cram.crai, crumble.cram.stats, log) exist, that the number of reads (raw total sequences) is the same in bam.stats and cram.stats, and that this number equals the 'reads' value in the sample configuration file (.sample).

Rsync CRAM to the correct location on durables:

    python3 crumble2durable.py --project PROJECT_NAME

The script will copy the compression results to the corresponding 'mapping' directory on durable.

Delete BAM files on durable/cluster and directories under :

    python3 deleteDurableBAM.py --project PROJECT_NAME

Before deleting files in DESTINATION_PATH/PROJECT_NAME, the script will check that all compression files exist and that the durable mapping directories and the md5 checksums of CRAM and CRAI files are correct.
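
The kinds of checks described in the steps above (required files present, matching read counts, correct md5 checksums) can be illustrated with the following sketch. It is not the code of checkCompression.py or deleteDurableBAM.py. All function names are hypothetical, and it assumes that the listed names are file name suffixes and that the stats files are in `samtools stats` format, where the read count appears on a tab-separated "SN" line labelled "raw total sequences:".

```python
from __future__ import annotations

import hashlib

# Suffixes taken from the list of required files in this document.
REQUIRED_SUFFIXES = (
    "bam.stats", "cram.md5", "crumble.cram",
    "crumble.cram.crai", "crumble.cram.stats", "log",
)


def missing_outputs(filenames: list[str]) -> list[str]:
    """Return the required suffixes for which no file name matches."""
    return [
        s for s in REQUIRED_SUFFIXES
        if not any(n == s or n.endswith("." + s) for n in filenames)
    ]


def raw_total_sequences(stats_text: str) -> int:
    """Extract the 'raw total sequences' count from samtools-stats-style text."""
    for line in stats_text.splitlines():
        if line.startswith("SN\traw total sequences:"):
            return int(line.split("\t")[2])
    raise ValueError("no 'raw total sequences' line found")


def counts_consistent(bam_stats: str, cram_stats: str, sample_reads: int) -> bool:
    """True if the BAM count, the CRAM count and the .sample 'reads' value agree."""
    return (
        raw_total_sequences(bam_stats)
        == raw_total_sequences(cram_stats)
        == sample_reads
    )


def md5_matches(path: str, expected_hex: str) -> bool:
    """Compare the md5 digest of a file, read in chunks, with an expected hex digest."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.strip().lower()
```

Reading the file in chunks keeps memory use low, which matters for CRAM files of genome size.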

Patient requests for data export

If patients request access to their data, we export the data in encrypted form and deliver them on a USB thumb drive, as described in the EHB procedure "Utlevering av pasientdata". Such requests can only come from OUS managers.


A risk analysis for TSD is available at K:\Felles\KDI\AMG\Risikostyring\Risikovurdering\Vedlegg_ROS_TSD_230915.docx