
HTS Bioinf - Storage and security of sensitive data

This document describes the security policies for sensitive patient data from the HTS sequencers and the bioinformatics pipeline. It also describes where the data are stored and for how long.

Sensitive data

The sensitive data relevant for this procedure are:

  • the raw patient DNA sequencing data (FASTQ files)
  • the processed sequencing data (BAM files)
  • the variants that the pipelines identify (VCF files)
  • NIPT raw sequencing data, analysis results and relevant lab data
  • the sample number itself (also known as sample ID), even in cases when no other information is connected to it

Data storage

Sensitive data are stored at the following locations:

  • NSC servers (diagnostic area)
  • TSD infrastructure at USIT, project p22
  • OUS network drive connected to hospital PCs under K:\Systemdata\MedGen\LAB\HTS\Tolkning HTS-data

Data security

In general, sensitive data must not be brought onto a computer or hard drive outside the department. Storage and handling of sensitive data are restricted to the NSC servers and the TSD infrastructure, neither of which is connected to any outside network such as the Internet. The data are also stored on servers controlled by Sykehuspartner and connected to hospital PCs. Dedicated portable hard drives that store data in an encrypted and password-protected manner can be used to transfer sensitive data between the servers mentioned above and a local computer. The local computer must be offline (no active WiFi, Ethernet, or Bluetooth) for as long as the hard drive is connected.

Data permissions

In TSD, only p22 members can access data. Files and directories are given the following permissions (a sketch for auditing them follows the list):

  • Owners of files and directories have read, write and execute (rwx) permissions
  • Members of the group p22-diag-ous-bioinfo-group have the same (rwx) permissions
  • Others have no write permission on files or directories, only read and execute (rx)
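
A minimal sketch of how this scheme can be audited is shown below. It is an illustration only, not part of the official tooling; the script name and the example invocation are hypothetical. It walks a directory tree and reports anything that is writable by others.

    #!/usr/bin/env python3
    """Illustrative sketch: report files and directories under a given path that
    deviate from the scheme above, i.e. anything writable by others."""

    import os
    import stat
    import sys


    def check_permissions(root):
        for dirpath, dirnames, filenames in os.walk(root):
            for name in dirnames + filenames:
                path = os.path.join(dirpath, name)
                mode = stat.S_IMODE(os.lstat(path).st_mode)
                # Others must never have write permission.
                if mode & stat.S_IWOTH:
                    print("world-writable: {} ({})".format(path, oct(mode)))


    if __name__ == "__main__":
        # Example (hypothetical invocation): python3 check_permissions.py $DURABLE/production
        check_permissions(sys.argv[1])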

Data locations

Sequencing data in the form of FASTQ files are delivered by NSC bioinformaticians to a specific location in the NSC infrastructure delivery area, where HTS bioinformaticians have read access only. Diagnostic bioinformaticians transfer the files to the TSD infrastructure project p22 according to the procedure "HTS Bioinf - Execution and monitoring of pipeline".

Paths on TSD

DURABLE=/ess/p22/data/durable

Each entry below is listed as short-name: path (description).

deploys
  • vcpipe: $DURABLE/production/sw/variantcalling/vcpipe (deploy of vcpipe code)
  • vcpipe public reference data: $DURABLE/production/reference/public (deploy of vcpipe public reference data)
  • vcpipe sensitive reference data: $DURABLE/production/reference/sensitive (deploy of vcpipe sensitive reference data)
  • tsd-import: $DURABLE/production/sw/automation/tsd-import (deploy of tsd-import code)
  • releases: $DURABLE/production/sw/archive (releases are copied and deployed from here)
  • sensitive database code: $DURABLE/development/sw/sensitive-db-factory (version-controlled (git) source data)

production
  • analyses-work: $DURABLE/production/data/analyses-work (production analyses, Nextflow work directory)
  • analyses-results: $DURABLE/production/data/analyses-results (production analysis results after completed work)
  • samples: $DURABLE/production/data/samples (production storage of sequence data and metadata)
  • interpretations: $DURABLE/production/interpretations (production interpretations, output)
  • EllA: $DURABLE/production/ella/ella-prod/data/analyses (production EllA, output)

archives
  • analyses archive: $DURABLE/production/archive/analyses (analyses archives)
  • software archive: $DURABLE/production/sw/archive (software release archives)
  • NIPT archive: $DURABLE/production/archive/nipt (backup of NIPT lab, sequencing and output files)

Duration of file storage

After delivery from NSC, the original files (binary base calls, BCL format) are kept on the NSC infrastructure (/boston/diag/runs) for 14 days before being deleted. The remaining files are deleted from NSC as soon as they have been successfully transferred to TSD.

The transferred files (the original FASTQ files, the final BAM files, the variant files, the log files, and the NIPT sequencing and analysis data) are kept indefinitely on TSD.

Backup policy

The network drive at OUS is backed up nightly by Sykehuspartner. NSC is responsible for internal backups of data at NSC. TSD takes tape backups every night and, in addition, keeps regular snapshots covering the last 3 days. Note that only the following directories are backed up on TSD:

  • /ess/p22/data/durable
  • /ess/p22/home

Data compression [DEPRECATED: Do NOT read/run the following steps until vcpipe-utilities v1.1.0 is released and deployed on TSD/NSC]

To save storage space, the BAM files from the initial mapping of all exome and genome samples (/ess/p22/data/durable/production/data/analyses-results/singles/<ANALYSIS_DIRECTORY>/data/mapping) should go through the following process to be compressed and then deleted. Only the compressed BAM files (CRAM files) are kept; a BAM file can be deleted once its CRAM file is ready.
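
For orientation, the core of this process is a standard BAM-to-CRAM conversion with samtools. The sketch below illustrates the conversion of a single file; it is not the production BAM2cram.py script, and the file names and the reference FASTA path are placeholders.

    #!/usr/bin/env python3
    """Illustrative sketch only: convert one BAM file to CRAM, index it and
    collect alignment statistics. Not the production BAM2cram.py script."""

    import subprocess

    BAM = "sample.bam"                      # placeholder input
    CRAM = "sample.cram"                    # placeholder output
    REFERENCE = "/path/to/reference.fasta"  # reference FASTA used for CRAM encoding

    # Convert BAM to CRAM against the reference.
    subprocess.run(["samtools", "view", "-C", "-T", REFERENCE, "-o", CRAM, BAM], check=True)

    # Index the CRAM so it can be accessed randomly by downstream tools.
    subprocess.run(["samtools", "index", CRAM], check=True)

    # Collect alignment statistics; 'raw total sequences' from this output is
    # what the verification step compares between BAM and CRAM.
    with open("cram.stats", "w") as fh:
        subprocess.run(["samtools", "stats", CRAM], check=True, stdout=fh)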

All the scripts mentioned below are located in TSD at /ess/p22/data/durable/production/sw/utils/vcpipe-utilities/src/compression.

In the following text:

  • PROJECT_NAME: the project whose mapping BAM files will be compressed (e.g. wgs90).
  • DESTINATION_PATH: optional, default is /ess/p22/data/durable/production/compression/no-backup. A new directory named PROJECT_NAME will be generated under this path.

Make sure the BAM files are accessible from the computing nodes and start the compression:

    python3 BAM2cram.py --project PROJECT_NAME

For example: python3 BAM2cram.py --project wgs. The script copies the mapping BAM files to individual directories under DESTINATION_PATH/PROJECT_NAME and submits compression jobs to the computing cluster.
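
The sketch below illustrates the copy-and-submit pattern described above. It is not the real BAM2cram.py; it assumes the cluster uses Slurm (sbatch) and a hypothetical per-sample job script compress_one.sh.

    #!/usr/bin/env python3
    """Illustrative sketch of the copy-and-submit pattern (not BAM2cram.py):
    copy each mapping BAM into its own directory under the destination path
    and submit one compression job per sample."""

    import shutil
    import subprocess
    from pathlib import Path

    DESTINATION = Path("/ess/p22/data/durable/production/compression/no-backup")
    PROJECT = "wgs"  # example project name from the text


    def submit_compression_jobs(bam_files):
        for bam in bam_files:
            workdir = DESTINATION / PROJECT / bam.stem
            workdir.mkdir(parents=True, exist_ok=True)
            shutil.copy2(str(bam), str(workdir / bam.name))  # copy the BAM next to its job
            # Submit the (hypothetical) per-sample job script to the cluster.
            subprocess.run(
                ["sbatch", "--chdir", str(workdir), "compress_one.sh", bam.name],
                check=True,
            )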

Check whether the compression has completed:

    python3 checkCompression.py --path DESTINATION_PATH/PROJECT_NAME

For example: python3 checkCompression.py --path /ess/p22/data/durable/production/compression/no-backup/wgs. After compression, the script checks that all required files (bam.stats, cram.md5, crumble.cram, crumble.cram.crai, crumble.cram.stats, log) exist, that the number of reads ('raw total sequences') is the same in the bam.stats and cram.stats files, and that this number equals 'reads' in the sample configuration file (.sample).
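
As a rough illustration of the read-count check (not the real checkCompression.py), the sketch below compares 'raw total sequences' between the two statistics files produced by samtools stats, assuming 'cram.stats' refers to the crumble.cram.stats file listed above; the expected read count would be taken from the .sample configuration file.

    #!/usr/bin/env python3
    """Illustrative sketch of the read-count consistency check (not the real
    checkCompression.py)."""

    import re
    from pathlib import Path


    def raw_total_sequences(stats_file):
        # 'samtools stats' writes summary lines such as:
        # SN<TAB>raw total sequences:<TAB>123456789
        for line in Path(stats_file).read_text().splitlines():
            match = re.match(r"SN\traw total sequences:\t(\d+)", line)
            if match:
                return int(match.group(1))
        raise ValueError("'raw total sequences' not found in {}".format(stats_file))


    def counts_match(sample_dir, expected_reads):
        sample_dir = Path(sample_dir)
        bam_reads = raw_total_sequences(sample_dir / "bam.stats")
        cram_reads = raw_total_sequences(sample_dir / "crumble.cram.stats")
        # All three numbers must agree before the BAM is considered safe to replace.
        return bam_reads == cram_reads == expected_reads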

Rsync the CRAM files to the correct location on durable:

    python3 crumble2durable.py --project PROJECT_NAME

The script copies the compression results to the corresponding 'mapping' directory on durable.
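
A minimal sketch of such a checksum-verified copy is shown below, assuming plain rsync is available on the host; it is not the real crumble2durable.py, and the directory arguments are placeholders.

    #!/usr/bin/env python3
    """Illustrative sketch of copying compression results back to durable
    (not the real crumble2durable.py)."""

    import subprocess


    def copy_results_to_durable(result_dir, mapping_dir):
        # --archive preserves permissions and timestamps; --checksum verifies
        # file content rather than relying on size and modification time.
        subprocess.run(
            ["rsync", "--archive", "--checksum", str(result_dir) + "/", str(mapping_dir) + "/"],
            check=True,
        )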

Delete the original BAM files:

    python3 deleteDurableBAM.py --project PROJECT_NAME

Before deleting files in DESTINATION_PATH/PROJECT_NAME, the script checks that all compression files exist, that the durable mapping directories are correct, and that the MD5 checksums of the CRAM and CRAI files are correct.
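
The sketch below shows the kind of MD5 verification described above, assuming cram.md5 holds the checksum recorded at compression time; it is not the real deleteDurableBAM.py.

    #!/usr/bin/env python3
    """Illustrative sketch of the MD5 safety check performed before deleting a
    BAM file (not the real deleteDurableBAM.py)."""

    import hashlib
    from pathlib import Path


    def md5sum(path):
        digest = hashlib.md5()
        with open(path, "rb") as fh:
            for chunk in iter(lambda: fh.read(1024 * 1024), b""):
                digest.update(chunk)
        return digest.hexdigest()


    def safe_to_delete_bam(mapping_dir):
        mapping_dir = Path(mapping_dir)
        # cram.md5 is assumed to hold the checksum recorded at compression time.
        recorded = (mapping_dir / "cram.md5").read_text().split()[0]
        return md5sum(mapping_dir / "crumble.cram") == recorded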

Patient requests for data export

If patients request access to their data, we export the data in encrypted form and deliver them on a USB thumb drive, as described in the EHB procedure "Utlevering av pasientdata". Such requests can only come from OUS managers.


A risk analysis for TSD is available at K:\Felles\KDI\AMG\Risikostyring\Risikovurdering\Vedlegg_ROS_TSD_230915.docx