HTS Bioinf - Storage and security of sensitive data
This document describes security policies for the patient sensitive data from the HTS sequencers and the bioinformatic pipeline. It also describes the location of the data and how long they are stored.
The sensitive data relevant for this procedure are:
- the raw sequencing data of patient DNA (FASTQ files)
- the processed sequencing data (BAM files)
- the variants that the pipelines identify (VCF files)
- NIPT raw sequencing data, analysis results and relevant lab data
- the sample number itself (also known as sample ID), even in cases when no other information is connected to it
Sensitive data are stored at the following locations:
- NSC servers (diagnostic area)
- TSD infrastructure at USIT, project p22
- OUS network drive connected to hospital PCs under
In general, sensitive data must not be brought to a computer or hard drive outside the department. Storage and handling of sensitive data are restricted to the NSC servers or the TSD infrastructure, neither of which is connected to any outside network such as the Internet. The data are also stored on servers controlled by Sykehuspartner and connected to hospital PCs. Dedicated portable hard drives that store data in encrypted, password-protected form can be used to transfer sensitive data between these servers and a local computer. The local computer must be offline (no active WiFi, Ethernet, or Bluetooth) for as long as the hard drive is connected.
In TSD, only p22 members can access data. Files and directories are given the following permissions:
- Owners of files and directories have read, write and execute (rwx) permissions
- p22-diag-ous-bioinfo-group members have similar (rwx) permissions
- Others have no write permission on files or directories, only read and execute (rx)
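The permission scheme above corresponds to the octal mode 775. As a minimal sketch (the constant name `P22_MODE` and the helper `apply_mode` are hypothetical and not part of any deployed tooling), the mode can be composed from the standard `stat` flags and applied like this:

```python
import os
import stat
import tempfile

# Owner rwx, group rwx, others r-x (no write), as listed above.
P22_MODE = (
    stat.S_IRWXU                      # owner: read, write, execute
    | stat.S_IRWXG                    # group: read, write, execute
    | stat.S_IROTH | stat.S_IXOTH     # others: read and execute only
)


def apply_mode(path: str, mode: int = P22_MODE) -> str:
    """Set the permission bits on path and return them in ls-style notation."""
    os.chmod(path, mode)
    return stat.filemode(os.stat(path).st_mode)


if __name__ == "__main__":
    # Demonstrated on a throwaway directory, never on production data.
    with tempfile.TemporaryDirectory() as tmp:
        print(oct(P22_MODE), apply_mode(tmp))
```

Group ownership is a separate setting and is not handled by this sketch.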
Sequencing data in the form of FASTQ files are delivered by NSC bioinformaticians to a specific location in the NSC infrastructure delivery area, where HTS bioinformaticians have read access only. Diagnostic bioinformaticians transfer the files to the TSD infrastructure project p22 according to the procedure "HTS Bioinf - Execution and monitoring of pipeline".
Paths on TSD
||Deploy vcpipe code|
||Deploy custom code|
||Deploy tsd-import code|
||Deploy sensitive-db data|
||Releases first copied here and deployed from here|
||Version controlled (git) source data|
||Production samples (input)|
||Production analyses (output)|
||Production interpretations (output)|
||Production EllA (output)|
|archives (updated by TSD Cron job)|
||Software releases archives|
||Cluster archives (all but samples, analyses, software)|
||NIPT sequencing backup|
Duration of file storage
After delivery from NSC, the original files (the binary base call, BCL format) are kept at the NSC infrastructure for one month before being deleted. The rest of the files are deleted from NSC as soon as they have been successfully transferred to TSD.
These files - the original FASTQ files, the final BAM files, the variant files, the log files and NIPT sequencing and analysis data - are kept indefinitely on TSD.
The network drive at OUS is backed up nightly by Sykehuspartner. NSC is responsible for internal backups of data at NSC. TSD takes tape-backups every night and in addition regular snapshots for the last 3 days. Note that only the following directories are backed up at TSD:
The cluster (aka colossus) at /cluster/projects/p22 is not backed up by TSD. Therefore we set up a Cron job that copies data from this area to durable.
The Cron source is at
The logs and error logs for that sync are in the
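As an illustration of what such a sync could look like, the sketch below only builds an rsync argument list and executes nothing. It assumes rsync is the copy tool. The destination `DURABLE_ARCHIVE` is a placeholder, the exclude patterns are modelled on the "all but samples, analyses, software" note in the paths table, and the function name is hypothetical. The actual Cron source may differ.

```python
import shlex
from typing import List, Sequence


def build_sync_command(source: str, destination: str,
                       excludes: Sequence[str] = ()) -> List[str]:
    """Build an rsync argument list; nothing is executed here."""
    cmd = ["rsync", "--archive"]
    for pattern in excludes:
        cmd.append(f"--exclude={pattern}")
    # A trailing slash on the source makes rsync copy the directory
    # contents rather than the directory itself.
    cmd += [source.rstrip("/") + "/", destination]
    return cmd


if __name__ == "__main__":
    # DURABLE_ARCHIVE is a placeholder, not a real path.
    cmd = build_sync_command("/cluster/projects/p22", "DURABLE_ARCHIVE",
                             excludes=["samples", "analyses", "software"])
    print(shlex.join(cmd))
```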
Data compression (Do NOT use the following steps until vcpipe-utilities v1.1.0 is released and deployed on TSD/NSC)
In order to save storage space, the BAM files of all exome and genome samples from the initial mapping (preprocessed/singles/SAMPLE_FOLDER/data/mapping) should go through the following process to be compressed and deleted. Only compressed BAM files (CRAM files) will be saved. The BAM file can be deleted when the CRAM file is ready.
All the scripts mentioned below are located on TSD at
In the following text:
- PROJECT_NAME: the project whose mapping BAM files will be compressed (e.g. wgs90).
- DESTINATION_PATH: optional, default is /cluster/projects/p22/production/compression/no-backup. A new directory named PROJECT_NAME will be generated under this path.
Make BAM files available on /cluster and run the compression process:
python3 BAM2cram.py --project PROJECT_NAME (e.g. python3 BAM2cram.py --project wgs)
The script will copy mapping BAM files to each individual directory under DESTINATION_PATH/PROJECT_NAME and submit compression SLURM jobs.
Check whether the compression is completed:
python3 checkCompression.py --path DESTINATION_PATH/PROJECT_NAME (e.g. python3 checkCompression.py --path /cluster/projects/p22/production/compression/no-backup/wgs)
After compression, the script will check that all the required files (bam.stats, cram.md5, crumble.cram, crumble.cram.crai, crumble.cram.stats, log) exist, that the number of reads (raw total sequences) is equal in the bam.stats and cram.stats, and that this number is equal to the 'reads' value in the sample configuration file (.sample).
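The checks described above can be sketched as follows. This is not the code of checkCompression.py. It is a minimal illustration that assumes files are named `<sample>.<suffix>` and that the stats files use the tab-separated `SN	raw total sequences:	N` line produced by `samtools stats`. All function names are hypothetical.

```python
import re
from pathlib import Path
from typing import List, Optional

# Suffixes listed in the text; the "<sample>.<suffix>" naming is an assumption.
REQUIRED_SUFFIXES = ["bam.stats", "cram.md5", "crumble.cram",
                     "crumble.cram.crai", "crumble.cram.stats", "log"]


def missing_files(directory: Path, sample: str) -> List[str]:
    """Return the required suffixes for which no file exists."""
    return [suffix for suffix in REQUIRED_SUFFIXES
            if not (directory / f"{sample}.{suffix}").is_file()]


def raw_total_sequences(stats_text: str) -> Optional[int]:
    """Extract the 'raw total sequences' count from samtools-stats-style text."""
    match = re.search(r"^SN\traw total sequences:\t(\d+)",
                      stats_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_counts_agree(bam_stats: str, cram_stats: str,
                      expected_reads: int) -> bool:
    """True when both stats texts and the sample configuration agree."""
    bam = raw_total_sequences(bam_stats)
    cram = raw_total_sequences(cram_stats)
    return bam is not None and bam == cram == expected_reads
```

A sample would pass only if `missing_files` returns an empty list and `read_counts_agree` returns True, where `expected_reads` is the 'reads' value read from the .sample file.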
Rsync CRAM to the correct location on durable:
python3 crumble2durable.py --project PROJECT_NAME
The script will copy the compression results to the corresponding 'mapping' directory on durable.
Delete BAM files on durable/cluster and directories under :
python3 deleteDurableBAM.py --project PROJECT_NAME
Before deleting files in DESTINATION_PATH/PROJECT_NAME, the script will check that all compression files exist and that the durable mapping directories and the md5 checksums of CRAM and CRAI files are correct.
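The checksum part of this safety check can be illustrated as below. This is a sketch under assumptions and not the implementation of deleteDurableBAM.py. It assumes the .md5 file follows the `md5sum` layout (hex digest first, then the file name), and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(data_file: Path, md5_file: Path) -> bool:
    """Compare a file's MD5 with the first field of an md5sum-style file."""
    fields = md5_file.read_text().split()
    return bool(fields) and fields[0].lower() == md5_of_file(data_file)


def safe_to_delete(cram: Path, cram_md5: Path) -> bool:
    """Allow deletion of the BAM only if the CRAM exists and its checksum matches."""
    return cram.is_file() and cram_md5.is_file() and checksum_matches(cram, cram_md5)
```

Reading in chunks keeps memory use constant, which matters for genome-sized CRAM files.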
Patient requests for data export
If patients request access to their data, we will export the data in encrypted form and deliver them on a USB thumb drive, as described in EHB procedure "Utlevering av pasientdata". Such requests can only come from OUS managers.
A risk analysis for TSD is available at