HTS Bioinf - Storage and security of sensitive data
This document describes security policies for sensitive patient data from the HTS sequencers and the bioinformatic pipeline. It also describes the location of the data and how long they are stored.
Sensitive data
The sensitive data relevant for this procedure are:
- the raw patient DNA sequencing data (FASTQ files)
- the processed sequencing data (BAM files)
- the variants that the pipelines identify (VCF files)
- NIPT raw sequencing data, analysis results and relevant lab data
- the sample number itself (also known as sample ID), even in cases when no other information is connected to it
Data storage
Sensitive data are stored at the following locations:
- NSC servers (diagnostic area)
- TSD infrastructure at USIT, project p22
- OUS network drive connected to hospital PCs under
K:\Systemdata\MedGen\LAB\HTS\Tolkning HTS-data
Data security
In general, sensitive data must not be brought to a computer or hard drive outside the department. Storing and handling of sensitive data is restricted to the NSC servers or the TSD infrastructure, neither of which is connected to any outside network such as the internet. The data are also stored on servers controlled by Sykehuspartner and connected to hospital PCs. Dedicated portable hard drives that store data in an encrypted and password-protected manner can be used to transfer sensitive data between the servers mentioned above and a local computer. The local computer must be offline (no active WiFi, Ethernet or Bluetooth connections) as long as the hard drive is connected.
Data permissions
In TSD, only p22 members can access data. Files and directories are given the following permissions:
- Owners of files and directories have read, write and execute (rwx) permissions
- Members of the group p22-diag-ous-bioinfo-group have the same (rwx) permissions
- Others have no write permission on files or directories, only read and execute (rx)
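As a minimal illustration (not a prescribed command), this scheme corresponds to the following POSIX permissions. The directory name NEW_ANALYSIS is a hypothetical placeholder for an actual analysis directory:

```
# Hypothetical example: apply the p22 permission scheme (owner rwx, group rwx, others rx).
# NEW_ANALYSIS is a placeholder, not a real directory name.
chgrp -R p22-diag-ous-bioinfo-group /ess/p22/data/durable/production/data/analyses-results/NEW_ANALYSIS
chmod -R u=rwx,g=rwx,o=rx /ess/p22/data/durable/production/data/analyses-results/NEW_ANALYSIS
ls -ld /ess/p22/data/durable/production/data/analyses-results/NEW_ANALYSIS   # should show drwxrwxr-x
```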
Data locations
Sequencing data in the form of FASTQ files are delivered by NSC bioinformaticians to a specific location in the NSC infrastructure delivery area where HTS bioinformaticians only have read-access. Diagnostic bioinformaticians transfer the files to the TSD infrastructure project p22 according to the procedure "HTS Bioinf - Execution and monitoring of pipeline".
Paths on TSD
DURABLE=/ess/p22/data/durable
Short-name | Path | Description |
---|---|---|
deploys | | |
vcpipe | $DURABLE/production/sw/variantcalling/vcpipe | Deploy vcpipe code |
vcpipe public reference data | $DURABLE/production/reference/public | Deploy vcpipe public reference data |
vcpipe sensitive reference data | $DURABLE/production/reference/sensitive | Deploy vcpipe sensitive reference data |
tsd-import | $DURABLE/production/sw/automation/tsd-import | Deploy tsd-import code |
releases | $DURABLE/production/sw/archive | Releases are copied and deployed from here |
sensitive database code | $DURABLE/development/sw/sensitive-db-factory | Version controlled (git) source data |
production | | |
analyses-work | $DURABLE/production/data/analyses-work | Production analyses (Nextflow work directory) |
analyses-results | $DURABLE/production/data/analyses-results | Production analyses results after completed work |
samples | $DURABLE/production/data/samples | Production storage of sequence data and metadata |
interpretations | $DURABLE/production/interpretations | Production interpretations (output) |
EllA | $DURABLE/production/ella/ella-prod/data/analyses | Production EllA (output) |
archives | | |
analyses archive | $DURABLE/production/archive/analyses | Analyses archives |
software archive | $DURABLE/production/sw/archive | Software releases archives |
NIPT archive | $DURABLE/production/archive/nipt | NIPT lab, sequencing and output files backup |
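On a TSD login node, the short-names above can be resolved through the DURABLE variable, for example (illustrative only):

```
# Illustrative only: resolve a short-name from the table via the DURABLE base path.
DURABLE=/ess/p22/data/durable
ls "$DURABLE/production/sw/variantcalling/vcpipe"   # the vcpipe deploy
ls "$DURABLE/production/data/analyses-results"      # production analyses results
```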
Duration of file storage
After delivery from NSC, the original files (in the binary base call, BCL, format) are kept on the NSC infrastructure (/boston/diag/runs) for 14 days before being deleted. The rest of the files are deleted from NSC as soon as they have been successfully transferred to TSD.
These files - the original FASTQ files, the final BAM files, the variant files, the log files and NIPT sequencing and analysis data - are kept indefinitely on TSD.
Backup policy
The network drive at OUS is backed up nightly by Sykehuspartner. NSC is responsible for internal backups of data at NSC. TSD takes tape backups every night and, in addition, keeps regular snapshots covering the last 3 days. Note that only the following directories are backed up at TSD:
/ess/p22/data/durable
/ess/p22/home
Data compression [DEPRECATED: Do NOT read/run the following steps until vcpipe-utilities v1.1.0 is released and deployed on TSD/NSC]
In order to save storage space, the BAM files from the initial mapping of all exome and genome samples (/ess/p22/data/durable/production/data/analyses-results/singles/<ANALYSIS_DIRECTORY>/data/mapping) should go through the compression and deletion process described below. Only the compressed BAM files (CRAM files) are kept; a BAM file can be deleted once the corresponding CRAM file is ready. All scripts mentioned below are located in TSD at /ess/p22/data/durable/production/sw/utils/vcpipe-utilities/src/compression.
In the following text:
- PROJECT_NAME: the project whose mapping BAM files will be compressed (e.g. wgs90).
- DESTINATION_PATH: optional, default is /ess/p22/data/durable/production/compression/no-backup. A new directory named PROJECT_NAME will be generated under this path.
Make sure the BAM files are accessible from the computing nodes and run the compression process:
python3 BAM2cram.py --project PROJECT_NAME (e.g. python3 BAM2cram.py --project wgs)
The script will copy the mapping BAM files to individual directories under DESTINATION_PATH/PROJECT_NAME and submit compression jobs to the computing cluster.
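The exact job commands are defined inside vcpipe-utilities and may differ, but conceptually each submitted job performs quality compression and BAM-to-CRAM conversion along these lines (a sketch only, assuming samtools and Crumble are available and REF.fa stands for the mapping reference):

```
# Sketch of one compression job (not the actual vcpipe-utilities job script).
crumble sample.bam sample.crumble.bam                            # lossy compression of base qualities (assumed step)
samtools view -C -T REF.fa -o crumble.cram sample.crumble.bam    # convert to CRAM against the mapping reference
samtools index crumble.cram                                      # creates crumble.cram.crai
samtools stats sample.bam > bam.stats                            # read statistics for the original BAM
samtools stats --reference REF.fa crumble.cram > crumble.cram.stats   # read statistics for the CRAM
md5sum crumble.cram > cram.md5                                   # checksum recorded for later verification
```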
Check whether the compression is completed:
python3 checkCompression.py --path DESTINATION_PATH/PROJECT_NAME (e.g. python3 checkCompression.py --path /ess/p22/data/durable/production/compression/no-backup/wgs).
After compression, the script will check that all the required files (bam.stats, cram.md5, crumble.cram, crumble.cram.crai, crumble.cram.stats, log) exist, that the number of reads (raw total sequences) is equal in bam.stats and cram.stats, and that this number equals the 'reads' value in the sample configuration file (.sample).
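A manual spot-check of a single sample can be done along the same lines, assuming the .stats files are samtools stats output and the CRAM statistics live in crumble.cram.stats (checkCompression.py remains the authoritative check):

```
# Compare 'raw total sequences' between the BAM and CRAM statistics files (manual spot-check).
bam_reads=$(grep 'raw total sequences' bam.stats | head -n 1 | cut -f 3)
cram_reads=$(grep 'raw total sequences' crumble.cram.stats | head -n 1 | cut -f 3)
if [ "$bam_reads" = "$cram_reads" ]; then
    echo "OK: $bam_reads reads in both files"
else
    echo "MISMATCH: bam=$bam_reads cram=$cram_reads"
fi
```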
Rsync the CRAM files to the correct location on durable:
python3 crumble2durable.py --project PROJECT_NAME
The script will copy the compression results to the corresponding 'mapping' directory on durable.
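If a single sample ever needs to be copied back by hand, the equivalent rsync would look roughly like this (illustrative only; <SAMPLE> and <ANALYSIS_DIRECTORY> are placeholders, and the per-sample directory layout is assumed):

```
# Illustrative only: copy one sample's compression results back to its mapping directory on durable.
rsync -av "DESTINATION_PATH/PROJECT_NAME/<SAMPLE>/" \
  "/ess/p22/data/durable/production/data/analyses-results/singles/<ANALYSIS_DIRECTORY>/data/mapping/"
```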
Delete the original BAM files:
python3 deleteDurableBAM.py --project PROJECT_NAME
Before deleting files in DESTINATION_PATH/PROJECT_NAME, the script will check that all compression files exist and that the durable mapping directories and the MD5 checksums of the CRAM and CRAI files are correct.
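The MD5 part of that check can also be reproduced by hand in the durable mapping directory, assuming cram.md5 was written with md5sum alongside the CRAM file (sketch only; <ANALYSIS_DIRECTORY> is a placeholder):

```
# Manual verification of the copied CRAM against the checksum recorded at compression time (sketch).
cd /ess/p22/data/durable/production/data/analyses-results/singles/<ANALYSIS_DIRECTORY>/data/mapping
md5sum -c cram.md5    # reports OK if the CRAM file matches its recorded checksum
```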
Patient requests for data export
If patients request access to their data, we will export the data in encrypted form and deliver them on a USB thumb drive, as described in the EHB procedure "Utlevering av pasientdata". Such requests can only come from OUS managers.
A risk analysis for TSD is available at K:\Felles\KDI\AMG\Risikostyring\Risikovurdering\Vedlegg_ROS_TSD_230915.docx