Architecture
System
Sample
A sample is defined as a set of reads from a single library on a sequencer. It consists of two fastq files, a sample metadata file and FastQC quality information (optional for NovaSeq X).
Currently a sample metadata file looks like the following:
{
    "index": "GCCAAT",
    "lane": 3,
    "stats": {
        "one_mismatch_index_pct": 2.24,
        "reads": 121434860,
        "mean_qual_score": 34.98,
        "q30_bases_pct": 88.94,
        "perfect_index_reads_pct": 97.76
    },
    "sequencer_id": "D00132",
    "flowcell_id": "C6HWKANXX",
    "reads": [
        {
            "path": "123456789-PCD-v01-KIT-Av5_GCCAAT_L007_R1_001.fastq.gz",
            "md5": "7ef7def836c94bd89fcac7d00f230d59",
            "size": 1732022
        },
        {
            "path": "123456789-PCD-v01-KIT-Av5_GCCAAT_L007_R2_001.fastq.gz",
            "md5": "dfd786e8c36c373f322cbbd08a67d8d4",
            "size": 1699397
        }
    ],
    "project": "Diag-excap01",
    "project_date": "2015-02-19",
    "flowcell": "A",
    "sample_id": "123456789",
    "capturekit": "agilent_sureselect_v05",
    "sequence_date": "2015-02-26",
    "taqman": "Diag-excap01-123456789.taqman",
    "name": "Diag-excap01-123456789"
}
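The "reads" entries carry a path, an MD5 checksum and a size for each fastq file, so a sample folder can be checked against its metadata before use. The following is a minimal sketch of such a check, not the executor's actual code. The function names check_sample and md5sum are hypothetical, and it assumes that each "path" is relative to the folder holding the .sample file and that "size" is the file size in bytes.

```python
import hashlib
import json
from pathlib import Path


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_sample(sample_file):
    """Return a list of problems found for the reads listed in a .sample file."""
    sample_file = Path(sample_file)
    metadata = json.loads(sample_file.read_text())
    problems = []
    for read in metadata["reads"]:
        # Assumption: read paths are relative to the folder of the .sample file.
        fastq = sample_file.parent / read["path"]
        if not fastq.is_file():
            problems.append(f"{read['path']}: missing")
        elif fastq.stat().st_size != read["size"]:
            # Assumption: "size" is the file size in bytes.
            problems.append(f"{read['path']}: size mismatch")
        elif md5sum(fastq) != read["md5"]:
            problems.append(f"{read['path']}: md5 mismatch")
    return problems
```

An empty list means every listed read file exists and matches its recorded size and checksum.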
Analysis
An analysis is defined as one single entry process, performing an analysis on one or more samples. It currently has these properties:
- name: The name of this analysis. Must be unique for all analyses.
- type: What kind of analysis to perform on the sample(s). In practice this is the name of the pipeline to use.
- sampleId: One or more sample ids to use for the specified analysis type. Some analysis types require a certain number of samples.
- params: All other parameters relevant for the specified analysis type. These will vary from type to type, depending on what parameters they need.
On the file system, an analysis should be in its own folder, containing a JSON metadata file with the extension .analysis. Example of an analysis metadata file:
{
    "params": {
        "taqman": "true"
    },
    "type": "basepipe",
    "name": "Diag-excap01-123456789",
    "samples": [
        "Diag-excap01-123456789"
    ]
}
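A loader for this file could look roughly like the sketch below. It is an illustration only: the function parse_analysis and the set REQUIRED_KEYS are hypothetical, and the key names follow the example file above (which uses "samples", whereas the property list calls the field sampleId). Treating all four keys as mandatory is also an assumption.

```python
import json

# Assumption: key names are taken from the example .analysis file.
REQUIRED_KEYS = {"name", "type", "samples", "params"}


def parse_analysis(text):
    """Parse the JSON text of an .analysis file and run basic sanity checks."""
    data = json.loads(text)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not isinstance(data["samples"], list) or not data["samples"]:
        raise ValueError("'samples' must be a non-empty list")
    return data
```

Pipeline-specific rules, such as how many samples a given type requires, would have to be checked separately per analysis type.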
Internally, an analysis must be in one of the following states: PENDING, DEPENDENCIES, RUNNING, INVALID, QCFAILED, FAILED or COMPLETE.
If an analysis is imported but its listed dependencies are missing, it will be set to the DEPENDENCIES state, meaning it is waiting for one or more dependencies to be fulfilled. Normally this happens if the analysis is imported before the samples it depends on.
If it is impossible to perform an analysis due to bad configuration, it will be set to INVALID.
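The state names and the import-time rules above can be sketched as an enum plus a small decision function. This is a hedged illustration rather than the executor's implementation: AnalysisState and initial_state are hypothetical names, "bad configuration" is narrowed here to an unknown type or an empty sample list, and PENDING is assumed to be the state of an analysis that is ready to run.

```python
from enum import Enum


class AnalysisState(Enum):
    PENDING = "PENDING"
    DEPENDENCIES = "DEPENDENCIES"
    RUNNING = "RUNNING"
    INVALID = "INVALID"
    QCFAILED = "QCFAILED"
    FAILED = "FAILED"
    COMPLETE = "COMPLETE"


def initial_state(analysis, known_samples, known_types):
    """Pick the state of a freshly imported analysis (illustrative rules only)."""
    # Assumption: an unknown type or an empty sample list counts as bad configuration.
    if analysis.get("type") not in known_types or not analysis.get("samples"):
        return AnalysisState.INVALID
    # Samples not imported yet: wait for the dependencies to be fulfilled.
    if any(name not in known_samples for name in analysis["samples"]):
        return AnalysisState.DEPENDENCIES
    # Assumption: PENDING means the analysis is ready to be started.
    return AnalysisState.PENDING
```

For example, an analysis of type "basepipe" that lists a sample not yet imported would come out as DEPENDENCIES under these rules.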
Repositories
The executor searches repositories for new data to add. You can set the paths to the sample and analysis repositories individually in the configuration file. Normally one would structure a repository in the following way:
/path/to/repo
├── analyses
│   ├── Diag-excap01-123456789
│   │   ├── Diag-excap01-123456789.analysis
│   │   └── READY
│   └── Diag-excap01-123456789-EEogPU_v02
│       ├── Diag-excap01-123456789-EEogPU_v02.analysis
│       └── READY
└── samples
    └── Diag-excap01-123456789
        ├── 123456789-PCD-v01-KIT-Av5_GCCAAT_L007_R1_001_fastqc (optional for NovaSeq X)
        ├── 123456789-PCD-v01-KIT-Av5_GCCAAT_L007_R1_001.fastq.gz
        ├── 123456789-PCD-v01-KIT-Av5_GCCAAT_L007_R2_001_fastqc (optional for NovaSeq X)
        ├── 123456789-PCD-v01-KIT-Av5_GCCAAT_L007_R2_001.fastq.gz
        ├── Diag-excap01-123456789.sample
        ├── Diag-excap01-123456789.taqman
        ├── LIMS_EXPORT_DONE
        └── READY
Results from the analysis will end up in a result/ folder inside the respective analysis folder, along with log files.
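A repository scan over this layout might be sketched as follows. The text does not spell out what READY means, so this sketch assumes it is a marker file signalling that a folder is complete and safe to import. It also assumes the metadata file is named after its folder, as in the tree above. The function name find_ready is hypothetical.

```python
from pathlib import Path


def find_ready(repo_dir, extension):
    """Return metadata files of folders in repo_dir that contain a READY marker.

    extension is ".sample" or ".analysis".
    """
    ready = []
    for folder in sorted(Path(repo_dir).iterdir()):
        if not folder.is_dir():
            continue
        # Assumption: the metadata file shares its name with the folder.
        metadata = folder / f"{folder.name}{extension}"
        # Assumption: READY marks a folder as complete and importable.
        if (folder / "READY").exists() and metadata.is_file():
            ready.append(metadata)
    return ready
```

Called as find_ready("/path/to/repo/analyses", ".analysis"), this would list both analysis files in the example layout, provided their READY markers are present.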
Database
All samples and analyses are tracked using a database. The database is assumed to be a standard PostgreSQL database. See the image below for a schema overview.