HTS Bioinf - Execution and monitoring of pipeline
Scope
This procedure explains how to perform daily bioinformatic production: monitoring and running the pipeline to analyse diagnostic HTS samples, and informing the units about errors and deliveries.
Responsibility
Responsible person: A qualified bioinformatician on production duty
Overview - production flow
Common terms and locations
- Automation system - the system that automatically launches the pipelines for incoming analyses and tracks their status in its own database. The automation system project is named `vcpipe`, and the component responsible for running the analyses is named `executor`.
- `interpretations/` - the directory for delivering samples to the lab.
- `production/` - the directory for production data. It contains several subdirectories:
  - `production/data/analyses-work` - analyses imported into the system
  - `production/data/samples` on NSC and `/ess/p22/archive/no-backup/samples` on TSD - samples imported into the system
  - `production/sw` - software used in production
  - `production/logs` - logs from the automation system and its UI
  - `production/tmp` - if the system requires it (low disk space on `/tmp`), a tmp folder for use with the `TMPDIR` environment variable
- `verifications` folder on NSC - a location to place `.bam` (`.bai`) files as well as the output of fingerprinting when a fingerprinting test failed on a WGS sample
- `serviceuser` - a non-personal user for running processes. A special script is used to log in as this user. See {serviceuser login}
This document refers to some common folders on the platforms where the pipeline is run.
On NSC the pipeline-relevant data is stored on the server `boston`, available as `/boston/diag`.
Term | NSC | TSD |
---|---|---|
durable | - | /ess/p22/data/durable |
ella | - | /ess/p22/data/durable/production/ella |
script-home | /boston/diag/transfer/sw | /ess/p22/data/durable/production/sw/automation/tsd-import/script |
production | /boston/diag/production | /ess/p22/data/durable/production |
transfer | /boston/diag/transfer/production | /ess/p22/data/durable/s3-api/production |
ella-import-folder | - | 'ella'/ella-prod/data/analyses/imported |
Planning production
The production coordinator or, in their absence, the bioinformatics coordinator (see procedure HTS Bioinf Group roles) sets up a planned schedule every quarter (production quarter plan) and appoints two trained bioinformaticians for production duty: one as the main responsible bioinformatician, the other as the back-up bioinformatician for when the main responsible bioinformatician is not available (e.g. sick). The plan should be updated on the webpage https://gitlab.com/ousamg/docs/wiki/-/wikis/production/OnDuty_production when it is decided.
Start production
1. Registration of production duty
The main responsible bioinformatician should check whether the information about "Start Date", "Main responsible bioinfo." and "Back-up bioinfo." for the current production interval on the schedule page is correct. If the information is not correct, the main responsible bioinformatician should update the page with the right information. All names are registered with Signaturkode (OUS user name).
2. Starting lims-exporter-api and nsc-exporter
- In the NSC network, log into the server `beta`:
- Start the lims-exporter-api:
- Log into `sleipnir` from `beta`:
- Start the `nsc-exporter` (as your personal user):

The setup of tacl for transfer of files to TSD requires 2FA, which the serviceuser doesn't have.
The files that end up in TSD's transfer area are owned by `nobody:p22-member-group` regardless of the user that initiated the transfer.
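The exact start commands are not reproduced here. As a hedged illustration only (the script locations under {script-home} are assumptions; the screen session names match the stop commands later in this document):

```bash
# Hypothetical sketch: on beta, start lims-exporter-api in a detached screen session
screen -dm -S lims-exporter {script-home}/lims-exporter-api.sh

# Hypothetical sketch: on sleipnir, as your personal user, start nsc-exporter
screen -dm -S nsc-exporter {script-home}/nsc-exporter.sh
```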
3. Starting filelock-exporter-api
On TSD, log into `p22-cluster-sync` (or `p22-submit-dev` when `p22-cluster-sync` is not available):
screen -dm -S filelock-exporter /ess/p22/data/durable/production/sw/automation/filelock-exporter-api
4. Starting webui, executor and NSC sequencing overview page
On TSD:
- log into p22-submit:
- log into p22-submit2:
  The UI is running on port `8080`; you can open the UI in a web browser, e.g. at http://p22-submit2:8080
- log into p22-submit2 or other p22 submit nodes:
On NSC:
- log into diag-executor:
- log into diag-webui:
  {serviceuser login} diag-webui.ous.nsc.local
  source {production}/sw/nsc-screen-commands.txt
  nsc-start-webui
The UI is running on port `1234`, so you may want to forward that port to your local machine:
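For example, assuming you can reach diag-webui directly over ssh (add a jump host if needed):

```bash
# Forward local port 1234 to the webui port 1234 on diag-webui
ssh -L 1234:localhost:1234 diag-webui.ous.nsc.local
```

Then open http://localhost:1234 in your local browser.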
Everyday duties
1. Re-login to ensure services health
- Check that production services are up every week (Monday)

  Log into all VMs that run production services using `{serviceuser}/login.sh`. The table below shows the suggested distribution of services. This may be slightly different if some VMs are not available at a time.

VM | service |
---|---|
p22-app-01 | anno |
p22-submit | executor |
p22-submit2 | webui, nsc-overview |
p22-submit-dev | backup for other VMs |
p22-cluster-sync | filelock |
p22-ella-01 | ELLA |
- Login Kerberos key on TSD for individual users

  Access to either login or cluster nodes needs a valid Kerberos ticket; an expired or invalid ticket will deny access. However, `serviceuser` lacks this ticket, so this issue only applies to individual users. Run `klist` on the command line on the TSD login node. If the ticket is close to expiring, re-login to the server to renew it.
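A minimal check, assuming a standard Kerberos setup on the login node:

```bash
# List cached Kerberos tickets and their expiry times
klist
# If the ticket is close to expiring, log out and back in to the login node to refresh it
```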
2. Check status of sample processing
While a project is in progress, check the automation system user interface (`webui`) a few times during the day for any failing analyses or other anomalies.
Check that the samples/analyses are imported into the system and that they start running. Compare the number of new analyses in the system to the number expected from the new project. Note that if a sample is referred to a 'custom' or 'Baerer' gene panel, there will be no annopipe analysis folder for the sample.
For the EKG Blank sample: if the number of reads is over the threshold, only the Blank sample will be processed by the lims-exporter-api; the other samples from the same project will be sent to Manager review in Clarity. The Blank sample will be analyzed first and the results from both basepipe and annopipe need to be force-delivered from the UI. EKG needs to be informed when the results are ready, and they will decide whether to continue with the other sample analyses or not.
If there are any failing analyses or other anomalies in the processing, see exceptions below for detail. If it is not covered, investigate the issue (see below) and raise the issue with the bioinformatics group immediately (and - if relevant - with other concerned parties).
3. Failing QC or Failing SNP fingerprinting test
If the QC fails, the pipeline will continue running and the analysis is marked as "QC Failed". Investigate the criteria that failed by looking at the "QC" section of the analysis in the UI. The basepipe and triopipe processed data will not be copied to the corresponding preprocessed directories.
Samples/runs that fail QC must be discussed with the lab. The solutions for some typical scenarios are described in the procedure HTS - Samples that fail QC in bioinformatic pipeline. Relevant details must be documented in Clarity. If the final decision is to deliver results, this can be done by clicking "Deliver analysis" under "Post" of the analysis in the UI.
If the SNP fingerprinting test fails, the pipeline will stop running and the analysis is marked as "Failed". Investigate how many samples are affected and whether the cause is a too low number of sequencing reads (especially for EKG target sequencing samples). If the number of mismatch sites is larger than 3 and other analyses from the same project are still running, you may have to wait for them to complete.
If the lab wants to create a new taqman/fingerprint file and re-run the pipeline, you must add the sample name (Diag-Project-sampleID) to the cleaning whitelists (TSD: {production}/sw; NSC: /boston/diag/transfer/sw). This will keep the sample files on NSC and TSD, and LIMS Exporter will automatically pick up the analyses when the lab adds them back to the LIMS Exporter queue in Clarity.
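For example, a hedged sketch on NSC; the whitelist file name is an assumption, check the actual file name under /boston/diag/transfer/sw (or {production}/sw on TSD):

```bash
# Hypothetical whitelist file name: keep this sample's files from being cleaned
echo "Diag-wgs158-12345678910" >> /boston/diag/transfer/sw/cleaning-whitelist.txt
```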
If a new TaqMan SNP-ID result is generated, make sure the updated TaqMan SNP-ID result is copied to the production/samples/{sample_name} folder (usually taken care of by the filelock), and then re-run the analysis through the webUI (by clicking "Resume analysis").
Generate a summary of failed and passed samples and inform the lab as soon as possible. See the procedure HTS - Mismatch between TaqMan SNP-ID and sequencing data for further investigations.
4. Result delivery and Clarity update
When pipeline results for a project are ready, the actions to take differ depending on which unit owns the project. The result will normally be imported automatically into ELLA a few hours after the pipeline has successfully finished.
The sample can be moved to the next step in Clarity only when:
- the analysis on the sample is in the ELLA database,
- or, if no analysis (e.g. gene panel) is referred to the sample, the sample is available in `sample-repo-prod/samples.json`
After confirming that the samples are in the ELLA database (see HTS Bioinf - ELLA core production for how to go into the ELLA database), send a delivery email to the respective unit:
- EKG ('EKG' in the project name): `EKG` mailing list at OUS, and `diag-lab` and `bioinf-prod` at UiO
- EGG ('excap' or 'wgs' in the project name): `diag-lab` and `bioinf-prod` at UiO
- EHG ('EHG' in the project name or the sample is referred to the Cardio gene panel): `ehg-hts` and `bioinf-prod` at UiO
When there are errors, send email to the same recipients.
For priority 2 and 3 samples, once the samples are in ELLA, a delivery email should be sent as soon as possible.
5. Handling control samples
When a control sample (sample NA12878, HG002, HG003, HG004) is in the project, the trend analyses should be carried out by the bioinformatician on production duty as described in the procedure HTS - Use of NA samples for quality control.
6. Reanalysis of samples
All reanalyses in the lims-exporter queue in Clarity need to rerun the basepipe pipeline, either to make basepipe results available or to make the analysis consistent with that of newly sequenced samples.
All the analyses folders will be automatically generated by lims-exporter-api.
If the reanalysis type is not a hybrid trio, please confirm that the folder is not available in `analyses-results/singles` or `analyses-results/trios`. To do that, check whether the analysis name is present as an entry in `/ess/p22/data/durable/production/anno/sample-repo-prod/samples.json`. If it is in `samples.json`, notify the lab that the sample can be reanalysed directly from ELLA. If it is not, or if it needs a rerun of basepipe (for these cases, see the procedure HTS - Custom genpanel og reanalyse av sekvensdata), then do the following steps:
- An empty `READY` file is required in the individual sample folder, and the folder needs full permissions for the group `p22-member-group`.
- The sample folder archive is located on TSD: `/ess/p22/archive/no-backup/samples`.
- If basepipe or triopipe results already exist in `/ess/p22/data/durable/production/data/analyses-results/singles` or `/ess/p22/data/durable/production/data/analyses-results/trios`, the folder containing the previous results needs to be renamed from `{folder_name}` to `{folder_name}-YYYY-MM-DD`, e.g. `{folder_name}-2022-02-15`, so the previous results will not be overwritten (see the rename example after this list). Do NOT use the cpSample.py script until vcpipe-utilities v1.0.0 is released and deployed on TSD/NSC. You can run the following command to print out the move commands: `python3 /ess/p22/data/durable/production/sw/utils/vcpipe-utilites/src/production/cpSample.py`
  You can add `--blacklist` with project names for any ignored samples, e.g. `--blacklist wgs158,wgs159` for skipping all analyses in `wgs158` and `wgs159`.
- If the basepipe, triopipe or annopipe analysis has already been registered in the production database, the executor will not start the analysis automatically. One needs to start the analysis through the webui.
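A minimal sketch of the rename step above, assuming a hypothetical single-sample folder name; adjust the path, folder name and date to the actual analysis:

```bash
# Archive previous basepipe results before a reanalysis so they are not overwritten
cd /ess/p22/data/durable/production/data/analyses-results/singles
mv Diag-excap01-12345678910 Diag-excap01-12345678910-2022-02-15
```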
7. Cleaning cluster and NSC
It is important to check storage capacity daily (`df -h`). There should be more than 25T free space on TSD and more than 150T free space (approx. 70% full) on NSC. Check the following locations:
- TSD: `/ess/p22/data/durable/production/`
- TSD: the directory defined by `DURABLE_PROD_REPO_PATH` in `/ess/p22/data/durable/production/sw/automation/tsd-import/src/filelock_exporter_api/filelock_exporter_api.py`
- NSC: `/boston/diag/`
If cleaning is needed, follow the instructions in Cleaning the disk spaces.
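For example, to check the locations listed above:

```bash
# Free space on the production file systems (TSD and NSC respectively)
df -h /ess/p22/data/durable/production/
df -h /boston/diag/
```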
8. Data compression
The bam files of exome and genome samples need to be compressed. See the procedure HTS Bioinf - Storage and security of sensitive data for how and when to do compression.
9. Update the annotation databases
If needed, the main responsible bioinformatician updates the annotation databases according to the procedures in HTS Bioinf - Update public databases. In case they are busy with production, the back-up bioinformatician can be asked to do it.
10. Lab stand-up
Both the main responsible bioinformatician and the back-up bioinformatician should attend the lab stand-up at 10:50 every Monday and Thursday.
Finish production
Cleaning the disk spaces (Do NOT use the following steps until vcpipe-utilities v1.0.0 is released and deployed on TSD/NSC)
The scripts will clean the following locations:
On NSC:
/boston/diag/production/{samples,analyses-work,analyses-results/singles,analyses-results/trios}
/boston/diag/transfer/{normal,high,urgent}/{samples,analyses-results/singles,analyses-results/trios,ella-incoming}
/boston/diag/nscDelivery/RUN_DELIVERY/*fastq.gz (if there is no fastq.gz file under a RUN_DELIVERY folder, the whole RUN_DELIVERY folder could be deleted)
On TSD:
/ess/p22/data/durable/production/data/{samples,analyses-work,analyses-results/singles,analyses-results/trios}
/ess/p22/data/durable/s3-api/production/{normal,high,urgent}/{samples,analyses-work,analyses-results/singles,analyses-results/trios,ella-incoming}
Please do the cleaning in the following order:
- On TSD, you need to log into any p22-submit VM and do `module load {python3 with version}` before running the scripts. (You could run `module avail python3` to find which version is available on TSD.)
- Please check https://gitlab.com/ousamg/docs/wiki/-/wikis/production/how_to_communicate_TSD for how to transfer files between the TSD and NSC servers. When uploading, please add `--group p22-diag-ous-bioinf-group` at the end of the command; the files will be transferred to TSD at `/ess/p22/data/durable/file-import/p22-diag-ous-bioinf-group`.
PRODUCTION_SW=/ess/p22/data/durable/production/sw/utils # on TSD
PRODUCTION_SW=/boston/diag/production/sw/utils # on NSC
SCRIPT_LOCATION=${PRODUCTION_SW}/vcpipe-utilities/src/clean
FILE_EXPORT=/ess/p22/data/durable/file-export/dev/{USER_FOLDER}
FILE_IMPORT=/ess/p22/data/durable/file-import/p22-diag-ous-bioinf-group
- on TSD: Create the file list json file on TSD (`p22-submit-dev`):
python3 ${SCRIPT_LOCATION}/createFileList.py \
--output createFileList_tsd.json > createFileList_tsd.log
- on NSC: Create the file list json file on NSC (`beta`); the `createFileList_nsc.json` needs to be transferred to TSD:
python3 ${SCRIPT_LOCATION}/createFileList.py \
--output createFileList_nsc.json > createFileList_nsc.log
- on TSD: Create the deleting commands on TSD to be used on TSD:
python3 ${SCRIPT_LOCATION}/createDeleteBash.py --input createFileList_tsd.json > createDeleteBash_tsd.bash
- on TSD: Create the deleting commands to be used on NSC; `createDeleteBash_nsc.bash` needs to be transferred to NSC at `/boston/diag/transfer/production/nsc_tsd_sync/clean/`:
python3 ${SCRIPT_LOCATION}/createDeleteBash.py \
--input ${FILE_IMPORT}/createFileList_nsc.json > ${FILE_EXPORT}/createDeleteBash_nsc.bash
- `${FILE_EXPORT}/createDeleteBash_nsc.bash` is the file to use on NSC once it has been transferred to the NSC server.
- on NSC: Run the delete bash script from `beta`. Important: the script must be run on `beta`, not on `sleipnir`, or it won't do the cleaning as expected.
- on TSD: Run the delete bash script
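A hedged sketch of running the generated scripts (the log file names are illustrative):

```bash
# On TSD
bash createDeleteBash_tsd.bash 2>&1 | tee createDeleteBash_tsd.log

# On NSC, from beta
bash /boston/diag/transfer/production/nsc_tsd_sync/clean/createDeleteBash_nsc.bash 2>&1 | tee createDeleteBash_nsc.log
```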
Updating Clarity
Go through the steps 'lims', 'processing' and 'qc fail' and make sure that all samples are in the right stage in Clarity.
Production services
Only processes not run by serviceuser need to be stopped at the end of the production shift, i.e. only the `nsc-exporter` process. How to stop this and all other services is described in the section below.
Transfer production duty
It is the responsibility of the main responsible bioinformatician to transfer the production duty to the next main responsible bioinformatician according to the production quarter plan. The two parties go through Clarity to make sure the knowledge is transferred and it is clear who will take care of which samples in the queue.
How to stop production services
Run the following command to have all the aliases available:
on TSD: source /ess/p22/data/durable/production/sw/prod-screen-commands.txt
on NSC: source /boston/diag/production/sw/nsc-screen-commands.txt
1. Stop lims-exporter-api and nsc-exporter after confirming processes are currently sleeping
- In the NSC network, log into the server `beta`: `ssh beta.ous.nsc.local` (`ssh 192.168.1.41`)
- Stop the lims-exporter-api:
  - If you're not the owner of the lims-exporter process: `touch {transfer}/sw/kill-limsexporter`
    Wait 5+ minutes after touching the kill file. Run `screen -list` to confirm that the process isn't running. Then remove the kill file.
  - If you are the owner of the lims-exporter process: `screen -X -S lims-exporter quit`
- Log into `sleipnir` from `beta`: `ssh sleipnir`
- Stop the nsc-exporter:
  - If you're not the owner of the nsc-exporter process: `touch {transfer}/sw/kill-nscexporter`
    Wait 5+ minutes after touching the kill file. Run `screen -list` to confirm that the process isn't running. Then remove the kill file.
  - If you are the owner of the nsc-exporter process: `screen -X -S nsc-exporter quit`
2. Stop filelock-exporter-api after confirming processes are currently sleeping
On TSD, log into the server:
If you're not the owner of the filelock-exporter process
Wait 5+ minutes after touching the kill file. Run `screen -list` to confirm that the process isn't running. Then remove the kill file.
If you are the owner of the filelock-exporter process
3. Stop webui and executor (NSC sequencing overview page on TSD) after confirming processes are currently sleeping
Note! Skip the NSC part if the NSC pipeline isn't running or skip the TSD part if the TSD pipeline isn't running.
If you're not the owner of the production processes you must create kill files to have the processes stop themselves.
For TSD and NSC, run:
Wait 5+ minutes after touching the kill files; check the executor log and the webui URL to confirm that the processes aren't running, then remove the kill files.
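A hedged way to do that check (the executor log path and the TSD webui address are taken from elsewhere in this document; adjust to the platform you are stopping):

```bash
# Tail the executor log to see whether it has stopped picking up work
tail -n 50 production/sw/logs/variantcalling/vcpipe/executor.log

# Check whether the webui still responds (example: TSD webui on p22-submit2)
curl -sSf http://p22-submit2:8080 >/dev/null && echo "webui still up" || echo "webui down"
```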
Otherwise run the normal stop commands described below. Remember to remove the kill files before starting the processes.
On TSD:
- log into p22-submit (`ssh p22-submit`):
- log into p22-submit2 (`ssh p22-submit2`):
- log into p22-submit2 (`ssh p22-submit2`) or other p22-submit nodes:
On NSC:
- log into diag-executor (`ssh diag-executor.ous.nsc.local`):
- log into diag-webui (`ssh diag-webui.ous.nsc.local`):
Errors and exceptions
How to redo demultiplexing
When sample names are wrong, or a new demultiplexing (re-demultiplexing) is needed for other reasons, proceed with the following steps in Clarity:
- Go to "PROJECTS & SAMPLES", search for the project, open and find one of the samples in the sequencing run which needs demultiplexing
- Open the sample, click requeue (the blue circle arrow next to the step) for the step "Demultiplexing and QC NSC", and click "Demultiplexing and QC NSC" to go to a new page
- Click "Run" for "Auto 10: Copy run", when this is done, click "Run" for "Auto 20: Prepare SampleSheet", waiting until it is done, refresh the page in the browser (to make sure nothing is running in the background).
- Click the file name in "Demultiplexing sample sheet" under 'Files' to download the file, correct the information in this sample sheet and save it as another file. Remove the file in "Demultiplexing sample sheet" by clicking the cross and upload the file with the correct information here.
- Remove the delivery folder containing the wrong files under /boston/diag/nscDelivery, change step "Auto 90. Delivery and triggers" and "Close when finish" from 'Yes' to 'No', and click "save" on top
- Click "Run" for the step "Auto 30. Demultiplexing"; it will automatically continue until "Auto 90". For a genome sequencing run, it can take around 2 hours to finish these steps. Refresh the browser to see which step is finished.
- After the above step is finished, check the files under /boston/diag/runs/demultiplexing/{RUN_FOLDER}/Data/Intensities/BaseCalls/{NSC_DELIVERY_FOLDER}. If they are right, change step "Auto 90. Delivery and triggers" and "Close when finish" back from 'No' to 'Yes', click "save" on top, and click "Run" on "Auto 90. Delivery and triggers". Note: you don't need to click 'Run' on 'Close when finished'. The delivery folder will appear in /boston/diag/nscDelivery, /boston/diag/runs/demultiplexing will be empty, and the samples will appear in the lims_exporter_api step in Clarity. If they are still wrong, talk to the production team for a possible solution.
How to import specific samples
By default all samples in the lims-exporter step in Clarity will be exported. If you only want to export specific samples, stop lims-exporter-api and start it again with a combination of any of these options:
- samples: only export these sample(s), e.g. 12345678910 or 12345678910,
- projects: only export samples in these project(s), e.g. Diag-excap172-2019-10-28 or Diag-excap172-2019-10-28, Diag-wgs22-2019-04-05
- priorities: only export samples with the following priorities, e.g. 2 or 2,3
These options will be remembered, so to make lims-exporter-api export all samples again, you need to stop lims-exporter-api, restart it without any options and let it run continuously.
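A hypothetical invocation sketch; the exact flag names and start command are assumptions, check the lims-exporter-api help for the real syntax:

```bash
# Hypothetical: restart lims-exporter-api restricted to one project and priorities 2 and 3
screen -dm -S lims-exporter {script-home}/lims-exporter-api.sh \
    --projects Diag-excap172-2019-10-28 --priorities 2,3
```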
How to switch between using NSC pipeline or TSD pipeline
Samples/analyses can target either the TSD pipeline or the NSC pipeline. The default pipeline is specified in the "default_platform" field of the /boston/diag/transfer/sw/lims_exporter_api.config file. If the default pipeline is "NSC", exome and wgs low-priority (priority 1) samples will still be sent to the TSD pipeline due to limited capacity at NSC. EKG and EHG samples are very quick to run, so low-priority ones can also run on the NSC pipeline.
The default pipeline can be overruled by adding project(s) and/or samples to the platform-whitelist-NSC.txt and platform-whitelist-TSD.txt files in the /boston/diag/transfer/sw folder.
In either of these two whitelist files, lines starting with # are treated as comment lines. The format is `<project>[-<sampleid>]`, one per line (see the example below).
If only the project is given, all samples of the project will be included. Use only the part before the 2nd dash of a complete project name, e.g. Diag-EKG210216 instead of Diag-EKG210216-2021-02-16.
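An illustrative whitelist file content following the format above (the sample ID is made up):

```
# comment lines start with #
Diag-EKG210216
Diag-excap172-12345678910
```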
Reanalysis will always be targeted for TSD pipeline.
NSC pipeline results (preprocessed folder(s) and ella-incoming folder) are continuously and automatically transferred to TSD by nsc-exporter.
The majority of analysis are run on TSD, but in some cases analysis might need to be run on NSC. These are:
Sample types | Priority | Description |
---|---|---|
exome | 2 and 3 | Priority is given by LIMS or communicated by the lab by other means. |
target gene panel | 1, 2 and 3 | Captured by target gene panels |
genome WGS Trio | 2 and 3 | Rapid genome / Hurtiggenom (priority is given by LIMS or through the lab). |
Situations to consider when deciding:
- The cluster is very busy (see cluster tools below)
- VMWare login down (log in not possible)
- /cluster disk not available
- The VMs running the automation system are not available (p22-submit, p22-submit2 or p22-submit-dev)
- The s3-api folders for transferring data are not available
- scheduled maintenance
- problems with licence for required tools (like Dragen)
Use the tools `pending` and `qsumm` to help decide on cluster capacity:
- `pending`: gives an indication of when the jobs owned by the bioinformatician in production will be launched.
- `qsumm`: gives an overview of all jobs pending or being processed in the slurm queue.
If the queue is still full by the end of the day, then the samples should be run on the backup pipeline.
How to update S3 API Key
The long-lived S3 API key must be updated yearly. This key was initially issued May 13th 2020. Updated by yvastr 5 May 2023. It needs to be updated before 5 May 2024.
The procedure for updating this key is:
- Run this command to generate a new key:
curl https://alt.api.tsd.usit.no/v1/p22/auth/clients/secret --request POST \ -H "Content-Type: application/json" \ --data '{ "client_id": " _our client_id_ ", "client_secret": " _our current api key_ " }'
Replace " our client_id " with what is in
/ess/p22/data/durable/api-client.md
file. Replace our current api key with the text in/boston/diag/transfer/sw/apikey_file.txt
file.Above command will print a json string with 3 key value pairs:
- Replace the text in /boston/diag/transfer/sw/s3api_key.txt with the "new_client_secret" value from the output.
Analysis is not imported into automation system
Start looking on the file system for:
- whether the files for the analysis in question are in production/analyses/{analysis_name}
- whether the corresponding sample(s) is in production/samples/{sample_names}
- whether there is a file called READY in both the {sample_name} and {analysis_name} folders
If none of them are present, proceed to investigate the logs in the filelock system for any clues (normally located in production/logs/tsd-import/filelock-exporter/{datetime}.log). Create a Gitlab issue to describe the problem; further follow-up will be monitored there.
If all of them are present, proceed to investigate the logs in the automation system for any clues as to why they haven't been imported into the system (see the New / unknown problem section below).
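For example, to search the filelock-exporter logs for the analysis (the analysis name is a placeholder):

```bash
grep -i "{analysis_name}" production/logs/tsd-import/filelock-exporter/*.log
```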
annopipe troubleshooting
Force delivery of basepipe/triopipe: a QC-failed sample will still start `annopipe`, but the `annopipe` analysis will have a 'QC failed' mark in the webui.
If `annopipe` crashed, you can go into the corresponding nextflow work folder under the analysis folder, check the `STATUS` file to find the crashed step, and then check the log file in the corresponding step folder.
If `annopipe` ran successfully (regardless of whether it is QC failed), the post command will copy the required files to /ess/p22/data/durable/production/ella/ella-prod/data/analyses/incoming.
If the sample did not appear in the ella database, e.g. the sample folder was not moved from the `incoming` folder to the `imported` folder, check the following:
- whether ella-production:ella-production-watcher is still running on the Supervisor page (p22-vc-ui-l:9001)
- the log file under /ess/p22/data/durable/production/ella/ella-prod/logs/prod-watcher.log (prefer not checking logs from the Supervisor page)
- view /ess/p22/data/durable/production/ella/ops-non-singularity/supervisor/run-prod-watcher.sh in a text editor to see whether the sample is excluded from importing to ella
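For example, to look for the sample in the ELLA import watcher log (the sample name is a placeholder):

```bash
grep "{sample_name}" /ess/p22/data/durable/production/ella/ella-prod/logs/prod-watcher.log
```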
New / unknown problem
Start by looking at the analysis' log file. Normally these are available in the automation system's UI, but in some cases the log in the database can be empty. In such a case, identify the analysis on the file system and look for the log file in its result folder: production/data/analyses-work/{analysis_name}/result/{datetime_of_result}/logs/stdout.log. It is also helpful to check whether the number of sequencing reads is too low for the analysis.
If that log doesn't contain any information, there has likely been a problem starting the analysis. Look into the log of the automation system, normally located in production/sw/logs/variantcalling/vcpipe/executor.log. `grep` for the analysis name to try to find the relevant section of the log, and if possible, check that the start time of the analysis in the UI matches the timestamp in the log.
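For example (the analysis name is a placeholder):

```bash
grep "{analysis_name}" production/sw/logs/variantcalling/vcpipe/executor.log
```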
To investigate the data output of the pipeline, look inside production/data/analyses-work/{analysis_name}/result/{datetime_of_result}/data/.
Create a Gitlab issue to describe the observed problem and consult with the rest of the bioinformatics group to find a resolution. All the follow-up will be monitored there.
How to convert bam files to fastq files
For some old samples, if the sample folder is not located in the sample folder archive, the sample folder needs to be created manually. The original sequencing data needs to be converted from the bam file used in variant calling in the original analysis (file.bam) using the following commands:
- Run `RevertSam` to convert `bam` to `sam`:
picard RevertSam I=file.bam O=file.sam RESTORE_ORIGINAL_QUALITIES=true SORT_ORDER=coordinate CREATE_INDEX=true
  If the `bam` file has already been compressed to a `cram` file, one should convert it back to `bam` before running the above command (see the sketch at the end of this section). The `GENOME_REFERENCE` should be the one used for compressing the `bam` file.
- Then convert the `sam` file into `fastq` (see the sketch at the end of this section).
The `file.R1.fastq` and `file.R2.fastq` are the corresponding read1 and read2 fastq files for the sample.
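Hedged sketches for the two commands referenced above, assuming standard samtools and Picard usage; file names follow the example above and should be adjusted:

```bash
# If the original bam was compressed to cram: convert back to bam first.
# GENOME_REFERENCE must be the reference that was used when compressing the bam.
samtools view -b -T GENOME_REFERENCE -o file.bam file.cram

# Convert the reverted sam into paired fastq files
picard SamToFastq I=file.sam FASTQ=file.R1.fastq SECOND_END_FASTQ=file.R2.fastq
```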
The sample configuration file (an example is attached and the required fields are described in How to update S3 API Key) needs to be created under the individual sample folder, and the fastq files and quality control results (fastqc folders) need to be copied into the individual sample folder as well. The structure of the folder is described in How to switch between using NSC pipeline or TSD pipeline.
Background
1. `sleipnir`

   Sleipnir is the dedicated transfer server. It is a mostly locked-down server, only connected to the file-lock on TSD by a dedicated, locked-down network channel. It only has access to /boston/diag/transfer.

2. `lims-exporter-api`
The lims-exporter-api exports samples from Clarity using the Clarity API, ordered by priority, creating samples/analyses inside a given repository.
The result will look like the following structure:
```
repo-analysis
└── Diag-excap01-
    └── Diag-excap01-123456789.analysis
    └── Diag-excap01-123456789-EEogPU-v
        └── Diag-excap01-123456789-EEogPU-v02.analysis
```

```
repo-sample
└── Diag-excap01-
    ├── Diag-excap01-123456789-EEogPU-v02-KIT-Av5_GCCAAT_L007_R1_001_fastqc.tar
    ├── Diag-excap01-123456789-EEogPU-v02-KIT-Av5_GCCAAT_L007_R2_001_fastqc.tar
    ├── Diag-excap01-123456789-EEogPU-v02-KIT-Av5_GCCAAT_L007_R1_001.fastq.gz
    ├── Diag-excap01-123456789-EEogPU-v02-KIT-Av5_GCCAAT_L007_R2_001.fastq.gz
    ├── Diag-excap01-123456789.sample
    └── LIMS_EXPORT_DONE
```
- Sample and analyses will be continuously exported from Clarity automatically by the lims-exporter-api.
- Lims-exporter-api exports all high priority samples.
- Lims-exporter-api will not export low priority samples when there are high priority samples to be exported.
- When there are no high priority samples to export, lims-exporter-api exports low priority samples little by little. This is to avoid the case when many low priority samples occupy nsc-exporter for too long and upcoming high priority samples are delayed.
- The taqman-source is needed for single samples: the TaqMan files are searched for a file containing the sample id, which is then parsed, and a fingerprint specific for the sample is created alongside the fastq.gz data files.
- The fastq.gz files are hardlinked from the original (to avoid copying).
- The log file is stored under /boston/diag/transfer/sw/logsAPI and a new file will be generated every time the lims-exporter-api is restarted.
- The required fields in the sample and analysis configuration files (`.sample` file and `.analysis` file)

  Required information in the sample configuration file (`.sample` files):
  - lane: in the sample `fastq` file name, e.g. “5"
  - reads under stats: obtained by counting the number of lines in the sample `fastq` files and dividing by 4
  - q30_bases_pct under stats: obtained from the file `Demultiplex_Stats.htm` under the run folder
  - sequencer_id: in the NSC delivery folder, e.g. “D00132"
  - flowcell_id: in the sample qc report file name, e.g. “C6HJJANXX"
  - all the information under reads
    The *path* should be the sample fastq file name. The `md5` is calculated by typing the following command in the terminal:
    ```bash
    md5sum FASTQ_FILE_NAME
    ```
    The *size* is calculated by typing the following command in the terminal:
    ```bash
    ls -l FASTQ_FILE_NAME
    ```
    Use the number before the date.
- project: in the NSC delivery folder, e.g. “Diag-excap41"
- project_date: in the NSC delivery folder, e.g. “2015-03-27"
- flowcell: in the NSC delivery folder, e.g. “B"
- sample_id: in the sample `fastq` file name, e.g. “12345678910"
- capturekit: converted from the information in the sample `fastq` file name, e.g. Av5 is converted to “agilent_sureselect_v05", wgs to "wgs"
- sequence_date: in the NSC delivery folder, e.g. “2015-04-14"
- name: combined project and sample_id delimited by the symbol `-`, e.g. “Diag-excap41-12345678910"
- taqman: the file name containing the SNP fingerprinting taqman results
  Required information in the analysis configuration file (`.analysis` files):
  - name:
    - **basepipe**: combined “project" and “sample_id" delimited by `-`, e.g. “Diag-excap41`-`12345678910"
    - **triopipe**: combined “project", “sample_id" and "TRIO" delimited by `-`, e.g. “Diag-excap41`-`12345678910`-`TRIO"
    - **annopipe**: combined “project", “sample_id", "TRIO", gene panel name and gene panel version delimited by `-`, e.g. “Diag-excap41`-`12345678910`-`TRIO`-`Mendel`-`v01"
  - samples:
    - basepipe: only one sample
    - triopipe and annopipe: three samples in trio
    The sample name should be the same as the name in the corresponding `.sample` file.
  - type:
    - basepipe: basepipe
    - triopipe: triopipe
    - annopipe: annopipe
  - taqman in params: equals `false`, only in the basepipe `.analysis` file
  - pedigree in params: only in the triopipe and annopipe `.analysis` files. For each of proband, father and mother, the sample and gender (male or female) need to be specified. The sample should be the same as the name in the corresponding `.sample` file.
  - genepanel in params: only in the annopipe `.analysis` file, combined gene panel name and gene panel version delimited by `_`, e.g. “Mendel`_`v01"
3. `nsc-exporter`

   The nsc-exporter transfers samples, analyses, preprocessed data (produced by the NSC pipeline) and `ella-incoming` (produced by the NSC pipeline) from NSC at /boston/diag/transfer/production/{urgent,high,normal}/{analyses,samples,preprocessed/{singles,trio},ella-incoming}/ to the TSD s3api endpoint at /tsd/p22/data/durable/s3api/. The nsc-exporter runs continuously and is priority based, meaning that urgent data are transferred before normal-priority data. The log file is stored under /boston/diag/transfer/sw/logsAPI and a new file will be generated at the beginning of every month.

   The nsc-exporter can be in different states:
   - Stopped - indicated by the marker file /boston/diag/transfer/sw/NSC-EXPORTER-STOPPED. This file is touched by nsc-exporter when it is stopped.
   - Running and busy - indicated by the marker file /boston/diag/transfer/sw/NSC-EXPORTER-ACTIVE. This file is touched by nsc-exporter when it is transferring data to TSD and removed when it is done.
   - Running and idle - indicated by no marker files, i.e. neither of the above 2 marker files exists.
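A quick way to check the nsc-exporter state from its marker files:

```bash
# Shows which marker file (if any) is present; no output means running and idle
ls /boston/diag/transfer/sw/NSC-EXPORTER-STOPPED /boston/diag/transfer/sw/NSC-EXPORTER-ACTIVE 2>/dev/null
```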
4. Transferring data from NSC to TSD

   The default route for production is lims-exporter-api.sh + nsc-exporter.sh + filelock-exporter-api.sh.

   Features: fully automated; priority based; backup pipelines automated; transferring backup pipeline results to TSD is also automated. This uses the S3 API to transfer data to TSD; data are written to the s3 api endpoint.
5. Strategy for choosing a single sample in reanalysis by lims-exporter-api when multiple samples match the sample ID

   The samples in the following projects will be ignored:
   - a predefined list of projects (e.g. test projects for testing various lab settings)
   - projects containing 'test'
   - reanalyse projects
   - whole genome sequencing projects
   - projects with an improper name (should be in the format: Diag-excapXX-YYYY-MM-DD)

   If there are still multiple samples:
   - choose the samples captured with the Av5 KIT
   - choose samples in the latest project

   If there are still multiple samples matching, or no samples were found, the lims-exporter-api will send the sample to 'Manager review'. When sending to 'Manager review', lims-exporter-api will include the whole list of projects and samples to help the lab find the correct one.
6. Order of reanalysis

   The request from lab engineers should contain the following information:
   - The sample ID in the previous analysis (for sample IDs from before Verso was used, the first nine digits are enough)
   - The gene panel name referred to in the reanalysis
   - The analysis type (trio or single), and the proband gender if the analysis type is trio
Other documents
- HTS - Overordnet beskrivelse av arbeidsflyt
- HTS Bioinf - Basepipe pipeline
- HTS Bioinf - Trio pipeline
- HTS Lab - TaqMan SNP-ID
- HTS - Mismatch between TaqMan SNP-ID and sequencing data
- HTS - Samples that fail QC in bioinformatic pipeline
- HTS - Use of NA samples for quality control