HTS Bioinf - Execution and monitoring of pipeline

Scope

This procedure explains how to perform daily bioinformatic production: monitoring and running the pipeline that analyses diagnostic HTS samples, and informing the units about errors and delivery.

Responsibility

Responsible person: A qualified bioinformatician on production duty


Overview - production flow

Common terms and locations

  • Automation system - The system automatically launching the pipelines for the incoming analyses, tracking their status in its own database. The name of the automation system project is vcpipe and the component responsible for running the analyses is named executor.
  • interpretations/ - The directory for delivering samples to the lab.
  • production/ - The directory for production data. It contains several subdirectories:

  • production/data/analyses-work - analyses imported into the system;

  • production/data/samples on NSC and /ess/p22/archive/no-backup/samples on TSD - samples imported into the system;
  • production/sw - software used in production;
  • production/logs - logs from the automation system and its UI;
  • production/tmp - a tmp folder for use with the TMPDIR environment variable, if the system requires it (e.g. low disk space on /tmp);
  • verifications folder on NSC - a location to place .bam (.bai) files as well as fingerprinting output when a fingerprinting test fails on a WGS sample;
  • serviceuser - a non-personal user for running processes. A special script is used to log in as this user. See {serviceuser login}.

This document refers to some common folders on the platforms where the pipeline is run. On NSC the pipeline relevant data is stored on the server boston, available as /boston/diag.

| Term | NSC | TSD |
|------|-----|-----|
| durable | - | /ess/p22/data/durable |
| ella | - | /ess/p22/data/durable/production/ella |
| script-home | /boston/diag/transfer/sw | /ess/p22/data/durable/production/sw/automation/tsd-import/script |
| production | /boston/diag/production | /ess/p22/data/durable/production |
| transfer | /boston/diag/transfer/production | /ess/p22/data/durable/s3-api/production |
| ella-import-folder | - | 'ella'/ella-prod/data/analyses/imported |

Planning production

The production coordinator, or, in their absence, the bioinformatics coordinator (see procedure HTS Bioinf Group roles), sets up a planned schedule every quarter (production quarter plan) and appoints two trained bioinformaticians for production duty: one as the main responsible bioinformatician, the other as the back-up bioinformatician for when the main responsible bioinformatician is not available (e.g. sick). Once decided, the plan should be updated on the webpage: https://gitlab.com/ousamg/docs/wiki/-/wikis/production/OnDuty_production

Start production

1. Registration of production duty

The main responsible bioinformatician should check whether the information about "Start Date", "Main responsible bioinfo." and "Back-up bioinfo." for the current production interval on the schedule scheme page is correct. If it is not, the main responsible bioinformatician should update the page with the correct information. All names are registered with Signaturkode (OUS user name).

2. Starting lims-exporter-api and nsc-exporter

  • In the NSC network, log into the server beta:

    ssh beta.ous.nsc.local
    {serviceuser login} beta
    
  • Start the lims-exporter-api:

    screen -dm -S lims-exporter {script-home}/lims-exporter-api.sh
    
  • Log into sleipnir from beta:

    ssh sleipnir
    
  • Start the nsc-exporter (as your personal user):

    screen -dm -S nsc-exporter {script-home}/nsc-exporter.sh
    

The setup of tacl for transfer of files to TSD requires 2FA, which the serviceuser doesn't have. The files that end up in TSD's transfer area are owned by nobody:p22-member-group regardless of the user that initiated the transfer.

3. Starting filelock-exporter-api

On TSD, log into p22-cluster-sync:

{serviceuser login} p22-cluster-sync

or p22-submit-dev when p22-cluster-sync is not available

screen -dm -S filelock-exporter /ess/p22/data/durable/production/sw/automation/filelock-exporter-api

4. Starting webui, executor and NSC sequencing overview page

On TSD:

  • log into p22-submit:

    {serviceuser}/login.sh p22-submit
    source {production}/sw/prod-screen-commands.txt
    prod-start-executor
    
  • log into p22-submit2:

    {serviceuser login} p22-submit2
    source {production}/sw/prod-screen-commands.txt
    prod-start-webui
    

The UI is running on port 8080; you can open it in a web browser at e.g. http://p22-submit2:8080

  • log into p22-submit2 or other p22 submit nodes:

    {serviceuser login} p22-submit2
    source {production}/sw/prod-screen-commands.txt
    prod-start-nsc-overview
    

On NSC:

  • log into diag-executor :

    {serviceuser login} diag-executor.ous.nsc.local
    source {production}/sw/nsc-screen-commands.txt
    nsc-start-executor
    
  • log into diag-webui:

    {serviceuser login} diag-webui.ous.nsc.local
    source {production}/sw/nsc-screen-commands.txt
    nsc-start-webui
    

    The UI is running on port 1234, so you may want to forward that port to your local machine:

    ssh diag-webui.ous.nsc.local -L 1234:localhost:1234
    

Everyday duties

1. Re-login to ensure service health

  • Check that production services are up every week (Monday)

    Log into all VMs that run production services using {serviceuser}/login.sh. The table below shows the suggested distribution of services. This may differ slightly if some VMs are unavailable at the time.

    | VM | Service |
    |----|---------|
    | p22-app-01 | anno |
    | p22-submit | executor |
    | p22-submit2 | webui, nsc-overview |
    | p22-submit-dev | backup for other VMs |
    | p22-cluster-sync | filelock |
    | p22-ella-01 | ELLA |
  • Login Kerberos key on TSD for individual users

    Access to login or cluster nodes requires a valid Kerberos ticket; an expired or invalid ticket will deny access.

    The serviceuser, however, lacks this ticket, so this issue only applies to individual users.

    Run klist on the command line on the TSD login node. If the ticket is close to expiring, log in to the server again to renew it (see the sketch below).
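    A minimal check looks like this (the output format may vary between systems):

    klist
    # look at the ticket expiry time in the output; if it is close to expiring,
    # log out and log in to the node again to renew the ticket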

2. Check status of sample processing

While a project is in progress, check the automation system user interface (webui) a few times during the day for any failing analyses or other anomalies.

Check that the samples/analyses are imported into the system and that they start running. Compare the number of new analyses in the system to the number expected from the new project. Note that if a sample is referred to 'custom' or 'Baerer' gene panel, there will be no annopipe analysis folder for the sample.

For the EKG Blank sample: if the number of reads is over the threshold, only the Blank sample will be processed by the lims-exporter-api; the other samples from the same project will be sent to Manager review in Clarity. The Blank sample will be analysed first, and the results from both basepipe and annopipe need to be force delivered from the UI. EKG needs to be informed when the results are ready, and they will decide whether to continue with the other sample analyses or not.

If there are any failing analyses or other anomalies in the processing, see exceptions below for detail. If it is not covered, investigate the issue (see below) and raise the issue with the bioinformatics group immediately (and - if relevant - with other concerned parties).

3. Failing QC or Failing SNP fingerprinting test

If the QC fails, the pipeline will continue running and the analysis is marked as "QC Failed". Investigate the criteria that failed by looking in the "QC" section of the analysis in the UI. The basepipe and triopipe processed data will not be copied to the corresponding preprocessed directories.

Samples/runs that fail QC must be discussed with the lab. The solutions for some typical scenarios are described in the procedure HTS - Samples that fail QC in bioinformatic pipeline. Relevant details must be documented in Clarity. If the final decision is to deliver results, this can be done by clicking "Deliver analysis" under "Post" of the analysis in the UI.

If the SNP fingerprinting test fails, the pipeline will stop running and the analysis is marked as "Failed". Investigate how many samples are affected and whether the cause is a too low number of sequencing reads (especially for EKG target sequencing samples). If the number of mismatching sites is larger than 3 and other analyses from the same project are still running, you may have to wait for them to complete.

If the lab wants to create a new taqman/fingerprint file and re-run the pipeline, you must add the sample name (Diag-Project-sampleID) to the cleaning whitelists (TSD: {production}/sw; NSC: /boston/diag/transfer/sw). This keeps the sample files on NSC and TSD, and LIMS Exporter will automatically pick up the analyses when the lab adds them back to the LIMS Exporter queue in Clarity.

If a new TaqMan SNP-ID result is generated, make sure the updated TaqMan SNP-ID result is copied to production/samples/{sample_name} folder (usually taken care of by the filelock), and then re-run the analysis through the webUI (by clicking "Resume analysis").

Generate a summary of failed and passed samples and inform the lab as soon as possible. See procedure HTS – Mismatch between TaqMan SNP-ID and sequencing data for further investigations.

4. Result delivery and Clarity update

When pipeline results for a project are ready, the actions to take differ depending on which unit owns the project. The results will normally be imported automatically into ELLA a few hours after the pipeline has successfully finished.

The sample can be moved to the next step in Clarity only when:

  • the analysis on the sample is in the ELLA database, or
  • if no analysis (e.g. gene panel) is referred to the sample, the sample is available in sample-repo-prod/samples.json

After confirming that the samples are in the ELLA database (see HTS Bioinf - ELLA core production for how to access the ELLA database), send a delivery email to the respective unit:

  • EKG ('EKG' in the project name): EKG mailing list at OUS and diag-lab and bioinf-prod at UiO
  • EGG ('excap' or 'wgs' in the project name): diag-lab and bioinf-prod at UiO
  • EHG ('EHG' in the project name or the sample is referred to Cardio gene panel): ehg-hts and bioinf-prod at UiO

When there are errors, send email to the same recipients.

For priority 2 and 3 samples, once the samples are in ELLA, a delivery email should be sent as soon as possible.

5. Handling control samples

When a control sample (sample NA12878, HG002, HG003, HG004) is in the project, the trend analyses should be carried out by the bioinformatician on production duty as described in the procedure HTS - Use of NA samples for quality control.

6. Reanalysis of samples

All reanalyses in the lims-exporter queue in Clarity need a rerun of the basepipe pipeline, either to make basepipe results available or to make the analysis consistent with that of the newly sequenced samples.

All the analyses folders will be automatically generated by lims-exporter-api.

If the reanalysis type is not a hybrid trio, confirm that the folder is not available in analyses-results/singles or analyses-results/trios. To do that, check whether the analysis name is present as an entry in /ess/p22/data/durable/production/anno/sample-repo-prod/samples.json. If it is in samples.json, notify the lab that the sample can be reanalysed directly from ELLA. If it is not, or it needs a rerun of basepipe (for these cases, see the procedure HTS - Custom genpanel og reanalyse av sekvensdata), do the following steps:

  • An empty READY file must be present in the individual sample folder, and the folder needs full permissions for the group p22-member-group.
  • The sample folder archive is located on TSD: /ess/p22/archive/no-backup/samples.

  • If basepipe or triopipe results already exist in /ess/p22/data/durable/production/data/analyses-results/singles or /ess/p22/data/durable/production/data/analyses-results/trios, the folder containing the previous results needs to be renamed from {folder_name} to {folder_name}-YYYY-MM-DD, e.g. {folder_name}-2022-02-15, so the previous results will not be overwritten (see the sketch after this list). Do NOT use the cpSample.py script until vcpipe-utilities v1.0.0 is released and deployed on TSD/NSC. You can run the following command to print out the move commands: python3 /ess/p22/data/durable/production/sw/utils/vcpipe-utilities/src/production/cpSample.py

    You can add --blacklist with project names to ignore samples from those projects, e.g. --blacklist wgs158,wgs159 to skip all analyses in wgs158 and wgs159.

  • If the basepipe, triopipe or annopipe analysis has already been registered in the production database, the executor will not start the analysis automatically. It has to be started through the webui.
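A minimal sketch of the manual renaming on TSD (folder name and date are placeholders):

    cd /ess/p22/data/durable/production/data/analyses-results/singles
    # keep the previous results under a dated name so the rerun does not overwrite them
    mv Diag-excap41-12345678910 Diag-excap41-12345678910-2022-02-15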

7. Cleaning cluster and NSC

It is important to check storage capacity daily (df -h). There should be more than 25T free space on TSD and more than 150T free space (approx. 70% full) on NSC. Check the following locations (see the example after the list):

  • TSD: /ess/p22/data/durable/production/
  • TSD: the directory defined by DURABLE_PROD_REPO_PATH at /ess/p22/data/durable/production/sw/automation/tsd-import/src/filelock_exporter_api/filelock_exporter_api.py
  • NSC: /boston/diag/
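For example:

    df -h /ess/p22/data/durable/production/   # on TSD
    df -h /boston/diag/                       # on NSC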

If cleaning is needed, follow the instructions in Cleaning the disk spaces.

8. Data compression

The bam files of exome and genome samples need to be compressed. See the procedure HTS Bioinf - Storage and security of sensitive data for how and when to do compression.

9. Update the annotation databases

If needed, the main responsible bioinformatician updates the annotation databases according to the procedures in HTS Bioinf - Update public databases. If they are busy with production, the back-up bioinformatician can be asked to do it.

10. Lab stand-up

Both the main responsible bioinformatician and the back-up bioinformatician should attend the lab stand-up at 10:50 every Monday and Thursday.

Finish production

Cleaning the disk spaces (Do NOT use the following steps until vcpipe-utilities v1.0.0 is released and deployed on TSD/NSC)

The scripts will clean the following locations:

On NSC:

  • /boston/diag/production/{samples,analyses-work,analyses-results/singles,analyses-results/trios}
  • /boston/diag/transfer/{normal,high,urgent}/{samples,analyses-results/singles,analyses-results/trios,ella-incoming}
  • /boston/diag/nscDelivery/RUN_DELIVERY/*fastq.gz (if there is no fastq.gz file under a RUN_DELIVERY folder, the whole RUN_DELIVERY folder can be deleted; see the sketch after this bullet)
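    A sketch for listing RUN_DELIVERY folders without any remaining fastq.gz files (candidates for deletion; verify manually before removing anything):

    for d in /boston/diag/nscDelivery/*/ ; do
        [ -z "$(find "$d" -name '*fastq.gz' -print -quit)" ] && echo "$d"
    done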

On TSD:

  • /ess/p22/data/durable/production/data/{samples,analyses-work,analyses-results/singles,analyses-results/trios}
  • /ess/p22/data/durable/s3-api/production/{normal,high,urgent}/{samples,analyses-work,analyses-results/singles,analyses-results/trios,ella-incoming}

Please do the cleaning in the following order:

  • On TSD, you need to log into one of the p22-submit VMs and run module load {python3 with version} before running the scripts. (You can run module avail python3 to find which version is available on TSD.)

  • Please check https://gitlab.com/ousamg/docs/wiki/-/wikis/production/how_to_communicate_TSD for how to transfer files between TSD and the NSC server. When uploading, add --group p22-diag-ous-bioinf-group at the end of the command; the files will then be transferred to TSD at /ess/p22/data/durable/file-import/p22-diag-ous-bioinf-group.

PRODUCTION_SW=/ess/p22/data/durable/production/sw/utils # on TSD
PRODUCTION_SW=/boston/diag/production/sw/utils # on NSC
SCRIPT_LOCATION=${PRODUCTION_SW}/vcpipe-utilities/src/clean
FILE_EXPORT=/ess/p22/data/durable/file-export/dev/{USER_FOLDER}
FILE_IMPORT=/ess/p22/data/durable/file-import/p22-diag-ous-bioinf-group
  1. on TSD: Create file list json file on TSD (p22-submit-dev):
    python3 ${SCRIPT_LOCATION}/createFileList.py \
         --output createFileList_tsd.json > createFileList_tsd.log
  2. on NSC: Create file list json file on NSC (beta); createFileList_nsc.json needs to be transferred to TSD:
    python3 ${SCRIPT_LOCATION}/createFileList.py \
        --output createFileList_nsc.json > createFileList_nsc.log
  3. on TSD: Create the deleting commands to be used on TSD:
    python3 ${SCRIPT_LOCATION}/createDeleteBash.py --input createFileList_tsd.json > createDeleteBash_tsd.bash
  4. on TSD: Create the deleting commands to be used on NSC; createDeleteBash_nsc.bash needs to be transferred to NSC at /boston/diag/transfer/production/nsc_tsd_sync/clean/:

    python3 ${SCRIPT_LOCATION}/createDeleteBash.py \
        --input ${FILE_IMPORT}/createFileList_nsc.json > ${FILE_EXPORT}/createDeleteBash_nsc.bash
    Please delete ${FILE_EXPORT}/createDeleteBash_nsc.bash once it has been transferred to the NSC server.

  5. on NSC: Run the delete bash script from beta. Important: the script must be run on beta, not on sleipnir, or it won't do the cleaning as expected.
    bash /boston/diag/transfer/production/nsc_tsd_sync/clean/createDeleteBash_nsc.bash
  6. on TSD: Run the delete bash script:
    bash createDeleteBash_tsd.bash

Updating Clarity

Go through the steps 'lims', 'processing' and 'qc fail' and make sure that all samples are in the right stage in Clarity.

Production services

Only processes not run by the serviceuser need to be stopped at the end of the production shift, i.e. only the nsc-exporter process. How to stop this and all other services is described in the section below.

Transfer production duty

It is the responsibility of the main responsible bioinformatician to transfer the production duty to the next main responsible bioinformatician according to the production quarter plan. The two parties go through Clarity to make sure the knowledge is transferred and it is clear who will take care of which samples in the queue.

How to stop production services

Run the following command to make all the aliases available:

on TSD: source /ess/p22/data/durable/production/sw/prod-screen-commands.txt

on NSC: source /boston/diag/production/sw/nsc-screen-commands.txt

1. Stop lims-exporter-api and nsc-exporter after confirming processes are currently sleeping

  • In the NSC network, log into the server beta: ssh beta.ous.nsc.local (ssh 192.168.1.41)
  • Stop the lims-exporter-api:
    • If you're not the owner of the lims-exporter process: touch {transfer}/sw/kill-limsexporter. Wait 5+ minutes after touching the kill file. Run screen -list to confirm that the process isn't running. Then remove the kill file.
    • If you are the owner of the lims-exporter process: screen -X -S lims-exporter quit
  • Log into the sleipnir from beta: ssh sleipnir
  • Stop the nsc-exporter:
    • If you're not the owner of the nsc-exporter process: touch {transfer}/sw/kill-nscexporter Wait 5+ minutes after touching the kill file. Run screen -list to confirm that the process isn't running. Then remove the killfile.
    • If you are the owner of the nsc-exporter process: screen -X -S nsc-exporter quit

2. Stop filelock-exporter-api after confirming processes are currently sleeping

On TSD, log into the server:

ssh p22-cluster-sync # (or p22-submit-dev when p22-cluster-sync is not available)

If you're not the owner of the filelock-exporter process

touch production/sw/killfilelock

Wait 5+ minutes after touching the kill file. Run screen -list to confirm that the process isn't running. Then remove the killfile.

If you are the owner of the filelock-exporter process

screen -X -S filelock-exporter quit

3. Stop webui and executor (and the NSC sequencing overview page on TSD) after confirming processes are currently sleeping

Note! Skip the NSC part if the NSC pipeline isn't running or skip the TSD part if the TSD pipeline isn't running.

If you're not the owner of the production processes you must create kill files to have the processes stop themselves.

For TSD and NSC, run:

touch {production}/sw/kill{executor,webui}

Wait 5+ minutes after touching the kill files; check the executor log and the webui URL to confirm that the processes aren't running, then remove the kill files.

Otherwise run the normal stop commands described below. Remember to remove the kill files before starting the processes.

On TSD:

  • log into p22-submit (ssh p22-submit):

    prod-stop-executor

  • log into p22-submit2 (ssh p22-submit2):

    prod-stop-webui

  • log into p22-submit2 (ssh p22-submit2) or other p22-submit nodes:

    prod-stop-nsc-overview
    

On NSC:

  • log into diag-executor (ssh diag-executor.ous.nsc.local):

    nsc-stop-executor

  • log into diag-webui (ssh diag-webui.ous.nsc.local):

    nsc-stop-webui
    

Errors and exceptions

How to redo demultiplexing

When sample names are wrong, or a new demultiplexing (re-demultiplexing) is needed for other reasons, proceed with the following steps in Clarity:

  • Go to "PROJECTS & SAMPLES", search for the project, open and find one of the samples in the sequencing run which needs demultiplexing
  • Open the sample, click requeue (the blue circle arrow beside the step) for the step "Demultiplexing and QC NSC", and click "Demultiplexing and QC NSC" to go to a new page
  • Click "Run" for "Auto 10: Copy run". When this is done, click "Run" for "Auto 20: Prepare SampleSheet", wait until it is done, then refresh the page in the browser (to make sure nothing is running in the background).
  • Click the file name in "Demultiplexing sample sheet" under 'Files' to download the file, correct the information in this sample sheet and save it as a new file. Remove the file in "Demultiplexing sample sheet" by clicking the cross and upload the corrected file here.
  • Remove the delivery folder containing the wrong files under /boston/diag/nscDelivery, change the steps "Auto 90. Delivery and triggers" and "Close when finish" from 'Yes' to 'No', and click "save" at the top
  • Click "Run" for the step "Auto 30. Demultiplexing"; it will automatically continue until "Auto 90". For a genome sequencing run, it can take around 2 hours to finish these steps. Refresh the browser to see which step is finished.
  • After the above step is finished, check the files under /boston/diag/runs/demultiplexing/{RUN_FOLDER}/Data/Intensities/BaseCalls/{NSC_DELIVERY_FOLDER}. If they are right, change the steps "Auto 90. Delivery and triggers" and "Close when finish" back from 'No' to 'Yes' and click "save" at the top. Then click "Run" on "Auto 90. Delivery and triggers".

    Note: you don't need to click 'Run' on 'Close when finished'. The delivery folder will appear in /boston/diag/nscDelivery, /boston/diag/runs/demultiplexing will be empty, and the samples will appear in the lims_exporter_api step in Clarity. If they are still wrong, talk to the production team for a possible solution.

How to import specific samples

By default all samples in the lims-exporter step in Clarity will be exported. If you only want to export specific samples, stop lims-exporter-api and start it again with a combination of any of these options:

  • samples: only export these sample(s), e.g. 12345678910 or a comma-separated list of sample IDs
  • projects: only export samples in these project(s), e.g. Diag-excap172-2019-10-28 or Diag-excap172-2019-10-28, Diag-wgs22-2019-04-05
  • priorities: only export samples with the following priorities, e.g. 2 or 2,3

These options will be remembered. So to make lims-exporter-api export all samples again, you need to stop lims-exporter-api and restart it without any options and let it run continuously.
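A hypothetical invocation (the option names below are assumptions, not confirmed by this procedure; check the help of lims-exporter-api.sh before use), restricting the export to two projects with priorities 2 and 3:

    screen -dm -S lims-exporter {script-home}/lims-exporter-api.sh \
        --projects Diag-excap172-2019-10-28,Diag-wgs22-2019-04-05 \
        --priorities 2,3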

How to switch between using NSC pipeline or TSD pipeline

Samples/analyses can be targeted at either the TSD pipeline or the NSC pipeline. The default pipeline is specified in the "default_platform" field of the /boston/diag/transfer/sw/lims_exporter_api.config file. If the default pipeline is "NSC", exome and wgs low priority (priority 1) samples will still be sent to the TSD pipeline due to limited capacity at NSC. EKG and EHG samples are very quick to run, so low priority ones can also run on the NSC pipeline.

The default pipeline can be overruled by adding project(s) and/or sample(s) to the platform-whitelist-NSC.txt and platform-whitelist-TSD.txt files in the /boston/diag/transfer/sw folder. In either of these two whitelist files, lines starting with # are treated as comment lines. The format is <project>[-<sampleid>], one per line, e.g.

Diag-wgs72
Diag-wgs72-12345678901

If only project is given, all samples of the project will be included. Use only the part before the 2nd dash of a complete project name, e.g. Diag-EKG210216 instead of Diag-EKG210216-2021-02-16.

Reanalysis will always be targeted for TSD pipeline.

NSC pipeline results (preprocessed folder(s) and ella-incoming folder) are continuously and automatically transferred to TSD by nsc-exporter.

The majority of analyses are run on TSD, but in some cases analyses might need to be run on NSC. These are:

| Sample type | Priority | Description |
|-------------|----------|-------------|
| exome | 2 and 3 | Priority is given by LIMS or communicated by the lab by other means. |
| target gene panel | 1, 2 and 3 | Captured by target gene panels |
| genome WGS Trio | 2 and 3 | Rapid genome / Hurtiggenom (priority is given by LIMS or through the lab). |

Situations to consider when deciding:

  • The cluster is very busy (see cluster tools below)
  • VMWare login down (log in not possible)
  • /cluster disk not available
  • The VMs running the automation system are not available (p22-submit, p22-submit2 or p22-submit-dev)
  • The s3-api folders for transferring data are not available
  • scheduled maintenance
  • problems with licence for required tools (like Dragen)

Use the tools pending and qsumm to help decide on cluster capacity:

  • pending : gives an indication of when the jobs owned by the bioinformatician in production will be launched.
  • qsumm : gives an overview of all jobs pending or being processed in the slurm queue.

If the queue is still full by the end of the day, then the samples should be run on the backup pipeline.

How to update S3 API Key

The long-lived S3 API key must be updated yearly. It was initially issued on May 13th 2020 and last updated by yvastr on 5 May 2023; it needs to be updated again before 5 May 2024.

The procedure for updating this key is:

  1. run this command to generate a new key:

    curl https://alt.api.tsd.usit.no/v1/p22/auth/clients/secret \
        --request POST \
        -H "Content-Type: application/json" \
        --data '{
                "client_id": " _our client_id_ ",
                "client_secret": " _our current api key_ "
                }'
    

    Replace " our client_id " with what is in /ess/p22/data/durable/api-client.md file. Replace our current api key with the text in /boston/diag/transfer/sw/apikey_file.txt file.

    The above command will print a JSON string with 3 key-value pairs:

    {
        "client_id": ...client id here...,
        "old_client_secret": ...old client_secret here...,
        "new_client_secret": ...new client_secret here...
    }
    
  2. Replace the text in /boston/diag/transfer/sw/s3api_key.txt with the "new_client_secret" value above.
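     For example, assuming the JSON response from step 1 was saved to response.json and jq is available (both are assumptions; any JSON-aware tool works):

     jq -r '.new_client_secret' response.json > /boston/diag/transfer/sw/s3api_key.txt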

Analysis is not imported into automation system

Start looking on the file system for:

  • whether the files for the analysis in question are in production/analyses/{analysis_name}
  • whether the corresponding sample(s) is in production/samples/{sample_names}
  • whether there is a file called READY in both the {sample_name} and {analysis_name} folders

If none of them is present, proceed to investigate the logs in the filelock system for any clues (normally located in production/logs/tsd-import/filelock-exporter/{datetime}.log). Create a Gitlab issue to describe the problem and further follow-up will be monitored there.

If all of them are present, proceed to investigate the logs in the automation system for any clues as to why they haven't been imported into the system (see the New / unknown problem section below).
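A minimal sketch of these checks, using the placeholders above:

    # are the analysis/sample folders and their READY files in place?
    ls production/analyses/{analysis_name}/READY production/samples/{sample_name}/READY

    # any clues in the filelock-exporter logs?
    grep -i "{analysis_name}" production/logs/tsd-import/filelock-exporter/*.log | tail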

annopipe troubleshooting

Force delivering a QC-failed basepipe/triopipe sample will start annopipe, but the annopipe analysis will have a 'QC failed' mark in the webui.

If annopipe crashed, go into the corresponding Nextflow work folder under the analysis folder. Check the STATUS file to find the crashed step, then check the log file in the corresponding step folder.

If annopipe ran successfully (regardless of whether it is QC failed), the post command will copy the required files to /ess/p22/data/durable/production/ella/ella-prod/data/analyses/incoming.

If the sample does not appear in the ELLA database, e.g. the sample folder was not moved from the incoming folder to the imported folder, check the following:

  • whether ella-production:ella-production-watcher is still running on the Supervisor page (p22-vc-ui-l:9001)
  • the log file under /ess/p22/data/durable/production/ella/ella-prod/logs/prod-watcher.log (prefer not checking logs from Supervisor page)
  • view /ess/p22/data/durable/production/ella/ops-non-singularity/supervisor/run-prod-watcher.sh in a text editor to see whether the sample is excluded from importing to ella

New / unknown problem

Start with looking at the analysis' log file. Normally, these are available in the automation system's UI, but in some cases the log in the database could be empty. In such a case, identify the analysis on the file system, and look for the log file in its result folder: production/data/analyses-work/{analysis_name}/result/{datetime_of_result}/logs/stdout.log. It is also helpful to check whether the number of sequencing reads is too low for the analysis.

If that log doesn't contain any information, there has likely been a problem starting the analysis. Look into the log of the automation system, normally located in production/sw/logs/variantcalling/vcpipe/executor.log. Grep for the analysis name to find the relevant section of the log, and if possible, check that the start time of the analysis in the UI matches the timestamp in the log.
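For example (the analysis name is a placeholder):

    grep -n "{analysis_name}" production/sw/logs/variantcalling/vcpipe/executor.log | tail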

To investigate the data output of the pipeline, look inside production/data/analyses-work/{analysis_name}/result/{datetime_of_result}/data/.

Create a Gitlab issue to describe the observed problem and consult with the rest of the bioinformatics group to find a resolution. All the follow-up will be monitored there.

How to convert bam files to fastq files

For some old samples, if the sample folder is not located in the sample folder archive, the sample folder needs to be created manually. The original sequencing data needs to be recreated from the bam file used in variant calling in the original analysis (file.bam) using the following commands:

  1. Run RevertSam to convert bam to sam

    picard RevertSam \
        I=file.bam \
        O=file.sam \
        RESTORE_ORIGINAL_QUALITIES=true \
        SORT_ORDER=coordinate \
        CREATE_INDEX=true
    

    If the bam file has already been compressed to a cram file, perform the following command before the above command:

    samtools view -b -o file.bam -T GENOME_REFERENCE file.cram
    

    The GENOME_REFERENCE should be the one used for compressing the bam file.

  2. Then convert the sam file into fastq

    picard SamToFastq \
        I=file.sam \
        FASTQ=file.R1.fastq \
        SECOND_END_FASTQ=file.R2.fastq \
        UNPAIRED_FASTQ=file.R0.fastq
    

The file.R1.fastq and file.R2.fastq are the corresponding read1 and read2 fastq files for the sample.

The sample configuration file needs to be created under the individual sample folder (an example is attached; the required fields are described in the Background section below), and the fastq files and quality control results (fastqc folders) need to be copied into the individual sample folder as well. The structure of the folder is described in the Background section below (under lims-exporter-api).

Background

  1. sleipnir

    Sleipnir is the dedicated transfer server. It's a mostly locked-down server, only connected to the file-lock on TSD by a dedicated, locked-down network channel. It only has access to /boston/diag/transfer.

  2. lims-exporter-api

    The lims-exporter-api exports samples from Clarity using the Clarity API, ordered by priority, creating samples/analyses inside a given repository.

    The result will look like the following structure:

    repo-analysis
      └── Diag-excap01-
        └── Diag-excap01-123456789.analysis
      └── Diag-excap01-123456789-EEogPU-v
        └── Diag-excap01-123456789-EEogPU-v02.analysis
    
    repo-sample
      └── Diag-excap01-
      ├── Diag-excap01-123456789-EEogPU-v02-KIT-Av5_GCCAAT_L007_R1_001_fastqc.tar
      ├── Diag-excap01-123456789-EEogPU-v02-KIT-Av5_GCCAAT_L007_R2_001_fastqc.tar
      ├── Diag-excap01-123456789-EEogPU-v02-KIT-Av5_GCCAAT_L007_R1_001.fastq.gz
      ├── Diag-excap01-123456789-EEogPU-v02-KIT-Av5_GCCAAT_L007_R2_001.fastq.gz
      ├── Diag-excap01-123456789.sample
     └── LIMS_EXPORT_DONE
    
    • Sample and analyses will be continuously exported from Clarity automatically by the lims-exporter-api.
    • Lims-exporter-api exports all high priority samples.
    • Lims-exporter-api will not export low priority samples when there are high priority samples to be exported.
    • When there are no high priority samples to export, lims-exporter-api exports low priority samples little by little. This is to avoid the case when many low priority samples occupy nsc-exporter for too long and upcoming high priority samples are delayed.
    • The taqman-source is needed for single samples: the TaqMan files are searched for a file containing the sample id, which is then parsed, and a fingerprint specific to the sample is created alongside the fastq.gz data files.
    • The fastq.gz files are hardlinked from the original (to avoid copying).
    • The log file is stored under /boston/diag/transfer/sw/logsAPI and a new file will be generated every time when the lims-exporter-api is restarted.
  3. The required fields in sample and analysis configuration file (.sample file and .analysis file)

    Required information in the sample configuration file (.sample files):

    • lane: in the sample fastq file name, e.g. “5"
    • reads under stats: obtained by counting the number of lines in the sample fastq files and dividing by 4
    • q30_bases_pct under stats: obtained from the file Demultiplex_Stats.htm under the run folder
    • sequencer_id: in the NSC delivery folder, e.g. “D00132"
    • flowcell_id: in the sample qc report file name, e.g. “C6HJJANXX"
    • all the information under reads

    The *path* should be the sample fastq file name.
    
    The `md5` is calculated by typing the following command in the terminal:
    
    ```bash
    md5sum FASTQ_FILE_NAME
    ```
    
    The *size* is calculated by typing the following command in the terminal:
    
    ```bash
    ls -l FASTQ_FILE_NAME
    ```
    
    Use the number before the date (the file size in bytes).
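    Similarly, the *reads* count under stats can be obtained by counting the lines in the fastq file and dividing by 4, e.g. for a gzipped fastq file (a sketch):
    
    ```bash
    echo $(( $(zcat FASTQ_FILE_NAME | wc -l) / 4 ))
    ```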
    
    • project: in the NSC delivery folder, e.g. “Diag-excap41"
    • project_date: in the NSC delivery folder, e.g. “2015-03-27"
    • flowcell: in the NSC delivery folder, e.g. “B"
    • sample_id: in the sample fastq file name, e.g. “12345678910"
    • capturekit: converted from the information in the sample fastq file name, e.g. Av5 is converted to “agilent_sureselect_v05", wgs to "wgs"
    • sequence_date: in the NSC delivery folder, e.g. “2015-04-14"
    • name: combined project and sample_id delimited by symbol -, e.g. “Diag-excap41-12345678910"
    • taqman: the file name containing SNP fingerprinting taqman results

    Required information in the analysis configuration file (.analysis files):

    • name:

      **basepipe**: combined “project" and “sample_id" delimited by `-`, e.g. “Diag-excap41-12345678910"

      **triopipe**: combined “project", “sample_id" and "TRIO" delimited by `-`, e.g. “Diag-excap41-12345678910-TRIO"

      **annopipe**: combined “project", “sample_id", "TRIO", gene panel name and gene panel version delimited by `-`, e.g. “Diag-excap41-12345678910-TRIO-Mendel-v01"
    
    • samples:

      basepipe: only one sample

      triopipe and annopipe: three samples in trio

      The sample name should be the same as the name in the corresponding .sample file.

    • type:

      basepipe: basepipe

      triopipe: triopipe

      annopipe: annopipe

    • taqman in params: equals to false, only in basepipe .analysis file

    • pedigree in params: only in triopipe and annopipe .analysis files. For each of the proband, father and mother, the sample and gender (male or female) need to be specified. The sample should be the same as the name in the corresponding .sample file.
    • genepanel in params: only in annopipe .analysis file, combined gene panel name and gene panel version delimited by _, e.g.“Mendel_v01"
  4. nsc-exporter

    The nsc-exporter transfers samples, analyses, preprocessed (produced by NSC pipeline) and ella-incoming (produced by NSC pipeline) from NSC at:

    /boston/diag/transfer/production/{urgent,high,normal}/{analyses,samples,preprocessed/{singles,trio},ella-incoming}/
    

    to the TSD s3api endpoint at /tsd/p22/data/durable/s3api/. The nsc-exporter can run continuously and is priority based, meaning that urgent data are transferred before normal priority data. The log file is stored under /boston/diag/transfer/sw/logsAPI and a new file will be generated at the beginning of every month.

    The nsc-exporter can be in different states:

    • Stopped - indicated by the marker file /boston/diag/transfer/sw/NSC-EXPORTER-STOPPED. This file is touched by nsc-exporter when it is stopped.
    • Running and busy - indicated by the marker file /boston/diag/transfer/sw/NSC-EXPORTER-ACTIVE. This file is touched by nsc-exporter when it is transferring data to TSD and removed when it is done.
    • Running and idle - indicated by no marker files, i.e. neither of the above two marker files exists.
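    A quick way to read off the current state from the command line (a sketch; no output means running and idle):

    ls /boston/diag/transfer/sw/NSC-EXPORTER-STOPPED /boston/diag/transfer/sw/NSC-EXPORTER-ACTIVE 2>/dev/null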

  5. Transferring data from NSC to TSD

    The default route for production is

    lims-exporter-api.sh + nsc-exporter.sh + filelock-exporter-api.sh

    Features: fully automated; priority based; backup pipelines automated; transferring backup pipeline results to TSD is also automated. It uses the S3 API to transfer data to TSD; data are written to the s3 api endpoint.

  6. Strategy to choose a single sample in reanalysis by lims-exporter-api when multiple samples match the sample ID

    The samples in the following projects will be ignored:

    • a predefined list of projects (e.g. test projects for testing various lab settings)
    • projects containing 'test'
    • reanalyse projects
    • whole genome sequencing projects
    • projects with an improper name (should be in the format: Diag-excapXX-YYYY-MM-DD)

    If there are still multiple samples:

    • choose the samples captured with the Av5 KIT
    • choose samples in the latest project.

    If there are still multiple samples matching or no samples were found, the lims-exporter-api will send the sample to 'Manager review'.

    When sending to 'Manager review', lims-exporter-api will include the whole list of projects and samples to help the lab find the correct one.

  7. Order of reanalysis

    The request from lab engineers should contain the following information:

    • The sample ID in the previous analysis (for sample IDs from before Verso was used, the first nine digits are enough)
    • The referred gene panel name for the reanalysis
    • The analysis type (trio or single) and the proband gender if the analysis type is trio

Other documents

  • HTS - Overordnet beskrivelse av arbeidsflyt
  • HTS Bioinf - Basepipe pipeline
  • HTS Bioinf - Trio pipeline
  • HTS Lab - TaqMan SNP-ID
  • HTS - Mismatch between TaqMan SNP-ID and sequencing data
  • HTS - Samples that fail QC in bioinformatic pipeline
  • HTS - Use of NA samples for quality control