
HTS Bioinf - Execution and monitoring of pipeline

Scope

This procedure explains how to perform daily bioinformatic production: monitoring and running the pipeline to analyse diagnostic High Throughput Sequencing (HTS) samples, and informing the units of any errors and of delivery.

Responsibility

Responsible person: A qualified bioinformatician on production duty.

Overview

Common terms and locations

  • Automation system - The system automatically launching the pipelines for the incoming analyses and tracking their status in a dedicated database. The name of the automation system is vcpipe and the component responsible for running the analyses is named executor.
  • production/:   The folder for production data. It contains several subfolders:
    • data/analyses-results:   analyses results;
    • data/analyses-work:   analyses configuration and intermediate storage;
    • interpretations:   samples deliveries to the lab [TSD only];
    • logs:   logs from the automation system and its UI;
    • sw:   software used in production;
    • tmp:   a temporary working folder for use via the TMPDIR environment variable, if the system requires it (e.g. low disk space on /tmp).
  • serviceuser:   configuration files and scripts for logging in and running processes as a non-personal service user. See the "HTS Bioinf - Using service user" procedure.

NOTE: On TSD, the sequencing data are stored in /ess/p22/archive/no-backup/samples; on NSC they are temporarily stored in production/data/samples. Whenever a fingerprinting test fails on a WGS sample, the respective .bam and .bai files are copied to /boston/diag/diagInternal/verifications on NSC, alongside the output of the fingerprinting test itself.

This document refers to some common folders on the platforms where the pipeline is run.

On NSC the data relevant for the pipeline are stored on the server boston, available at /boston/diag.

Term | NSC | TSD
durable | - | /ess/p22/data/durable
ella | - | /ess/p22/data/durable/production/ella
ella-import-folder | - | /ess/p22/data/durable/production/ella/ella-prod/data/analyses/imported
production | /boston/diag/production | /ess/p22/data/durable/production
script-home | /boston/diag/transfer/sw | /ess/p22/data/durable/production/sw/automation/tsd-import/script
transfer | /boston/diag/transfer/production | /ess/p22/data/durable/s3-api/production

Communication to the users

To streamline communication with the users whenever there are issues with any of the GDx operational systems (including TSD), and to announce releases, we use the Microsoft Teams GDx operational channel 'GDx bioinformatikk - informasjon og kontaktskjema'.

Refer to this procedure for when and how to use the channel.

Planning production

The production coordinator, or, in their absence, the bioinformatics coordinator (see procedure HTS Bioinf Group roles), sets up a planned schedule every quarter (the production quarter plan) and appoints two trained bioinformaticians to production duty: one as the main responsible bioinformatician and the other as the backup bioinformatician, who covers for the main responsible bioinformatician should the latter not be available (e.g. due to sickness). The plan should be published on the web page once established.

Starting production

1. Registration of production duty

The main responsible bioinformatician should check whether the information about "Start Date", "Main responsible bioinfo." and "Backup bioinfo." for the current production interval on the web page is correct. If the information is not correct, the main responsible bioinformatician should update the web page with the correct information. All the names are registered with Signaturkode (OUS user name).

It is up to the main responsible bioinformatician to transfer production duty to the next main responsible bioinformatician according to the production quarter plan. Create a production shift issue in GitLab using the production_shift template and go through the checklist to make sure knowledge is transferred to the next responsible bioinformatician and that it is clear who will take care of which samples in the queue.

Once the duty has been transferred, the new main responsible bioinformatician must make sure all production services are running.

If any service isn't running, refer to the section How to start production services below.

Everyday duties

1. Re-login to ensure service health

  • Check that production services are up every week (Monday)

Log into all VMs that run production services as the service user (except when checking nsc-exporter). The tables below show the suggested distribution of services. This may differ slightly if some VMs are not available at the time.

VM on TSD | service
p22-app-01 | anno
p22-cluster-sync | filelock
p22-ella-01 | ELLA
p22-submit | executor
p22-submit2 | webui, nsc-overview
p22-submit-dev | backup for other VMs

Server/VM on NSC | service
beta | lims-exporter-api
diag-db |
diag-executor | executor
diag-webui | webui, nsc-overview
sleipnir | nsc-exporter
  • Renew the Kerberos ticket on TSD for individual users

Access to the /ess/p22 file system normally requires a valid Kerberos ticket. An expired or invalid ticket will result in denied access for individual users (the service user is exempt from Kerberos authentication).

Run klist on all TSD servers that run production scripts (filelock-exporter-api, executor and webui). If the ticket is close to expiring, re-login to the server to renew it.
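
For example, on each of those servers:

    klist    # lists the cached Kerberos ticket; check the "Expires" column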

2. Check status of scheduled recurring jobs

We use Dagu to schedule and execute the recurring jobs; see HTS Bioinf - Scheduled recurring jobs. Check the web service running on http://p22-app-01:8180 daily to catch any unexpected failures of these jobs. If a failure has occurred, make sure it is followed up.

3. Check status of sample processing

While a project is in progress, check the automation system's user interface (webui) a few times during the day for any failing analyses or other anomalies.

Check that the samples/analyses are imported into the system and that they start being processed. Compare the number of new analyses in the system to the number expected from the new project. Note that if the "Gene panel" field is set to 'Baerer' or 'Custom' for a sample, there will be no annopipe analysis folder for it.

For EKG, if the "Blank" sample has more reads than the threshold, lims-exporter-api will only process that sample and send the other samples in the same project to "Manager review" in Clarity. The "Blank" sample will then be analyzed, and the results from both basepipe and annopipe must be force delivered from the UI. Inform EKG when the results are ready; they will decide whether or not to continue with the analysis of the other samples.

If there are any failing analyses or other anomalies in the processing, see exceptions below for details. If the observed symptoms are not covered, investigate the issue (see New/unknown problem) and raise the issue with the bioinformatics group immediately (and, if relevant, with other concerned parties).

4. Failing QC or Failing SNP fingerprinting test

If the QC fails, the pipeline will continue running. However, the analysis will be marked as "QC Failed" and the basepipe and triopipe results will not be copied to the corresponding analyses-results folders. Investigate the unmet criteria by looking at the "QC" section of the analysis in the UI.

Runs that fail QC must be discussed with the lab. The solutions for some typical scenarios are described in the procedure "HTS - Samples that fail QC in bioinformatic pipeline" (eHåndbok, ID 128858). Relevant details must be documented in Clarity. If the final decision is to deliver results, this can be done by clicking "Deliver analysis" under "Post" for the analysis on the UI.

If the SNP fingerprinting test fails, the pipeline will stop running and the analysis will be marked as "Failed". Investigate how many samples are affected and whether the cause is too few sequencing reads (especially for EKG target sequencing samples). Should this be the case, notify the lab right away. If the number of mismatched sites is larger than 3 and other analyses from the same project are still running, you may have to wait for them to complete before proceeding with the investigation.

To find the number of reads, find the corresponding run on the "NSC sequencing overview" page and follow the "Demultiplexing" link. This will open a page in Clarity where you find information from the sequencing.

Generate a summary of failed and passed samples. Inform the lab as soon as possible. See procedure "HTS - Mismatch between TaqMan SNP-ID and sequencing data" (eHåndbok, ID 90647) for further possible courses of action.

If the lab wants to create a new TaqMan/fingerprint file and re-run the pipeline, you must add the sample name (Diag-<project>-<sampleID>) to the cleaning blacklists (TSD: {production}/sw; NSC: /boston/diag/transfer/sw). This keeps the sample files on both NSC and TSD, and lims-exporter-api will automatically pick up the analyses when the lab adds them back to the queue in Clarity.
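
A minimal sketch of adding a sample to the blacklist on TSD; the file name cleaning-blacklist.txt is hypothetical, so use the actual blacklist file present in the directory:

    # hypothetical blacklist file name; check which blacklist file exists under {production}/sw
    echo "Diag-<project>-<sampleID>" >> {production}/sw/cleaning-blacklist.txt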

If a new TaqMan SNP-ID result is generated, make sure the updated TaqMan SNP-ID result is copied to the /ess/p22/archive/no-backup/samples/Diag-<project>-<sampleID> folder (usually taken care of by filelock-exporter-api) and then re-run the analysis through the webUI (by clicking "Resume analysis").

5. Result delivery and Clarity update

When the pipeline results for a project are ready, the actions to take differ depending on which unit owns the project. The result will normally be imported automatically into ELLA a few hours after the pipeline has successfully finished.

The sample can be moved to the next step in Clarity only when:

  • the analysis of the sample is in ELLA's database or
  • the sample is available in the sample repo JSON file (in case no analysis, e.g. annotation, refers to it)

After confirming that the samples are in ELLA's database (see HTS Bioinf - ELLA core production for how to access ELLA's database), respond to the Helpdesk ticket of the project delivery with the respective unit in "CC":

  • EKG ('EKG' in the project name): EKG@ous-hf.no
  • GDx ('excap' or 'wgs' in the project name): diag-lab@medisin.uio.no

The same recipients should also be notified by email (via the Helpdesk ticket) in case of errors.

For priority2 and priority3 samples, once the samples are in ELLA, a delivery email should be sent as soon as possible (via the Helpdesk ticket).

6. Handling control samples

When a control sample (NA12878, HG002, HG003 or HG004) is included in the project, the trend analyses should be carried out by the bioinformatician on production duty as described in the procedure "HTS - Use of NA samples for quality control" (eHåndbok, ID 105870).

7. Reanalysis of samples

If basepipe results are not available, or are available but not up-to-date across all samples in the analysis (e.g. they were produced with different versions of the pipeline), reanalysis of the basepipe pipelines for all missing or out-of-date samples will be triggered from Clarity through lims-exporter-api's queue.

All analysis-work folders will be automatically generated by lims-exporter-api.

The analyses that require the most tracing work are those involving hybrid trios, i.e. trios whose members were not all sequenced at the same time. To confirm the successful execution of a reanalysis involving a hybrid trio, it might be necessary to track all its members using the family ID. Note that the sample ID in the name of the reanalysis may differ from the original one (registered under the column "UDF/Reanalysis old sample ID Diag" in the sample sheet). lims-exporter-api will find the right sample ID; otherwise it will send the sample to "Manager review" in Clarity.

If the reanalysis does not involve a hybrid trio, check whether its results folder is present in the sample repo JSON file. If it is, notify the lab that the sample can be reanalysed directly from ELLA. If it is not, or if a basepipe re-run is warranted regardless (for these cases, see the procedure "HTS - Custom genpanel og reanalyse av sekvensdata" (eHåndbok, ID 121052)), the executor will find it in the sample folder archive located in /ess/p22/archive/no-backup/samples.

When the analysis-work folder is transferred to TSD, filelock checks whether previous basepipe or triopipe results exist in {production}/data/analyses-results/singles or {production}/data/analyses-results/trios. If they do, before moving the analysis-work folder to {production}/data/analyses-work/, filelock moves the existing results to /ess/p22/archive/production/analyses-results/{singles,trios} and renames the folder from {folder_name} to {folder_name}-YYYY-MM-DD.
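
Conceptually, this automatic archiving is equivalent to the following (illustration only, shown for a singles result; no manual action is needed):

    mv {production}/data/analyses-results/singles/{folder_name} \
       /ess/p22/archive/production/analyses-results/singles/{folder_name}-YYYY-MM-DD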

The only possible manual intervention will be:

  • If the basepipe, triopipe or annopipe analysis is already registered in the production database, it won't be started automatically by the executor and will have to be started manually through the webUI by clicking 'Resume analysis' under the 'Admin' tab.

8. Disk space on TSD and NSC

Check the storage capacity daily (df -h). There should be more than 150T of free space on both TSD and NSC (approx. 70% full on NSC). Check the following locations:

  • TSD: /ess/p22/
  • NSC: /boston/diag/
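
For example:

    df -h /ess/p22/        # TSD
    df -h /boston/diag/    # NSC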

Storage space availability on NSC is more challenging to estimate. Inspect the file /boston/disk-usage.txt to get a better overview.

If cleaning is needed, follow the instructions in the section Cleaning the workspace.

Data compression

In order to save space, the BAM files of exome and genome samples can be compressed. This is an ongoing effort. See the procedure HTS Bioinf - Storage and security of sensitive data for how and when to compress BAM files.

9. Update the annotation databases

If needed, the main responsible bioinformatician updates the annotation databases according to the procedures in HTS Bioinf - Update public databases. In case they are busy with production, the backup bioinformatician can be asked to do it.

Every six months, the ELLA classifications database should also be updated according to the following instructions (a command sketch follows the list).

  • Locate the latest ELLA classifications VCF file under /ess/p22/data/durable/production/anno/sensitive-db/ella-classifications on TSD;
  • Update the /ess/p22/data/durable/production/anno/sensitive-db/ella-classifications.vcf symlink to point to it;
  • Export the same file to NSC and update the corresponding ella-classifications.vcf symlink there to point to it.
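
A minimal sketch of the symlink update on TSD, assuming the latest export is named ella-classifications-YYYY-MM-DD.vcf (the actual file name may differ):

    cd /ess/p22/data/durable/production/anno/sensitive-db
    ls -lt ella-classifications/                  # identify the latest classifications VCF
    # hypothetical file name; point the symlink at the actual latest file
    ln -sfn ella-classifications/ella-classifications-YYYY-MM-DD.vcf ella-classifications.vcf
    ls -l ella-classifications.vcf                # verify the new symlink target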

10. Lab stand-up meeting

Both the main responsible bioinformatician and the backup bioinformatician should attend the lab stand-up meeting at 10:50 every Monday and Thursday.

Finishing production

Cleaning the workspace

The following are the locations relevant for this task:

FILE_EXPORT=/ess/p22/data/durable/file-export/dev/{USER_FOLDER}
FILE_IMPORT=/ess/p22/data/durable/file-import/p22-diag-ous-bioinf-group
SCRIPT_LOCATION=/ess/p22/data/durable/production/sw/utils/vcpipe-utilities/src/clean  # on TSD
SCRIPT_LOCATION=/boston/diag/production/sw/utils/vcpipe-utilities/src/clean  # on NSC
ARCHIVE_LOCATION=/ess/p22/data/durable/production/sw/utils/vcpipe-utilities/src/clean/archive  # on TSD
ARCHIVE_LOCATION=/boston/diag/transfer/clean/archive  # on NSC

Tip

Check the wiki for how to transfer files between TSD and NSC. When uploading files, add --group p22-diag-ous-bioinf-group to the tacl command; the files will then be transferred to TSD at /ess/p22/data/durable/file-import/p22-diag-ous-bioinf-group.

Perform the cleaning in the following order:

Note

In the following steps, the file list JSON file and the deletion commands will be generated in the directory from which you run the scripts, unless you also specify a path in the output file name.

  1. on TSD: log into any p22-submit* VM, e.g. p22-submit-dev, and run module load {python3 with version} (you can run module avail python3 to find which versions are available).
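
    For example (the module version below is an assumption; pick one listed by module avail):

    module avail python3
    module load python3/3.11.2   # hypothetical version string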

  2. on TSD: Create a file list JSON file:

    python3 ${SCRIPT_LOCATION}/createFileList.py \
        --output candidateFileList_tsd_YYYYMMDD.json >candidateFileList_tsd_YYYYMMDD.log
    
  3. on NSC (beta): Find the latest file list JSON file:

    Warning

    Samples that fail the pipeline are automatically put on the cleaning blacklist of the platform (NSC or TSD) where they were processed. However, samples failed on TSD must, in addition, be manually put on the NSC cleaning blacklist to prevent the FASTQ files from being deleted on NSC. Such samples must be added to the NSC cleaning blacklist before the candidateFileList_nsc_YYYYMMDD.json file is (automatically) created.

    The candidateFileList_nsc_YYYYMMDD.json and candidateFileList_nsc_YYYYMMDD.log files are created automatically every day and updated every hour by the cron job at /home/nsc-serviceuser01/cron/find_deletion_candidates.sh (run crontab -l as the service user to see the schedule). Find the latest files in /boston/diag/transfer/clean.

    Or create one manually:

    python3 ${SCRIPT_LOCATION}/createFileList.py \
        --output candidateFileList_nsc_YYYYMMDD.json >candidateFileList_nsc_YYYYMMDD.log
    
  4. transfer the candidateFileList_nsc_YYYYMMDD.json file to TSD (see tip above) and move the NSC copy of the file to the NSC ARCHIVE_LOCATION.

  5. on TSD: Create the deleting commands to be used on TSD:

    python3 ${SCRIPT_LOCATION}/createDeleteBash.py \
        --input candidateFileList_tsd_YYYYMMDD.json >deleteCmd_tsd_YYYYMMDD.bash
    
  6. on TSD: Create the deleting commands to be used on NSC:

    python3 ${SCRIPT_LOCATION}/createDeleteBash.py \
        --input ${FILE_IMPORT}/candidateFileList_nsc_YYYYMMDD.json >${FILE_EXPORT}/deleteCmd_nsc_YYYYMMDD.bash
    
  7. transfer deleteCmd_nsc_YYYYMMDD.bash to NSC ARCHIVE_LOCATION (see tip above).

  8. delete ${FILE_EXPORT}/deleteCmd_nsc_YYYYMMDD.bash once it has been transferred to NSC.

  9. on NSC: Run the delete bash script (remember to make it executable first):

    bash /boston/diag/transfer/production/nsc_tsd_sync/clean/deleteCmd_nsc_YYYYMMDD.bash
    
  10. on TSD: Run the delete bash script (remember to make it executable first):

    bash deleteCmd_tsd_YYYYMMDD.bash
    
  11. On both NSC and TSD: Make sure the candidateFileList_PLATFORM_YYYYMMDD.json and deleteCmd_PLATFORM_YYYYMMDD.bash files used, where PLATFORM is either tsd or nsc, are moved to the respective platform's cleaning ARCHIVE_LOCATION.

The scripts will clean the following locations:

On NSC:

  • /boston/diag/production/data/{samples,analyses-work,analyses-results/singles,analyses-results/trios}
  • /boston/diag/transfer/{normal,high,urgent}/{samples,analyses-results/singles,analyses-results/trios,ella-incoming}
  • /boston/diag/nscDelivery/RUN_DELIVERY/*fastq.gz (if there are no *.fastq.gz files under a RUN_DELIVERY folder, the whole RUN_DELIVERY folder can be deleted)

On TSD:

  • /ess/p22/data/durable/production/data/{samples,analyses-work,analyses-results/singles,analyses-results/trios}
  • /ess/p22/data/durable/s3-api/production/{normal,high,urgent}/{samples,analyses-work,analyses-results/singles,analyses-results/trios,ella-incoming}

Updating Clarity

Go through the steps 'Lims exporter', 'Bioinformatic processing' and 'Bioinformatic QC fail' in Clarity, and make sure that all samples are in the right stage.

Starting/Stopping production services

Only processes not owned by the service user must be started/stopped at the beginning/end of a production shift. The nsc-exporter process should normally be the only process in this category. Instructions for starting/stopping it, as well as all other services, can be found in the sections How to start production services and How to stop production services.

Run the following command to have all aliases available:

on TSD:

    source {production}/sw/prod-screen-commands.txt

on NSC:

    source {production}/sw/nsc-screen-commands.txt

How to start production services

1. Starting LIMS exporter and NSC exporter

On NSC:

  • log into the server beta to access the service user script:

    ssh beta.ous.nsc.local
    

    if your username on the machine is different from your username on NSC, you need to add username@ as a prefix to the server name

    Then log in as the service user.

  • Start lims-exporter-api:

    nsc-start-lims-exporter-api #alias for screen -dm -S lims-exporter {script-home}/lims-exporter-api.sh
    
  • Log into sleipnir from beta:

    ssh sleipnir
    
  • Start nsc-exporter (as your personal user):

    The setup of tacl for transfer of files to TSD requires 2FA which the service user does not have. The files that end up in TSD's transfer areas are owned by nobody:p22-member-group regardless of which user performed the transfer.

    nsc-start-nsc-exporter #alias for screen -dm -S nsc-exporter {script-home}/nsc-exporter.sh
    

2. Starting filelock exporter

On TSD:

  • log into p22-cluster-sync as the service user. NOTE: should p22-cluster-sync not be available, p22-submit-dev can be used instead.

  • start filelock-exporter-api:

    prod-start-filelock-api #alias for screen -dm -S filelock-exporter {production}/sw/automation/filelock-exporter-api
    

3. Starting webUI, executor and NSC sequencing overview page

On TSD:

  • start executor, after logging into p22-submit as the service user:

    prod-start-executor #alias for screen -dm -S prod-executor {production}/sw/vcpipe-executor
    
  • start webui, after logging into p22-submit2 as the service user

    prod-start-webui #alias for screen -dm -S prod-webui {production}/sw/vcpipe-ui
    

    The UI runs on port 8080; you can open it in a web browser at p22-submit2:8080.

  • start nsc-overview, after logging into p22-submit2 (or another p22-submit* node) as the service user:

    prod-start-nsc-overview #alias for screen -dm -S nsc-overview {production}/sw/nsc-overview
    

    You can open the UI in a web browser at p22-submit2:8889/sequencer.html (or p22-submit*:8889/sequencer.html).

On NSC:

  • start executor, after logging into diag-executor as the service user:

    nsc-start-executor #alias for screen -dm -S executor-nsc {production}/sw/executor-nsc
    
  • start webui, after logging into diag-webui as the service user:

    nsc-start-webui #alias for screen -dm -S webui-nsc {production}/sw/ui-nsc
    

    The UI runs on port 1234, so you may want to forward that port to your local machine:

    ssh {username}@diag-webui.ous.nsc.local -L 1234:localhost:1234
    

How to stop production services

Before stopping the following processes, confirm that they are sleeping (i.e. not actively processing anything).

  • lims-exporter: check the latest log under /boston/diag/transfer/sw/logsAPI.

  • nsc-exporter: check that the marker file /boston/diag/transfer/sw/NSC-EXPORTER-ACTIVE is absent (see the NSC exporter section under Background), or check the latest log under /boston/diag/transfer/sw/logsAPI.

  • filelock-exporter: tail the log at

    {production}/logs/tsd-import/filelock-exporter/filelock_exporter.log
    
  • executor: check webUI on the corresponding network (NSC or TSD) to see whether any samples are running.

1. Stopping LIMS exporter and NSC exporter

On NSC:

  • Stop lims-exporter-api:

    • log into the server beta:
    ssh beta.ous.nsc.local  # or ssh 192.168.1.41
    
    • If you are the owner of the lims-exporter process (normally service user):
    nsc-stop-lims-exporter-api #alias for screen -X -S lims-exporter quit
    
    • If you're not the owner of the lims-exporter process:
    touch {transfer}/sw/kill-limsexporter
    

    Wait 5+ minutes after touching the kill file. Run screen -list to confirm that the process isn't running. Then remove the kill file.

  • Stop nsc-exporter:

    • Log into sleipnir from beta:
    ssh sleipnir
    
    • If you are the owner of the nsc-exporter process:
    nsc-stop-nsc-exporter #alias for screen -X -S nsc-exporter quit
    
    • If you're not the owner of the nsc-exporter process:
    touch {transfer}/sw/kill-nscexporter
    

    Wait 5+ minutes after touching the kill file. Run screen -list to confirm that the process isn't running. Then remove the kill file.

2. Stopping filelock exporter

On TSD:

  • log into p22-cluster-sync:

    ssh p22-cluster-sync  # or p22-submit-dev when p22-cluster-sync is not available
    
  • If you are the owner of the filelock-exporter process (normally service user):

    prod-stop-filelock-api #alias for screen -X -S filelock-exporter quit
    
  • If you are not the owner of the filelock-exporter process:

    touch {production}/sw/killfilelock
    

    Wait 5+ minutes after touching the kill file. Run screen -list to confirm that the process isn't running. Then remove the kill file.

3. Stopping webUI, executor and NSC sequencing overview page

Note: Skip the NSC part if the NSC pipeline isn't running, skip the TSD part if the TSD pipeline isn't running.

  • If you are the owner of the production processes (normally service user), run the normal stop commands described below. Remember to remove any kill files before starting the processes.

    On TSD:

    • stop executor:

      ssh p22-submit #log into p22-submit
      prod-stop-executor #alias for screen -X -S prod-executor quit
      
    • stop webui:

      ssh p22-submit2 # log into p22-submit2
      prod-stop-webui #alias for screen -X -S prod-webui quit
      
    • stop nsc-overview:

      ssh p22-submit2 #log into p22-submit2
      prod-stop-nsc-overview #alias for screen -X -S nsc-overview quit
      

      alternatively, another p22-submit* node can be used, too

    On NSC:

    • stop executor:

      ssh diag-executor.ous.nsc.local # log into diag-executor
      nsc-stop-executor #alias for screen -X -S executor-nsc quit
      
    • stop webui:

      ssh diag-webui.ous.nsc.local #log into diag-webui
      nsc-stop-webui #alias for screen -X -S webui-nsc quit
      
  • If you are not the owner of the production processes, you must create kill files to make the processes stop themselves.

    For TSD and NSC, run:

    touch {production}/sw/kill{executor,webui}
    

    Wait 5+ minutes after touching the kill file, check the executor log and the webUI URL to confirm that the processes aren't running, then remove the kill files.

Errors and exceptions

How to redo demultiplexing

When demultiplexing must be repeated, e.g. due to wrong sample names or any other reasons, go through the following steps in Clarity:

  • Under "PROJECTS & SAMPLES", search for the project, open it and find one of the samples in the sequencing run, which needs demultiplexing.
  • Open the sample, click requeue (the blue circle arrow beside the step) for the step "Demultiplexing and QC NSC", and click "Demultiplexing and QC NSC". This will open a new view.
  • Click "Run" for "Auto 10: Copy run". When this is done, click "Run" for "Auto 20: Prepare SampleSheet". Wait until it is done, refresh the page in the browser (to make sure nothing is running in the background).
  • Click the file name under "Demultiplexing sample sheet" in 'Files' to download the file, correct the information in this sample sheet and save it as a new file. Remove the file under "Demultiplexing sample sheet" by clicking the cross, then upload the corrected file there.
  • Remove the delivery folder containing the files with the wrong information under /boston/diag/nscDelivery, change step "Auto 90. Delivery and triggers" and "Close when finish" from 'Yes' to 'No', and click "save" on top.
  • Click "Run" for the step "Auto 30. Demultiplexing". This will automatically proceed until "Auto 90". For a genome sequencing run, it could take around 2 hours for these steps to finish. Refresh the browser to see which steps are finished.
  • After the steps above are finished, check whether the files under /boston/diag/runs/demultiplexing/{RUN_FOLDER}/Data/Intensities/BaseCalls/{NSC_DELIVERY_FOLDER} contain the right information, change step "Auto 90. Delivery and triggers" and "Close when finish" back to 'Yes' and click "save" on top. Click "Run" on "Auto 90. Delivery and triggers".

Note: you need not click "Run" on "Close when finished". The delivery folder will appear in /boston/diag/nscDelivery, /boston/diag/runs/demultiplexing will be empty, and the samples will appear at the lims-exporter-api step in Clarity. If the information is still wrong, talk to the production team about possible solutions.

How to import specific samples

By default, all samples in the lims-exporter step in Clarity will be exported. If you only want to export specific samples, stop lims-exporter-api and start it again with a combination of any of these options:

  • --samples:  only export the given sample(s), e.g. 12345678910 or 12345678910,...
  • --projects:  only export samples in these project(s), e.g. Diag-excap172-2019-10-28 or Diag-excap172-2019-10-28,Diag-wgs22-2019-04-05
  • --priorities:  only export samples with the following priorities, e.g. 2 or 2,3

These options will be remembered; to make lims-exporter-api export all samples again, stop it and restart it without any options, and let it run continuously.
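
A hedged sketch of restarting with filters, assuming the options are simply appended to the lims-exporter-api.sh invocation used by the start alias (verify against the actual wrapper before use):

    # stop the running exporter first, then restart it in a screen session with filters
    screen -dm -S lims-exporter {script-home}/lims-exporter-api.sh \
        --projects Diag-excap172-2019-10-28 --priorities 2,3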

How to switch between NSC pipeline and TSD pipeline

Samples/analyses can target either the TSD pipeline or the NSC pipeline. The default pipeline is specified in the "default_platform" field of the /boston/diag/transfer/sw/lims_exporter_api.config file.

The default pipeline can be overruled by adding project(s) or sample(s) to the platform-whitelist-NSC.txt and platform-whitelist-TSD.txt files in the /boston/diag/transfer/sw folder. In both whitelist files, lines starting with # are treated as comment lines. The format is <project>[-<sampleid>], one per line, e.g.

Diag-wgs72
Diag-wgs72-12345678901

If only a project is given, all samples of the project will be included. Use only the portion of the project name before the 2nd dash, e.g. Diag-EKG210216 instead of Diag-EKG210216-2021-02-16.

Reanalyses will always target the TSD pipeline.

NSC pipeline results (analyses-results folder(s) and ella-incoming folder) are continuously and automatically transferred to TSD by nsc-exporter.

The majority of analyses are run on TSD, but some analyses might need to be run on NSC. These are:

Sample types | Priority | Description
exome | 2 and 3 | Priority is given by LIMS or communicated by the lab by other means.
target gene panel | 1, 2 and 3 | Captured by target gene panels.
genome WGS Trio | 2 and 3 | Rapid genome ("Hurtiggenom"); priority is given by LIMS or by the lab.

Situations to consider when deciding:

  • TSD's HPC cluster is very busy (see cluster tools below)
  • The VMware login screen is inaccessible (logging into TSD is not possible)
  • The file system on TSD is slow / unresponsive
  • The VMs running the automation system (p22-submit, p22-submit2, p22-submit-dev) are not available
  • The s3-api folders for transferring data are not available
  • Scheduled maintenance
  • Problems with licence for required tools (e.g. Dragen)

Use the tools pending and qsumm to test the cluster's capacity:

  • pending : gives an indication of when the jobs owned by the bioinformatician on production duty will be launched.
  • qsumm : gives an overview of all jobs pending or being processed in the Slurm queue.

If the queue is still full by the end of the day, then the samples should be run on the backup pipeline.

How to update the S3 API Key

The long-lived S3 API key must be updated yearly. This key was initially issued on May 13th, 2020.

  • Last update by yvastr May 5th, 2023;
  • Next update before May 5th, 2024.

The procedure for updating this key is:

  1. run this command to generate a new key:

    curl https://alt.api.tsd.usit.no/v1/p22/auth/clients/secret \
        --request POST \
        -H "Content-Type: application/json" \
        --data '{
                "client_id": "_our client_id_",
                "client_secret": "_our current api key_"
                }'
    

    NOTE: Replace _our client_id_ with the client_id in the /ess/p22/data/durable/api-client.md file. Replace _our current api key_ with the text in the /boston/diag/transfer/sw/apikey_file.txt file.

    The above command will print a JSON string with three key-value pairs:

    {
        "client_id": ...client id here...,
        "old_client_secret": ...old client_secret here...,
        "new_client_secret": ...new client_secret here...
    }
    
  2. Replace the text in /boston/diag/transfer/sw/s3api_key.txt with the "new_client_secret" value above.
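
    A minimal sketch of that replacement; backing up the old key first is a suggestion and the .bak name is illustrative:

    cp /boston/diag/transfer/sw/s3api_key.txt /boston/diag/transfer/sw/s3api_key.txt.bak
    echo '<new_client_secret>' > /boston/diag/transfer/sw/s3api_key.txt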

Analysis is not imported into automation system

Start by looking in the file system for:

  • whether the files for the analysis in question are in {production}/data/analyses-work/{analysis_name}
  • whether the corresponding samples are in the {production}/data/samples/{sample_name} folders
  • whether there is a file called READY in both the {sample_name} and {analysis_name} folders
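
A quick way to run these checks from a shell (paths as in the list above; {analysis_name} and {sample_name} are placeholders):

    ls {production}/data/analyses-work/{analysis_name}/READY
    ls {production}/data/samples/{sample_name}/READY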

If any of those are missing, proceed to investigate the logs of the file lock system (normally located at {production}/logs/tsd-import/filelock-exporter/{datetime}.log) for any clues as to why they were not imported into the system. Create a GitLab issue describing the problem, so that it can be followed up there.

If all necessary files are present, proceed to investigate the logs in the automation system for any clues as to why the analysis was not imported (see "New / unknown problem" section below).

annopipe troubleshooting

Force-delivered basepipe/triopipe samples failing QC will go through annopipe, but they will have a 'QC failed' tag in the webUI.

If annopipe fails, you can inspect the corresponding Nextflow work folder under the analysis folder. Check the STATUS file to find which step failed, then inspect the log file in the corresponding step folder.
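
A minimal sketch for locating the STATUS file and the step logs, assuming the Nextflow work folder sits under the analysis folder as described:

    cd {production}/data/analyses-work/{analysis_name}
    find . -name STATUS            # locate the STATUS file in the Nextflow work folder
    cat <path-to-STATUS>           # see which step failed, then inspect that step's log file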

If annopipe runs successfully (irrespective of any "QC failed" tags), the post command will copy the required files to {production}/ella/ella-prod/data/analyses/incoming.

If the sample does not appear in ELLA's database, e.g. because the sample folder was not moved from the incoming folder to the imported folder, you can check the following:

  • whether the production instance of ELLA's analysis-watcher is still running on the Supervisor page (p22-ella-01:9007)
  • whether the sample is excluded from import into ELLA (see if it is listed in {production}/ella/ops/prod-watcher-blacklist.txt)
  • the log file under {production}/ella/ella-prod/logs/prod-watcher.log (prefer not to read the logs through the Supervisor page)

New/unknown problem

Start by looking at the analysis' log file. Normally these are available in the automation system's UI, but in some cases the log in the database can be empty. In that case, identify the analysis on the file system and look for the log file in its result folder: {production}/data/analyses-work/{analysis_name}/result/{datetime_of_result}/logs/stdout.log. It is also helpful to check whether the number of sequencing reads is too low for the analysis.

If that log doesn't contain any information, there has likely been a problem starting the analysis. Look into the log of the automation system, normally located at {production}/logs/variantcalling/vcpipe/executor.log. Grep for the analysis name to find the relevant section of the log and, if possible, check that the start time of the analysis in the UI matches the timestamp in the log.
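
For example:

    grep '{analysis_name}' {production}/logs/variantcalling/vcpipe/executor.log | less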

To investigate the data output of the pipeline, look inside {production}/data/analyses-work/{analysis_name}/result/{datetime_of_result}/data.

Create a GitLab issue to describe the observed problem and consult with the rest of the bioinformatics group to find a solution. All the follow-up will be monitored there.

How to convert BAM files to FASTQ files

For some old samples, if the sample folder is not in the sample folder archive, it must be created manually. The original sequencing data must be recreated from the BAM file used for variant calling in the original analysis (file.bam) using the following commands:

  1. Run RevertSam to convert BAM to SAM

    picard RevertSam \
        I=file.bam \
        O=file.sam \
        RESTORE_ORIGINAL_QUALITIES=true \
        SORT_ORDER=coordinate \
        CREATE_INDEX=true

    If the BAM file has already been compressed into a CRAM file, run the following before the command above:

    samtools view -b -o file.bam -T GENOME_REFERENCE file.cram

    GENOME_REFERENCE should be the reference genome that was used when compressing the BAM file.

  2. Convert the SAM file into FASTQ

    picard SamToFastq \
        I=file.sam \
        FASTQ=file.R1.fastq \
        SECOND_END_FASTQ=file.R2.fastq \
        UNPAIRED_FASTQ=file.R0.fastq
    

file.R1.fastq and file.R2.fastq are the read1 and read2 FASTQ files for the sample.

The sample configuration file (an example is attached; the required fields are described under "Required fields in sample and analysis configuration files" in the Background section) must be created under the individual's sample folder, and the FASTQ files and quality control results (fastqc folders) must be copied into the individual's sample folder as well. The structure of the folder is described in the "LIMS exporter API" part of the Background section.

Background

  1. sleipnir

    The dedicated file transfer server on the NSC network is called sleipnir. It's a mostly locked down server, only connected to the file lock on TSD via a dedicated network channel. It only has access to /boston/diag/transfer.

  2. LIMS exporter API

    The LIMS exporter API exports samples from Clarity using Clarity's API. The samples are exported in order of priority and corresponding *.sample and *.analysis files are created inside given repositories.

    The files will be organized as follows:

    repo-analysis
    ├── Diag-excap01-
    │   └── Diag-excap01-123456789.analysis
    └── Diag-excap01-123456789-EEogPU-v
        └── Diag-excap01-123456789-EEogPU-v02.analysis

    repo-sample
    └── Diag-excap01-
        ├── Diag-excap01-123456789-EEogPU-v02-KIT-Av5_GCCAAT_L007_R1_001_fastqc.tar
        ├── Diag-excap01-123456789-EEogPU-v02-KIT-Av5_GCCAAT_L007_R2_001_fastqc.tar
        ├── Diag-excap01-123456789-EEogPU-v02-KIT-Av5_GCCAAT_L007_R1_001.fastq.gz
        ├── Diag-excap01-123456789-EEogPU-v02-KIT-Av5_GCCAAT_L007_R2_001.fastq.gz
        ├── Diag-excap01-123456789.sample
        └── LIMS_EXPORT_DONE
    
    • Clarity exports samples and analyses automatically via lims-exporter-api.

    • lims-exporter-api exports all high priority samples.

    • lims-exporter-api will not export low priority samples when there are high priority samples to be exported.

    • When there are no high priority samples to be exported, lims-exporter-api will export low priority samples little by little. This is to avoid situations in which many low priority samples occupy nsc-exporter for too long, thereby delaying export of any incoming high priority samples.

    • The taqman-source is needed for single samples: the TaqMan files are searched for the one containing the right sample ID, which is then parsed to create a sample-specific fingerprint file alongside the fastq.gz data files.

    • The fastq.gz files are hardlinked from the original (to avoid copying).

    • The log file is stored under /boston/diag/transfer/sw/logsAPI and a new file will be generated every time lims-exporter-api is restarted.

    Required fields in sample and analysis configuration files

    Required information in the sample configuration file (.sample file):

    • lane: in the sample FASTQ file name, e.g. 5.

    • reads under stats: obtained by counting the number of lines in the sample FASTQ files and dividing by 4.

    • q30_bases_pct under stats: obtained from the Demultiplex_Stats.htm file under the run folder.

    • sequencer_id: in the NSC delivery folder, e.g. D00132.

    • flowcell_id: in the sample QC report file name, e.g. C6HJJANXX.

    • all information under reads.

      The path should be the sample's FASTQ file name.

      The MD5 checksum is calculated by typing the following command in the terminal:

      md5sum <FASTQ_FILE_NAME>
      

      The size can be obtained by typing the following command in the terminal (it is the number before the date):

      ls -l <FASTQ_FILE_NAME>
      
    • project: in the NSC delivery folder, e.g. Diag-excap41.

    • project_date: in the NSC delivery folder, e.g. 2015-03-27.

    • flowcell: in the NSC delivery folder, e.g. B.

    • sample_id: in the sample FASTQ file name, e.g. 12345678910.

    • capturekit: converted from the information in the sample FASTQ file name, e.g. "Av5" is converted to agilent_sureselect_v05, "wgs" remains wgs.

    • sequence_date: in the NSC delivery folder, e.g. 2015-04-14.

    • name: combined project and sample_id joined by -, e.g. Diag-excap41-12345678910.

    • taqman: the name of the file containing the SNP fingerprinting TaqMan results.

    Required information in the analysis configuration file (.analysis file):

    • name:

      basepipe: combined "project" and "sample_id" delimited by -, e.g. Diag-excap41-12345678910.

      triopipe: combined "project", proband "sample_id" and the TRIO tag delimited by -, e.g. Diag-excap41-12345678910-TRIO.

      annopipe: combined basepipe or triopipe name, gene panel name and gene panel version delimited by -, e.g. Diag-excap41-12345678910-TRIO-Mendel-v01.

    • samples:

      basepipe: only one sample

      triopipe: three samples in trio

      annopipe: one sample or three samples in trio

      The sample name should be the same as the name in the corresponding .sample file.

    • type:

      basepipe: basepipe.

      triopipe: triopipe.

      annopipe: annopipe.

    • taqman in params: equal to false, only in basepipe .analysis file.

    • pedigree in params: only in triopipe and, where applicable, annopipe .analysis files. For each of proband, father and mother, the sample and gender (male or female) must be specified. The sample should be the same as the name in the corresponding .sample file.

    • genepanel in params: only in annopipe .analysis file, combined gene panel name and gene panel version delimited by _, e.g. Mendel_v01.
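
    The layout below is a purely illustrative JSON sketch of the fields listed above; the actual file format, nesting and exact key names are defined by lims-exporter-api and may differ. All values are placeholders.

    {
        "name": "Diag-excap41-12345678910",
        "project": "Diag-excap41",
        "project_date": "2015-03-27",
        "sample_id": "12345678910",
        "capturekit": "agilent_sureselect_v05",
        "sequencer_id": "D00132",
        "flowcell": "B",
        "flowcell_id": "C6HJJANXX",
        "sequence_date": "2015-04-14",
        "lane": 5,
        "taqman": "<taqman-results-file>",
        "stats": {"reads": 50000000, "q30_bases_pct": 92.5},
        "reads": [
            {"path": "<R1 FASTQ file name>", "md5": "<md5sum>", "size": 1234567890},
            {"path": "<R2 FASTQ file name>", "md5": "<md5sum>", "size": 1234567890}
        ]
    }

    A similarly illustrative sketch of an annopipe trio .analysis file:

    {
        "name": "Diag-excap41-12345678910-TRIO-Mendel-v01",
        "type": "annopipe",
        "samples": ["Diag-excap41-12345678910", "<father sample name>", "<mother sample name>"],
        "params": {
            "genepanel": "Mendel_v01",
            "pedigree": {
                "proband": {"sample": "Diag-excap41-12345678910", "gender": "female"},
                "father": {"sample": "<father sample name>", "gender": "male"},
                "mother": {"sample": "<mother sample name>", "gender": "female"}
            }
        }
    }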

  3. NSC exporter

    The NSC exporter transfers samples, analyses, analyses-results and ella-incoming folders (produced by the NSC pipeline) from the NSC locations {production}/{urgent,high,normal}/{analyses,samples,analyses-results/{singles,trio},ella-incoming}/ to the TSD S3 API endpoint at /ess/p22/data/durable/s3-api/. The NSC exporter runs continuously and is priority-based, meaning that urgent data are transferred before normal priority data. The log files are stored under /boston/diag/transfer/sw/logsAPI and a new file will be generated at the beginning of every month.

    The NSC exporter can be in different states:

    • Stopped - indicated by the marker file /boston/diag/transfer/sw/NSC-EXPORTER-STOPPED. This file is touched by nsc-exporter when it is stopped.
    • Running and busy - indicated by the marker file /boston/diag/transfer/sw/NSC-EXPORTER-ACTIVE. This file is touched by nsc-exporter when it is transferring data to TSD and removed when it is done.
    • Running and idle - indicated by no marker files, i.e. neither of the two marker files above exists.
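
    A quick way to check the state from a shell on NSC, using the marker files described above:

    ls /boston/diag/transfer/sw/NSC-EXPORTER-STOPPED \
       /boston/diag/transfer/sw/NSC-EXPORTER-ACTIVE 2>/dev/null
    # no output means neither marker exists, i.e. the exporter is running and idle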

  4. Transferring data from NSC to TSD

    The default route for production is

    lims-exporter-api.sh + nsc-exporter.sh + filelock-exporter-api.sh

    Features: fully automated; priority-based; backup pipelines are automated, and transferring backup pipeline results to TSD is also automated. This route uses the S3 API to transfer data to TSD; data are written to the S3 API endpoint.

  5. Strategy to choose a single sample in reanalysis via lims-exporter-api when multiple samples match the sample ID

    The samples in the following projects will be ignored:

    • a predefined list of projects (e.g. test projects for testing various lab settings)
    • projects whose name contains the word 'test'
    • reanalysis projects
    • whole genome sequencing projects
    • projects with improper name (should be in the format: Diag-excapXX-YYYY-MM-DD)

    If there are still multiple samples:

    • choose the samples captured with Av5 KIT
    • choose samples in the latest project

    If there are still multiple samples matching or no samples were found, lims-exporter-api will send the sample to 'Manager review'.

    When sending to 'Manager review', lims-exporter-api will include the whole list of projects and samples to help the lab find the correct one.

  6. Order of reanalysis

    The request from the lab engineers should contain the following information:

    • The sample ID in the previous analysis (for sample IDs predating the introduction of Verso, the first nine digits are enough)
    • The gene panel name referred to in the reanalysis
    • The analysis type (trio or single) and the proband's gender if the analysis type is trio

Other documents

  • 08 Forebyggende tiltak, kontinuerlig forbedring, avviks-, klage- og skadeoppfølging - AMG (eHåndbok, ID 4965)
  • HTS - Custom genpanel og reanalyse av sekvensdata (eHåndbok, ID 121052)
  • HTS - Mismatch between TaqMan SNP-ID and sequencing data (eHåndbok, ID 90647)
  • HTS - Overordnet beskrivelse av arbeidsflyt (eHåndbok, ID 76458)
  • HTS - Use of NA samples for quality control (eHåndbok, ID 105870)
  • HTS - Samples that fail QC in bioinformatic pipeline (eHåndbok, ID 128858)
  • HTS Bioinf - Basepipe pipeline
  • HTS Bioinf - Trio pipeline
  • HTS Lab - TaqMan SNP-ID (eHåndbok, ID 56898)