
HTS Bioinf - Execution and monitoring of pipeline

Scope

This procedure explains how to perform daily bioinformatic production: monitoring and running the pipeline to analyse diagnostic High Throughput Sequencing (HTS) samples, and informing the units of any errors and of deliveries.

Responsibility

Responsible person: A qualified bioinformatician on production duty.

Overview

Common terms and locations

  • Automation system - The system automatically launching the pipelines for the incoming analyses and tracking their status in a dedicated database. The name of the automation system is vcpipe and the component responsible for running the analyses is named executor.
  • production/:   The folder for production data. It contains several subfolders:
    • data/analyses-results:   analyses results;
    • data/analyses-work:   analyses configuration and intermediate storage;
    • interpretations:   samples deliveries to the lab [TSD only];
    • logs:   logs from the automation system and its UI;
    • sw:   software used in production;
    • tmp:   a temporary working folder for use with the TMPDIR environment variable, if the system requires it (low disk space on /tmp).
  • serviceuser:   configuration files and scripts for logging in and running processes as a non-personal service user. See the "HTS Bioinf - Using service user" procedure.

NOTE: On TSD, the sequencing data are stored in /ess/p22/data/durable/production/data/samples; on NSC, they are temporarily stored in production/data/samples. Whenever a fingerprinting test fails on a WGS sample, the respective .bam and .bai files are copied to /boston/diag/diagInternal/verifications on NSC, alongside the output of the fingerprinting test itself.

This document refers to some common folders on the platforms where the pipeline is run.

On NSC the data relevant for the pipeline are stored on the server boston, available at /boston/diag.

| Term | NSC | TSD |
|---|---|---|
| durable | - | /ess/p22/data/durable |
| ella | - | /ess/p22/data/durable/production/ella |
| ella-import-folder | - | /ess/p22/data/durable/production/ella/ella-prod/data/analyses/imported |
| production | /boston/diag/production | /ess/p22/data/durable/production |
| transfer | /boston/diag/transfer/production | /ess/p22/data/durable/s3-api/production |

Communication to the users

To streamline communication with the users whenever there are issues with any of the GDx operational systems (including TSD), and to announce releases, we use the Microsoft Teams GDx operational channel GDx driftsmeldinger.

Refer to this procedure for when and how to use the channel.

Planning production

The production coordinator or, in their absence, the bioinformatics coordinator (see procedure HTS Bioinf Group roles) sets up a planned schedule every quarter (production quarter plan) and appoints two trained bioinformaticians to production duty: one as the main responsible bioinformatician, the other as the backup bioinformatician who covers for the main responsible bioinformatician should the latter not be available (e.g. due to sickness). Once established, the plan should be published on the web page.

Starting production

Registration of production duty

The main responsible bioinformatician should check whether the information about "Start Date", "Main responsible bioinfo." and "Backup bioinfo." for the current production interval on the web page is correct. If the information is not correct, the main responsible bioinformatician should update the web page with the correct information. All the names are registered with Signaturkode (OUS user name).

It is up to the main responsible bioinformatician to transfer production duty to the next main responsible bioinformatician according to the production quarter plan. Create a production shift issue in GitLab using the production_shift template and go through the checklist to make sure knowledge is transferred to the next responsible bioinformatician and it is clear who will take care of which samples in the queue.

Once the duty has been transferred, the new main responsible bioinformatician must make sure all production services are running:

  • Executor
  • Filelock exporter
  • LIMS exporter
  • NSC exporter
  • NSC sequencing overview
  • WebUI

If any service isn't running, refer to the section How to start production services below.

Everyday duties

1. Ensure services health

  • Check that production services are up every week (Monday) and after any TSD/NSC downtime. Refer to the production restart checklist, which you can use as template for an issue in the checklists repository. See also the section Starting/Stopping production services.

    The table below shows the recommended distribution of production services.

    NOTE: deviations from this are possible in case any of the recommended VMs were not available when the services were started.

    | Service | TSD VM[:Port] | NSC Server/VM[:Port] | Starting service | Stopping service |
    |---|---|---|---|---|
    | Anno | p22-anno-01:9000 | N/A | See Running annoservice | |
    | ELLA | p22-ella-01:5000 | N/A | See ELLA core production and maintenance | |
    | Executor | p22-hpc-01 | gdx-executor | prod-start-executor | prod-stop-executor |
    | File-lock exporter | p22-hpc-02 | N/A | prod-start-filelock-api | prod-stop-filelock-api |
    | LIMS exporter | N/A | gdx-login | prod-start-lims-exporter | prod-stop-lims-exporter |
    | NSC exporter | N/A | sleipnir | prod-start-nsc-exporter | prod-stop-nsc-exporter |
    | NSC overview | p22-hpc-02:8889 | ous-lims.sequencing.uio.no/over | prod-start-nsc-overview | prod-stop-nsc-overview |
    | WebUI | p22-hpc-02:8080 | gdx-webui:1234 | prod-start-webui | prod-stop-webui |
    | backup for other VMs | p22-hpc-03 | N/A | N/A | |
  • Renew the Kerberos ticket on TSD for individual users

    Access to the /ess/p22 file system normally requires a valid Kerberos ticket. An expired, invalid ticket will result in denied access for individual users (the service user is exempt from Kerberos authentication).

    Run klist on the command line on all TSD servers that run production scripts (filelock-exporter-api, executor and webui). If the ticket is close to expiry, log in to the server again to renew it.
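    A minimal sketch of this check, assuming you can ssh to the VMs from your current session (host names taken from the services table above):

        # Check the remaining Kerberos ticket lifetime on the VMs running production scripts
        for host in p22-hpc-01 p22-hpc-02; do
            echo "== ${host} =="
            ssh "${host}" klist    # log in to the VM again (or run kinit) if the expiry time is near
        done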

2. Check status of scheduled recurring jobs

We use Dagu to schedule and execute the recurring jobs, see HTS Bioinf - Scheduled recurring jobs. Check the web service running on http://p22-app-01:8180 daily to catch any unexpected failures of these jobs. If a failure has occurred, make sure it is followed up.

3. Check status of sample processing

While a project is in progress, check the automation system's user interface (webui) a few times during the day for any failing analyses or other anomalies.

Check that the samples/analyses are imported into the system and that they start being processed. Compare the number of new analyses in the system to the number expected from the new project. Note that if the "Gene panel" field is set to 'Baerer' or 'Custom' for a sample, there will be no annopipe analysis folder for it.

For EKG, if the "Blank" sample has more reads than the threshold, lims-exporter-api will only process that sample and send the other samples in the same project to "Manager review" in Clarity. The "Blank" sample will then be analyzed, and the results from both basepipe and annopipe must be force delivered from the UI. Inform EKG when the results are ready; they will decide whether to continue with the analysis of the other samples or not.

If there are any failing analyses or other anomalies in the processing, see exceptions below for details. If the observed symptoms are not covered, investigate the issue (see New/Unknown problem) and raise the issue with the bioinformatics group immediately (and, if relevant, with other concerned parties).

4. Failing QC or Failing SNP fingerprinting test

If the QC fails, the pipeline will continue running. However, the analysis will be marked as "QC Failed" and the basepipe and triopipe results will not be copied to the corresponding analyses-results folders. Investigate the unmet criteria by looking at the "QC" section of the analysis in the UI.

Runs that fail QC must be discussed with the lab. The solutions for some typical scenarios are described in the procedure "HTS - Samples that fail QC in bioinformatic pipeline" (eHåndbok, ID 128858). Relevant details must be documented in Clarity. If the final decision is to deliver results, this can be done by clicking "Deliver analysis" under "Post" for the analysis on the UI.

If the SNP fingerprinting test fails, the pipeline will stop running and the analysis will be marked as "Failed". Investigate how many samples are affected and whether the cause is too few sequencing reads (especially for EKG target sequencing samples). Should this be the case, notify the lab right away. If the number of mismatched sites is larger than 3 and other analyses from the same project are still running, you may have to wait for them to complete before proceeding with the investigation.

To find the number of reads, find the corresponding run on the "NSC sequencing overview" page and follow the "Demultiplexing" link. This will open a page in Clarity where you find information from the sequencing.

Generate a summary of failed and passed samples. Inform the lab as soon as possible. See procedure "HTS - Mismatch between TaqMan SNP-ID and sequencing data" (eHåndbok, ID 90647) for further possible courses of action.

If the lab wants to create a new TaqMan/fingerprint file and re-run the pipeline, you must add the sample name (Diag-<project>-<sampleID>) to the cleaning blacklists (TSD: {production}/sw; NSC: /boston/diag/transfer/sw). This will keep the sample files on NSC and TSD, and lims-exporter-api will automatically pick up the analyses when the lab adds them back to the queue in Clarity.
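For illustration only, appending a sample to the NSC cleaning blacklist could look like the sketch below; the actual blacklist file name is not stated in this procedure and the one used here is hypothetical, so check the sw folder first:

```bash
# Hypothetical blacklist file name -- check /boston/diag/transfer/sw for the actual one
echo "Diag-<project>-<sampleID>" >> /boston/diag/transfer/sw/cleaning-blacklist.txt
```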

If a new TaqMan SNP-ID result is generated, make sure the updated TaqMan SNP-ID result is copied to the /ess/p22/data/durable/production/data/samples/Diag-<project>-<sampleID> folder (usually taken care of by filelock-exporter-api) and then re-run the analysis through the webUI (by clicking "Resume analysis").

5. Result delivery and Clarity update

When the pipeline results for a project are ready, the actions to take differ depending on which unit owns the project. The result will normally be imported automatically into ELLA a few hours after the pipeline has successfully finished.

The sample can be moved to the next step in Clarity only when:

  • the analysis of the sample is in ELLA's database or
  • the sample is available in the sample repo JSON file (in case no analysis, e.g. annotation, refers to it)

After confirming that the samples are in ELLA's database (see HTS Bioinf - ELLA core production and maintenance for how to access ELLA's database), respond to the Helpdesk ticket of the project delivery with the respective unit in "CC":

  • EKG ('EKG' in the project name): EKG@ous-hf.no
  • GDx ('excap' or 'wgs' in the project name): diag-lab@medisin.uio.no

The same recipients should be notified by email (via the Helpdesk ticket) also in case of errors.

For priority2 and priority3 samples, once the samples are in ELLA, a delivery email should be sent as soon as possible (via the Helpdesk ticket).

6. Handling control samples

When a control sample (sample NA12878, HG002, HG003, HG004) is in the project, the trend analyses should be carried out by the bioinformatician on production duty as described in the procedure "HTS - Use of reference materials for internal quality control" (eHåndbok, ID 105870).

7. Reanalysis of samples

If basepipe results are not available, or are available but not up-to-date across all samples in the analysis (e.g. they were produced using different versions of the pipeline), reanalysis with the basepipe pipeline for all unavailable or out-of-date samples will be triggered from Clarity through the lims-exporter-api's queue.

All analysis-work folders will be automatically generated by lims-exporter-api.

The analyses that require the most tracing work are those involving hybrid trios, i.e. trios whose members were not all sequenced at the same time. To confirm the successful execution of a reanalysis involving a hybrid trio, it might be necessary to track all its members using the family ID. Note that the sample ID indicated in the name of the reanalysis may differ from the original one (registered under the column "UDF/Reanalysis old sample ID Diag" in the sample sheet). The lims-exporter-api will find the right sample ID; otherwise it will send the sample to 'Manager review' in Clarity.

If the reanalysis does not involve a hybrid trio, check whether its results folder is present in the sample repo JSON file. If it is, notify the lab that the sample can be reanalysed directly from ELLA. If it is not, or if a basepipe re-run is warranted regardless (for these cases, see the procedure "HTS - Custom genpanel og reanalyse av sekvensdata" (eHåndbok, ID 121052)), the executor will find it in the sample folder located in /ess/p22/data/durable/production/data/samples.

When the analysis-work folder transfers to TSD, the filelock exporter will check whether previous basepipe or triopipe results exist in {production}/data/analyses-results/singles or {production}/data/analyses-results/trios. If they are there, before moving the analysis-work folder to {production}/data/analyses-work/, the filelock exporter will move the existing results to /ess/p22/archive/production/analyses-results/{singles,trios} and rename the folder from {folder_name} to {folder_name}-YYYY-MM-DD.

The only manual intervention that may be needed:

  • If the basepipe, triopipe or annopipe analysis is already registered in the production database, it won't be started automatically by the executor and will have to be started manually through the webUI by clicking 'Resume analysis' under the 'Admin' tab.

8. Disk space on TSD and NSC

Check the storage capacity daily (df -h). There should be more than 150T free space on both TSD and NSC (approx. 70% full on NSC). Check the following locations:

  • TSD: /ess/p22/
  • NSC: /boston/diag/
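For example:

```bash
df -h /ess/p22/       # TSD: expect >150T free
df -h /boston/diag/   # NSC: expect >150T free (approx. 70% full)
```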

Storage space availability on NSC is more challenging to estimate. Inspect the file /boston/disk-usage.txt to get a better overview.

If cleaning is needed, follow the instructions in the section Cleaning the workspace. This is needed at least once a week to keep disk usage below 70% on NSC.

Data compression

In order to save space, the BAM files of exome and genome samples can be compressed. This is an ongoing effort. See the procedure HTS Bioinf - Storage and security of sensitive data for how and when to compress BAM files.

Automatic deletion of BCL files

There is a cronjob that automatically deletes run folders (which contain the BCL files) that have not been changed for the last 14 days. The log of this cronjob is on NSC at /boston/diag/runs/deleted_runs.log. The crontab can be edited by logging in to gdx-login as serviceuser and running crontab -e.
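As a rough sketch (an assumption about the mechanics, not the actual crontab entry), the deletion logic could look like this:

```bash
# Delete run folders untouched for 14+ days and log what was removed (assumed sketch)
find /boston/diag/runs -mindepth 1 -maxdepth 1 -type d -mtime +14 \
    | tee -a /boston/diag/runs/deleted_runs.log \
    | xargs --no-run-if-empty rm -rf
```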

9. Update the annotation databases

If needed, the main responsible bioinformatician updates the annotation databases according to the procedures in HTS Bioinf - Update of external annotation data and HTS Bioinf - Update of in-house annotation data. In case they are busy with production, the backup bioinformatician can be asked to do it.

Every six months, the ELLA classifications database should also be updated according to the following instructions.

  • Locate the latest ELLA classifications VCF file under /ess/p22/data/durable/production/anno/sensitive-db/ella-classifications on TSD;
  • Update the /ess/p22/data/durable/production/anno/sensitive-db/ella-classifications.vcf symlink to point to it;
  • Export the same file to NSC and update the corresponding ella-classifications.vcf symlink there to point to it.
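On TSD, the symlink update is a standard ln -sfn; the dated file name below is an assumption about the naming scheme:

```bash
cd /ess/p22/data/durable/production/anno/sensitive-db
# Point the symlink at the newest classifications file (file name assumed)
ln -sfn ella-classifications/ella-classifications-YYYY-MM-DD.vcf ella-classifications.vcf
```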

10. Lab stand-up meeting

Both the main responsible bioinformatician and the backup bioinformatician should attend the lab stand-up meeting at 10:50 every Monday and Thursday.

Finishing production

Cleaning the workspace

The following are the locations relevant for this task:

FILE_EXPORT=/ess/p22/data/durable/file-export
FILE_IMPORT=/ess/p22/data/durable/file-import/p22-diag-ous-bioinf-group
SCRIPT_LOCATION=/ess/p22/data/durable/production/sw/utils/vcpipe-utilities/src/clean  # on TSD
SCRIPT_LOCATION=/boston/diag/production/sw/utils/vcpipe-utilities/src/clean  # on NSC
ARCHIVE_LOCATION=/ess/p22/data/durable/production/logs/tsd-cleaning  # on TSD
ARCHIVE_LOCATION=/boston/diag/transfer/clean/archive  # on NSC

Tip

Check the wiki for how to transfer files between TSD and NSC. When uploading files, add --group p22-diag-ous-bioinf-group to the tacl command, the files will be transferred to TSD at /ess/p22/data/durable/file-import/p22-diag-ous-bioinf-group.
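A hedged example of such an upload from sleipnir (the exact tacl invocation may differ; verify with tacl --help or the wiki):

```bash
# Upload to TSD project p22 with the bioinf group set (syntax assumed; verify locally)
tacl p22 --upload candidateFileList_nsc_YYYYMMDD.json --group p22-diag-ous-bioinf-group
```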

Perform the cleaning in the following order:

Note

In the following steps, the file list JSON file and the deleting commands will be generated in the location the scripts are run from.

  1. on TSD: log into any p22-hpc-* VM, e.g. p22-hpc-03 as serviceuser.

  2. on TSD: Create a file list JSON file:

    python3 ${SCRIPT_LOCATION}/createFileList.py \
        --output candidateFileList_tsd_YYYYMMDD.json >candidateFileList_tsd_YYYYMMDD.log
    
  3. on NSC (gdx-login): Find the latest file list JSON file:

    Warning

    Samples that fail are automatically put on the cleaning blacklist of the platform (NSC or TSD) where they were processed. However, if they failed on TSD they must, in addition, be manually put on the NSC cleaning blacklist to prevent the FASTQ files from being deleted on NSC. Such samples must be added to the NSC cleaning blacklist before the candidateFileList_nsc*.json file is (automatically) created. Running analyses must also be added, otherwise the analyses-work folders may be deleted, causing the analyses to crash. Consider blacklisting every project that is not yet delivered.

    The candidateFileList_nsc_YYYYMMDD.json and candidateFileList_nsc_YYYYMMDD.log are created automatically every day and updated every hour by the cron job at /home/nsc-serviceuser01/cron/find_deletion_candidates.sh (do crontab -e as serviceuser to see the schedule). Find the latest files in /boston/diag/transfer/clean.

    To create one manually, run:

    python3 ${SCRIPT_LOCATION}/createFileList.py \
        --output candidateFileList_nsc_YYYYMMDD.json >candidateFileList_nsc_YYYYMMDD.log
    
  4. on NSC (sleipnir): transfer the candidateFileList_nsc_YYYYMMDD.json file to TSD in the folder p22-diag-ous-bioinf-group (see tip above).

  5. on TSD: Create the deleting commands to be used on TSD:

    python3 ${SCRIPT_LOCATION}/createDeleteBash.py \
        --input candidateFileList_tsd_YYYYMMDD.json >deleteCmd_tsd_YYYYMMDD.bash
    
  6. on TSD: Create the deleting commands to be used on NSC:

    python3 ${SCRIPT_LOCATION}/createDeleteBash.py \
        --input ${FILE_IMPORT}/candidateFileList_nsc_YYYYMMDD.json >${FILE_EXPORT}/deleteCmd_nsc_YYYYMMDD.bash
    
  7. on NSC (sleipnir): transfer deleteCmd_nsc_YYYYMMDD.bash to NSC (see tip above).

!!! warning

    Before the following steps, make sure no running analyses are included in the `deleteCmd_PLATFORM_YYYYMMDD.bash` script, where `PLATFORM` is either `tsd` or `nsc`. If any are, edit the script: for instance, `grep -v` the projects that must be kept into an updated delete script. Use the updated script in the next steps, and delete the original script.
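For example (hypothetical project names; inspect the filtered script before using it):

```bash
# Drop all delete commands touching projects that must be kept
grep -v -e 'Diag-excap172' -e 'Diag-wgs22' deleteCmd_tsd_YYYYMMDD.bash \
    > deleteCmd_tsd_YYYYMMDD.filtered.bash
```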

8. on **NSC** (gdx-login): Run the delete bash script:

```bash
bash /boston/diag/transfer/production/nsc_tsd_sync/clean/deleteCmd_nsc_YYYYMMDD.bash
```

!!! warning

    `sleipnir` does not have access to `/boston/diag/production`, so if you run the script from there, many files will not be deleted.
  9. on TSD: Once deleteCmd_nsc_YYYYMMDD.bash has been transferred to NSC and executed, delete ${FILE_IMPORT}/candidateFileList_nsc_YYYYMMDD.json and ${FILE_EXPORT}/deleteCmd_nsc_YYYYMMDD.bash.

  10. on TSD: Run the delete bash script:

    bash deleteCmd_tsd_YYYYMMDD.bash
    
  11. On both NSC and TSD: Make sure the deleteCmd_PLATFORM_YYYYMMDD.bash files used, where PLATFORM is either tsd or nsc, are moved to the respective platform's cleaning ARCHIVE_LOCATION.

The scripts will clean the following locations:

On NSC:

  • /boston/diag/production/data/{samples,analyses-work,analyses-results/singles,analyses-results/trios}
  • /boston/diag/transfer/{normal,high,urgent}/{samples,analyses-results/singles,analyses-results/trios,ella-incoming}
  • /boston/diag/nscDelivery/RUN_DELIVERY/*fastq.gz (if there are no *.fastq.gz files under a RUN_DELIVERY folder, the whole RUN_DELIVERY folder can be deleted)

On TSD:

  • /ess/p22/data/durable/production/data/{samples,analyses-work,analyses-results/singles,analyses-results/trios}
  • /ess/p22/data/durable/s3-api/production/{normal,high,urgent}/{samples,analyses-work,analyses-results/singles,analyses-results/trios,ella-incoming}

Updating Clarity

Go through the steps 'Lims exporter', 'Bioinformatic processing' and 'Bioinformatic QC fail' in Clarity, and make sure that all samples are in the right stage.

Starting/Stopping production services

Only processes not owned by the service user must be started/stopped at the beginning/end of a production shift. The nsc-exporter process should normally be the only process falling into this category. Instructions on how to start/stop this as well as all other services can be found in sections How to start production services and How to stop production services.

Unless logged in as service user, run the following command to have the necessary aliases available:

source {production,transfer/..}/sw/prod-screen-commands.txt

How to start production services

On NSC

The relevant services are lims-exporter, nsc-exporter, executor and webui. To start them, follow the instructions below:

  • Log into the server gdx-login to access the service user script:

    ssh gdx-login.ous.nsc.local
    

    NOTE: if your username on the machine is different from your username on NSC, remember to add <username>@ before the server name. If there are issues with DNS configuration, use the IP-address instead of the server name.

  • For all services except nsc-exporter, log into the dedicated server/VM (see table) as the service user and run the corresponding alias (see the same table) for starting the service.

  • For nsc-exporter, ssh into the dedicated server/VM (see table) as your personal user and run the corresponding alias for starting the service (see the same table).

    NOTE: The setup of tacl for transferring files to TSD requires 2FA which the service user does not have. The files that end up in TSD's transfer areas are owned by nobody:p22-member-group regardless of which user performed the transfer.

  • To access the webUI at localhost:1234 on NSC (once it is started) run:

    ssh {username}@gdx-webui.ous.nsc.local -L 1234:localhost:1234
    

On TSD

The relevant services are filelock-exporter-api, executor, webui and nsc-overview. To start them, follow the instructions below:

  • Log into the dedicated server/VM (see table) as the service user and run the corresponding alias for starting the service (see the same table).

How to stop production services

Before stopping the following services, confirm that they are sleeping. Log into the dedicated server/VM (see table) and run htop (if available on the VM) or ps aux to check whether the particular service is running. Alternatively, as the owner of the process, check the relevant screen session on the dedicated server/VM directly. As any user, tailing the most recent log of the process (see paths below) is also possible. For executor, you also need to check the webUI for any running jobs. No jobs should be running when the executor is killed/stopped.

  • lims-exporter, log at {transfer}/../sw/logsAPI/YYYY-MM-DD_HH-MM-SS.log.

  • nsc-exporter, log at {transfer}/../sw/nsc-exporter-YYYY-MM-01.log.

  • filelock-exporter, log at {production}/logs/tsd-import/filelock-exporter/filelock_exporter.log.

  • executor, log at {production}/logs/variantcalling/vcpipe/executor.log. Check the webUI (see port in table) on the corresponding network (NSC or TSD) to see whether any analyses are running.

As the owner of the process

The owner of all processes except the nsc-exporter is normally the service user. If you are the owner of the process, log into the dedicated server/VM (see table) and run the corresponding alias for stopping the service (see the same table).

As any user other than the owner of the process

If you are not the owner of the production processes, you must create kill files to make the processes stop themselves. Follow the instructions below:

  • ssh into the dedicated server/VM (see table) and run the relevant command below (see name of kill file):

    touch {production}/sw/kill{filelock,executor,webui}
    
    touch {transfer}/../sw/kill-{limsexporter,nscexporter}
    
  • Wait 5+ minutes after touching the kill file. Then, run htop or ps aux to confirm that the given process isn't running. Alternatively, check the most recent log (see path above) or the webUI URL (see table).

  • Finally, remove the kill file.

Errors and exceptions

How to redo demultiplexing

When demultiplexing must be repeated, e.g. due to wrong sample names or any other reasons, go through the following steps in Clarity:

Warning

If the folders of the project from the previous demultiplexing have been removed from the transfer area on NSC, they will automatically be re-queued to the "Lims_exporter_diag" step in Clarity and subsequently transferred to TSD. If this is not the intention, make sure to stop all production services on NSC before redoing the demultiplexing. After finishing the demultiplexing, consider importing specific samples only, and/or move or remove any regenerated files that are not to be processed again before resuming normal production.

Note

When one of the steps in the demultiplexing workflow completes successfully, the subsequent step in the process will automatically run in the background. The automation checks for completed steps every 2 minutes. Refresh the page to see the currently running/completed step. If nothing is running, you can manually start the next step before the automation system does. You can also disable the automation by turning the switches of the subsequent steps to "No" before running the desired step.

  • Under "PROJECTS & SAMPLES", search for the project, open it and find one of the samples in the sequencing run in need of demultiplexing.
  • Open the sample, click requeue (the blue circle arrow beside the step) for the step "Demultiplexing and QC NSC", and click "Demultiplexing and QC NSC". This will open a new view.

    Note

    If the new view does not appear, or it is not clickable, do this instead: click "Demultiplexing and QC NSC" on the requeued step, then "add group", "view icebucket" and "begin work". This should open a clickable view.

  • Click "Run" for the step "10: Copy run". When this is done, click "Run" for the step "20: Prepare SampleSheet". Wait until it is done, refresh the page in the browser (to make sure nothing is running in the background) and immediately proceed with the next bullet point below.

  • Click the file name in the "Demultiplexing sample sheet" under the "Files" section to download the file. Immediately remove the file in "Demultiplexing sample sheet" by clicking the cross. This will ensure that the automated "30: Demultiplexing" step will not execute (it will in fact crash).
  • Correct the information in the sample sheet, and upload the file with the correct information here.
  • Remove the delivery folder containing the files with the wrong information under /boston/diag/nscDelivery.
  • Change the steps "Auto 90. Delivery and triggers" and "Close when finish" from "Yes" to "No", and click "save" on top.
  • Click "Run" for the step "30: Demultiplexing". This will automatically proceed until "Auto 90. Delivery and triggers". For a genome sequencing run, it could take around 2 hours for these steps to finish. Refresh the browser to see which steps are finished.
  • After the steps above are finished, check whether the files under /boston/diag/runs/demultiplexing/{RUN_FOLDER}/Data/Intensities/BaseCalls/{NSC_DELIVERY_FOLDER} contain the right information.
  • Change the steps "Auto 90. Delivery and triggers" and "Close when finish" back to "Yes" and click "save" on top. Then, click "Run" for the step "90: Delivery and triggers".

    Note

    You don't need to click "Run" on "Close when finished". The delivery folder will appear in /boston/diag/nscDelivery, the /boston/diag/runs/demultiplexing will be empty and the samples will appear in the "Lims_exporter_diag" step in Clarity. If the information is still wrong, talk to the production team for possible solutions.

How to import specific samples

By default, all samples in the lims-exporter step in Clarity will be exported. If you only want to export specific samples, stop lims-exporter-api and start it again with a combination of any of these options:

  • --samples:  only export the given sample(s), e.g. 12345678910 or 12345678910,...
  • --projects:  only export samples in these project(s), e.g. Diag-excap172-2019-10-28 or Diag-excap172-2019-10-28,Diag-wgs22-2019-04-05
  • --priorities:  only export samples with the following priorities, e.g. 2 or 2,3

These options are remembered. So to make lims-exporter-api export all samples again, you need to stop it and restart it without any options, and let it run continuously.
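For instance, restarting the exporter restricted to a single project might look like the sketch below; whether the start alias forwards these options to lims-exporter-api is an assumption, so check the alias definition first:

```bash
prod-stop-lims-exporter
# Option forwarding by the alias is assumed; otherwise pass the options to lims-exporter-api directly
prod-start-lims-exporter --projects Diag-excap172-2019-10-28
```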

How to switch between NSC pipeline and TSD pipeline

Samples/analyses can target the TSD pipeline or the NSC pipeline. The default pipeline is specified in the "default_platform" field of the /boston/diag/transfer/sw/lims_exporter_api.config file.

The default pipeline can be overruled by adding project(s) or sample(s) to the platform-whitelist-NSC.txt and platform-whitelist-TSD.txt files in the /boston/diag/transfer/sw folder. In both whitelist files, lines starting with # are treated as comment lines. The format is <project>[-<sampleid>], one per line, e.g.

Diag-wgs72
Diag-wgs72-12345678901

If only a project is given, all samples of the project will be included. Use only the portion of the project name before the 2nd dash, e.g. Diag-EKG210216 instead of Diag-EKG210216-2021-02-16.

Reanalyses will always target the TSD pipeline.

NSC pipeline results (analyses-results folder(s) and ella-incoming folder) are continuously and automatically transferred to TSD by nsc-exporter.

The majority of analyses are run on TSD, but some analyses might need to be run on NSC. These are:

| Sample types | Priority | Description |
|---|---|---|
| exome | 2 and 3* | Whole exome |
| target gene panel | 1, 2 and 3 | Captured by target gene panels |
| genome WGS Trio | 2 and 3* | Rapid genome ("Hurtiggenom") |

* Priority is given by the LIMS or communicated by the lab by other means

"Situations" to consider when deciding:

  • TSD's HPC cluster is very busy (see cluster tools below)
  • The VMware login screen is unaccessible (log into TSD not possible)
  • The file system on TSD is slow / unresponsive
  • The VMs running the automation system (p22-hpc-01, p22-hpc-02, p22-hpc-03) are not available
  • The s3-api folders for transferring data are not available
  • Scheduled maintenance
  • Problems with licence for required tools (e.g. Dragen)

Use the tools pending and qsumm to test the cluster's capacity:

  • pending: gives an indication of when the jobs owned by the bioinformatician on production duty will be launched.
  • qsumm: gives an overview of all jobs pending or running in the Slurm queue.

If the queue is still full by the end of the day, then the samples should be run on the backup pipeline.

How to update the S3 API Key

The long-lived S3 API key must be updated yearly. This key was initially issued on May 13th, 2020.

  • Last update by eivindkb May 3rd, 2024;
  • Next update before May 3rd, 2025.

The procedure for updating this key is:

  1. SSH into sleipnir on NSC and run this command to generate a new key:

    curl https://alt.api.tsd.usit.no/v1/p22/auth/clients/secret \
        --request POST \
        -H "Content-Type: application/json" \
        --data '{
                "client_id": "_our client_id_",
                "client_secret": "_our current api key_"
                }'
    

    !!! note

        Replace _our client_id_ with the client_id in the /ess/p22/data/durable/api-client.md file. Replace _our current api key_ with the text in the /boston/diag/transfer/sw/s3api_key.txt file.

    The above command will print a JSON string with 3 key value pairs:

    {
        "client_id": ...client id here...,
        "old_client_secret": ...old client_secret here...,
        "new_client_secret": ...new client_secret here...
    }
    
  2. Replace the text in /boston/diag/transfer/sw/s3api_key.txt with the "new_client_secret" value above.
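If jq is available, step 2 can be scripted; saving the curl response to response.json first is assumed:

```bash
# Overwrite the stored key with the newly issued secret
jq -r '.new_client_secret' response.json > /boston/diag/transfer/sw/s3api_key.txt
```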

Analysis is not imported into automation system

Start looking in the file system for:

  • whether the files for the analysis in question are in {production}/data/analyses-work/{analysis_name}
  • whether the corresponding samples are in the {production}/data/samples/{sample_name} folders
  • whether there is a file called READY in both the {sample_name} and {analysis_name} folders

If any of those are missing, proceed to investigate the logs of the file lock system (normally located at {production}/logs/tsd-import/filelock-exporter/{datetime}.log) for any clues as to why they were not imported into the system. Create a GitLab issue to describe the problem, so that it can be followed up there.

If all necessary files are present, proceed to investigate the logs in the automation system for any clues as to why the analysis was not imported (see "New/Unknown problem" section below).

annopipe troubleshooting

Force-delivered basepipe/triopipe samples failing QC will go through annopipe, but will have a 'QC failed' tag in the webUI.

If annopipe fails, you can inspect the corresponding Nextflow work folder under the analysis folder. Check the STATUS file to see which step failed, then inspect the log file in the corresponding step folder.

If annopipe runs successfully (irrespective of any "QC failed" tags), the post command will copy the required files to {production}/ella/ella-prod/data/analyses/incoming.

If the sample does not appear in ELLA's database, e.g. because the sample folder was not moved from the incoming folder to the imported folder, you can check the following:

  • whether the production instance of ELLA's analysis-watcher is still running on the Supervisor page (p22-ella-01:9007)
  • whether the sample is excluded from import into ELLA (see if it is listed in {production}/ella/ops/prod-watcher-blacklist.txt)
  • the log file under {production}/ella/ella-prod/logs/prod-watcher.log (prefer not checking logs from Supervisor page)

New/Unknown problem

Start by looking at the analysis' log file. Normally, these are available in the automation system's UI, but in some cases the log in the database can be empty. In such a case, identify the analysis on the file system and look for the log file in its result folder: {production}/data/analyses-work/{analysis_name}/result/{datetime_of_result}/logs/stdout.log. It is also helpful to check whether the number of sequencing reads is too low for the analysis.

If that log doesn't contain any information, there's likely been a problem starting the analysis. Look into the log of the automation system, normally located in {production}/logs/variantcalling/vcpipe/executor.log. grep for the analysis name to try to find the relevant section of the log, and if possible, check that the start time of the analysis in the UI matches the timestamp in the log.
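For example, using the sample naming from the Background section:

```bash
grep -n 'Diag-excap41-12345678910' {production}/logs/variantcalling/vcpipe/executor.log
```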

To investigate the data output of the pipeline, look inside {production}/data/analyses-work/{analysis_name}/result/{datetime_of_result}/data.

Create a GitLab issue to describe the observed problem and consult with the rest of the bioinformatics group to find a solution. All the follow-up will be monitored there.

How to convert BAM files to FASTQ files

For some old samples, if the sample folder is not located in the sample folder archive, it must be created manually. The original sequencing data must be converted from the BAM file used in variant calling in the original analysis (file.bam) by using the following commands:

  1. Run RevertSam to convert BAM to SAM

    ```bash
    picard RevertSam \
        I=file.bam \
        O=file.sam \
        RESTORE_ORIGINAL_QUALITIES=true \
        SORT_ORDER=coordinate \
        CREATE_INDEX=true
    ```

    If the BAM file has already been compressed into a CRAM file, one should run the following _before_ the above:

    ```bash
    samtools view -b -o file.bam -T GENOME_REFERENCE file.cram
    ```

    The GENOME_REFERENCE should be the one used for compressing the BAM file.

  2. Convert the SAM file into FASTQ

    ```bash
    picard SamToFastq \
        I=file.sam \
        FASTQ=file.R1.fastq \
        SECOND_END_FASTQ=file.R2.fastq \
        UNPAIRED_FASTQ=file.R0.fastq
    ```
    

The file.R1.fastq and file.R2.fastq are the corresponding read1 FASTQ file and read2 FASTQ file for the sample.

The sample configuration file (an example is attached; the required fields are described under "Required fields in sample and analysis configuration files" in the Background section) must be created under the individual's sample folder, and the FASTQ files and quality control results (fastqc folders) must be copied into the individual's sample folder as well. The structure of the folder is shown in the "LIMS exporter API" part of the Background section.

Background

  1. sleipnir

    The dedicated file transfer server on the NSC network is called sleipnir. It is a mostly locked-down server, connected to the file lock on TSD only via a dedicated network channel. It only has access to /boston/diag/transfer.

  2. LIMS exporter API

    The LIMS exporter API exports samples from Clarity using Clarity's API. The samples are exported in order of priority and corresponding *.sample and *.analysis files are created inside given repositories.

    The files will be organized as follows:

    repo-analysis
    ├── Diag-excap01-
    │   └── Diag-excap01-123456789.analysis
    └── Diag-excap01-123456789-EEogPU-v
        └── Diag-excap01-123456789-EEogPU-v02.analysis

    repo-sample
    └── Diag-excap01-
        ├── Diag-excap01-123456789-EEogPU-v02-KIT-Av5_GCCAAT_L007_R1_001_fastqc.tar
        ├── Diag-excap01-123456789-EEogPU-v02-KIT-Av5_GCCAAT_L007_R2_001_fastqc.tar
        ├── Diag-excap01-123456789-EEogPU-v02-KIT-Av5_GCCAAT_L007_R1_001.fastq.gz
        ├── Diag-excap01-123456789-EEogPU-v02-KIT-Av5_GCCAAT_L007_R2_001.fastq.gz
        ├── Diag-excap01-123456789.sample
        └── LIMS_EXPORT_DONE
    
    • Clarity exports samples and analyses automatically via lims-exporter-api.

    • lims-exporter-api exports all high priority samples.

    • lims-exporter-api will not export low priority samples when there are high priority samples to be exported.

    • When there are no high priority samples to be exported, lims-exporter-api will export low priority samples little by little. This is to avoid situations in which many low priority samples occupy nsc-exporter for too long, thereby delaying export of any incoming high priority samples.

    • The taqman-source is needed for single samples: the TaqMan files are searched for the one containing the right sample ID, which is then parsed to create a sample-specific fingerprint alongside the fastq.gz data files.

    • The fastq.gz files are hard-linked from the original (to avoid copying).

    • The log file is stored under /boston/diag/transfer/sw/logsAPI and a new file will be generated every time lims-exporter-api is restarted.

    Required fields in sample and analysis configuration files

    Required information in the sample configuration file (.sample file):

    • lane: in the sample FASTQ file name, e.g. 5.

    • reads under stats: obtained by counting the number of lines in the sample FASTQ files and dividing by 4 (see the one-liner after this list).

    • q30_bases_pct under stats: obtained from the Demultiplex_Stats.htm file under the run folder.

    • sequencer_id: in the NSC delivery folder, e.g. D00132.

    • flowcell_id: in the sample QC report file name, e.g. C6HJJANXX.

    • all information under reads.

      The path should be the sample's FASTQ file name.

      The MD5 checksum is calculated by typing the following command in the terminal:

      md5sum <FASTQ_FILE_NAME>
      

      The size can be obtained by typing the following command in the terminal (it is the number before the date):

      ls -l <FASTQ_FILE_NAME>
      
    • project: in the NSC delivery folder, e.g. Diag-excap41.

    • project_date: in the NSC delivery folder, e.g. 2015-03-27.

    • flowcell: in the NSC delivery folder, e.g. B.

    • sample_id: in the sample FASTQ file name, e.g. 12345678910.

    • capturekit: converted from the information in the sample FASTQ file name, e.g. "Av5" is converted to agilent_sureselect_v05, "wgs" remains wgs.

    • sequence_date: in the NSC delivery folder, e.g. 2015-04-14.

    • name: combined project and sample_id joined by -, e.g. Diag-excap41-12345678910.

    • taqman: the name of the file containing the SNP fingerprinting TaqMan results.
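    A one-liner for the reads count mentioned above (assuming gzipped FASTQ files):

      zcat <FASTQ_FILE_NAME> | wc -l | awk '{print $1/4}'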

    Required information in the analysis configuration file (.analysis file):

    • name:

      basepipe: combined "project" and "sample_id" delimited by -, e.g. Diag-excap41-12345678910.

      triopipe: combined "project", proband "sample_id" and TRIO tag delimited by -, e.g. Diag-excap41-12345678910-TRIO.

      annopipe: combined basepipe or triopipe name, gene panel name and gene panel version delimited by -, e.g. Diag-excap41-12345678910-TRIO-Mendel-v01.

    • samples:

      basepipe: only one sample

      triopipe: three samples in trio

      annopipe: one sample or three samples in trio

      The sample name should be the same as the name in the corresponding .sample file.

    • type:

      basepipe: basepipe.

      triopipe: triopipe.

      annopipe: annopipe.

    • taqman in params: equal to false, only in basepipe .analysis file.

    • pedigree in params: only in triopipe and, where applicable, annopipe .analysis files. For each of proband, father and mother, the sample and gender (male or female) must be specified. The sample should be the same as the name in the corresponding .sample file.

    • genepanel in params: only in annopipe .analysis file, combined gene panel name and gene panel version delimited by _, e.g. Mendel_v01.

  3. NSC exporter

    The NSC exporter transfers samples, analyses, analyses-results and ella-incoming folders (produced by the NSC pipeline) from the NSC locations {production}/{urgent,high,normal}/{analyses,samples,analyses-results/{singles,trio},ella-incoming}/ to the TSD S3 API endpoint at /ess/p22/data/durable/s3-api/. The NSC exporter runs continuously and is priority-based, meaning that urgent data are transferred before normal priority data. The log files are stored under /boston/diag/transfer/sw/logsAPI and a new file will be generated at the beginning of every month.

    The NSC exporter can be in different states:

    • Stopped: indicated by the marker file /boston/diag/transfer/sw/NSC-EXPORTER-STOPPED. This file is touched by nsc-exporter when it is stopped.

    • Running and busy: indicated by the marker file /boston/diag/transfer/sw/NSC-EXPORTER-ACTIVE. This file is touched by nsc-exporter when it is transferring data to TSD and removed when it is done.

    • Running and idle: indicated by no marker files, i.e. neither of the two marker files mentioned above exists.
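    A minimal state check based on these marker files:

    ```bash
    # Report the NSC exporter state from its marker files
    if [ -e /boston/diag/transfer/sw/NSC-EXPORTER-STOPPED ]; then
        echo "stopped"
    elif [ -e /boston/diag/transfer/sw/NSC-EXPORTER-ACTIVE ]; then
        echo "running and busy"
    else
        echo "running and idle"
    fi
    ```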

  4. Transferring data from NSC to TSD

    The default route for production is

    lims-exporter-api.sh + nsc-exporter.sh + filelock-exporter-api.sh

    Features: fully automated; priority-based; backup pipelines automated; transferring backup pipeline results to TSD is also automated. This route uses the S3 API to transfer data to TSD; data are written to the S3 API endpoint.

  5. Strategy to choose a single sample in reanalysis via lims-exporter-api when multiple samples match the sample ID

    The samples in the following projects will be ignored:

    • a predefined list of projects (e.g. test projects for testing various lab settings)
    • projects whose name contains the word 'test'
    • reanalysis projects
    • whole genome sequencing projects
    • projects with an improper name (should be in the format: Diag-excapXX-YYYY-MM-DD)

    If there are still multiple samples:

    • choose the samples captured with Av5 KIT
    • choose the samples in the latest project

    If there are still multiple samples matching or no samples were found, lims-exporter-api will send the sample to 'Manager review'.

    When sending to 'Manager review', lims-exporter-api will include the whole list of projects and samples to help the lab find the correct one.

  6. Order of reanalysis

    The request from the lab engineers should contain the following information:

    • The sample ID in the previous analysis (for sample IDs predating the introduction of Verso, the first nine digits are enough because the last digit of the sample ID is incremented upon reanalysis)
    • The gene panel name referred to in the reanalysis
    • The analysis type (trio or single) and the proband's gender if the analysis type is trio

Other documents

  • 08 Forebyggende tiltak, kontinuerlig forbedring, avviks-, klage- og skadeoppfølging - AMG (eHåndbok, ID 4965)
  • HTS - Custom genpanel og reanalyse av sekvensdata (eHåndbok, ID 121052)
  • HTS - Mismatch between TaqMan SNP-ID and sequencing data (eHåndbok, ID 90647)
  • HTS - Overordnet beskrivelse av arbeidsflyt (eHåndbok, ID 76458)
  • HTS - Use of reference materials for internal quality control (eHåndbok, ID 105870)
  • HTS - Samples that fail QC in bioinformatic pipeline (eHåndbok, ID 128858)
  • HTS Bioinf - Basepipe pipeline
  • HTS Bioinf - Trio pipeline
  • HTS Lab - TaqMan SNP-ID (eHåndbok, ID 56898)