HTS Bioinf - Update of external annotation data

Scope

This document describes the procedures for generating external annotation data sets for Anno.

Data sources

The generation of each annotation data set is driven by the datasets.json files in the Anno and Anno-targets repositories. The repository responsible for each data source is shown in the table below, together with the agreed update frequency.

| Data source | Repository | Updates |
| --- | --- | --- |
| ClinVar (and PubMed db) | Anno | monthly |
| gnomAD, SeqRepo, UTA, RefSeq | Anno | irregularly |
| gnomAD-MT, gnomAD-SV, SweGen-SV, AnnotSV | Anno-targets | irregularly |
| HGMD (and PubMed db) | Anno-targets | quarterly |
| inDB-WES | Anno-targets | yearly |
| inDB-WGS | Anno-targets | frozen |
| inDB-WGSX | Anno-targets | quarterly |
| VEP | Anno | yearly |

For data sources with irregular updates, we check for new releases every quarter and update whenever a suitable, significant release is available.

Update Procedure

Credentials

All credentials required by Anno to manage external annotation data sets are expected to be stored in a credentials file, provided to Anno's Makefile via the environment variable DB_CREDS.

Tip

If you are updating from Hetzner, you do not need to create your own db_creds file. A shared file with regularly updated keys is located on Hetzner at /storage/ops-common/.db_creds. If you use it, remember to replace $HOME/.db_creds with /storage/ops-common/.db_creds throughout the rest of the procedure.

DigitalOcean (DO)   -   To download and upload data to DigitalOcean's OUSAMG project, a DO access key and its corresponding secret are required. Directions for generating these credentials are available here. Store your key and secret as environment variables in the DB_CREDS file (which we assume will be set to $HOME/.db_creds) as follows:

SPACES_KEY=your-digitalocean-key
SPACES_SECRET=your-digitalocean-secret

HGMD   -   HGMD credentials are required to download HGMD data. Store your HGMD user name and password as environment variables in the DB_CREDS file as follows:

HGMD_USER=your-hgmd-username
HGMD_PASSWORD=your-hgmd-password

NCBI   -   An ENTREZ API token is necessary to download bulk NCBI data. Follow the instructions here and here to obtain a token and add it to the DB_CREDS file as follows:

ENTREZ_API_KEY=your-entrez-api-key
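
Taken together, a complete DB_CREDS file covering all three services looks like the sketch below (placeholder values; replace each with your real key, secret, or token). Since the file holds secrets, restrict its permissions to your own user.

```shell
# Write a complete credentials file with placeholder values.
cat > "$HOME/.db_creds" <<'EOF'
SPACES_KEY=your-digitalocean-key
SPACES_SECRET=your-digitalocean-secret
HGMD_USER=your-hgmd-username
HGMD_PASSWORD=your-hgmd-password
ENTREZ_API_KEY=your-entrez-api-key
EOF

# The file contains secrets: make it readable by the owner only.
chmod 600 "$HOME/.db_creds"
```

You only need the variables for the services a given update actually touches; unused entries are ignored.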

Automatic data generation

In the following, we assume a credentials file .db_creds exists in $HOME.
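
Before starting, it can be worth sanity-checking the credentials file with a small script along these lines (illustrative only; the variable names are the ones listed in the Credentials section above):

```shell
# Check that the credentials file defines the expected variables and
# warn about any that are missing. Defaults to $HOME/.db_creds.
DB_CREDS="${DB_CREDS:-$HOME/.db_creds}"
missing=0

# Source the file, exporting its variables so child processes (make) see them.
[ -f "$DB_CREDS" ] && { set -a; . "$DB_CREDS"; set +a; }

for var in SPACES_KEY SPACES_SECRET HGMD_USER HGMD_PASSWORD ENTREZ_API_KEY; do
  eval "val=\${$var:-}"
  if [ -z "$val" ]; then
    echo "warning: $var is not set in $DB_CREDS"
    missing=$((missing + 1))
  fi
done
echo "$missing variable(s) missing"
```

A warning is not necessarily a problem: only the variables for the data source you are updating need to be set.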

Tip

Some of the steps below may be resource-demanding and time-consuming. Consider generating the data on a development server. If you do, do not store the data in your $HOME directory.

  1. Clone the relevant repository (i.e., ella-anno or anno-targets, refer to the table above).

  2. Update datasets.json with the version you wish to generate. If required (which is rare), modify the generate commands accordingly.

  3. make build-annobuilder.

  4. make generate[-amg]-package DB_CREDS=$HOME/.db_creds PKG_NAME=<package name> (include -amg for anno-targets data sources, check the Makefile if in doubt, make help may help).

  5. make upload[-amg]-package DB_CREDS=$HOME/.db_creds PKG_NAME=<package_name> (include -amg for anno-targets data sources, check the Makefile if in doubt, make help may help).

    For HGMD updates you will need to supply the location of the reference FASTA file as FASTA=/path/to/fasta. The following make command can be used to download it from DO.

    make download-anno-package DB_CREDS=$HOME/.db_creds PKG_NAME=fasta
    
  6. Commit and push the changes to datasets.json in an aptly named branch (refer to a pre-existing issue in the respective repository if applicable) and file an MR. Use the merge request template data_mr_template, which proposes basic sanity checks for the newly generated data.

  7. Once the MR is approved, merge your branch into dev.

  8. After merging, follow the Release and deploy procedure for anno system.
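
Put together, steps 3-5 for a hypothetical HGMD update in anno-targets might look like the sketch below. The target names follow the generate[-amg]/upload[-amg] conventions described above; the guard makes the sketch a no-op when run outside a repository checkout.

```shell
# Run from the root of an anno-targets checkout. The grep guard keeps
# this sketch from doing anything outside such a checkout.
if [ -f Makefile ] && grep -q 'generate-amg-package' Makefile; then
  in_repo=yes
else
  in_repo=no
fi

if [ "$in_repo" = yes ]; then
  make build-annobuilder
  # HGMD needs the reference FASTA; download it from DO first.
  make download-anno-package DB_CREDS="$HOME/.db_creds" PKG_NAME=fasta
  make generate-amg-package DB_CREDS="$HOME/.db_creds" PKG_NAME=hgmd
  make upload-amg-package DB_CREDS="$HOME/.db_creds" PKG_NAME=hgmd FASTA=/path/to/fasta
else
  echo "Not inside an anno/anno-targets checkout; skipping make targets."
fi
```

For Anno (non-amg) data sources, drop the -amg suffix and the FASTA variable; make help lists the available targets.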

Update literature reference database

In ELLA, we aim to keep reference data for all PubMed references present in either HGMD or ClinVar. The PubMed ids are provided as line-separated text files in the HGMD and ClinVar data directories.

  1. Clone the anno-targets repository
  2. make download-amg-package PKG_NAME=hgmd DB_CREDS=$HOME/.db_creds
  3. make download-anno-package PKG_NAME=clinvar DB_CREDS=$HOME/.db_creds

Go through the next steps to download reference details for all PubMed ids:

  1. Preparation. Because some of the operations below use git submodule under the hood, it is recommended to set up your ssh-agent in advance, e.g.

    eval $(ssh-agent -s)
    ssh-add
    
  2. Clone the ELLA repository if you haven't done this before, and change into it

  3. Concatenate the PubMed ids (one per line, removing duplicates) into a text file in the ELLA directory, e.g. sort -un <path-to-anno-data>/variantDBs/*/*_pubmed_ids.txt > pubmed_ids.txt

  4. Access ella-cli via Docker container: docker compose run -e LOGPATH=/tmp -u $(id -u):$(id -g) --no-deps -it --entrypoint /bin/bash --build apiv1

  5. Change to the root /ella directory and run ella-cli references fetch pubmed_ids.txt (this will take some time)

  6. Exit the container

  7. Import the file created in the previous step (references_YYMMDD.txt) to TSD (see the wiki for tacl usage instructions)

  8. Deposit the references in the ELLA database:

    # TSD
    task --dir /ess/p22/data/durable/production/ella/ops prod:cli
    ella-cli deposit references "<path to references_YYMMDD.txt>"
    
  9. Delete the imported file used to deposit the references
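
The id-concatenation in step 3 above can be tried in isolation with fabricated data. The sketch below uses placeholder paths and made-up PubMed ids, with one id shared between the two sources to show the deduplication:

```shell
# Fabricated per-source PubMed id lists (placeholder paths, not the real
# anno data directories); id 205 appears in both sources.
workdir="$(mktemp -d)"
printf '101\n205\n307\n' > "$workdir/clinvar_pubmed_ids.txt"
printf '205\n412\n'      > "$workdir/hgmd_pubmed_ids.txt"

# sort -u drops duplicates, -n sorts numerically (PubMed ids are integers).
sort -un "$workdir"/*_pubmed_ids.txt > "$workdir/pubmed_ids.txt"

cat "$workdir/pubmed_ids.txt"   # 101 205 307 412, one per line
```

The resulting file has four unique ids; the duplicate 205 is kept only once.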