HTS Bioinf - Update Databases

Scope

This document describes the procedures for generating external annotation data sets for anno.

Data sources

The instructions for generating the annotation data sets are coordinated by ella-anno/opts/datasets.json and anno-targets/datasets.json. The table below shows which repository is responsible for each data source, together with the agreed update frequency.

Data sources                             | Repository   | Updates
---------------------------------------- | ------------ | -----------
ClinVar (and PubMed db)                  | ella-anno    | monthly
HGMD (and PubMed db)                     | anno-targets | quarterly
wgsDB                                    | anno-targets | quarterly
VEP                                      | ella-anno    | yearly
inDB                                     | anno-targets | yearly
gnomAD, SeqRepo, UTA, RefSeq             | ella-anno    | irregularly
gnomAD-MT, gnomAD-SV, SweGen-SV, AnnotSV | anno-targets | irregularly

For data sources with irregular updates, we will check every quarter for new releases, and update whenever there is a suitable and significant one.

Update Procedure

Credentials

DigitalOcean (DO)

To perform the steps of this procedure, a "Personal access token" for access to the DigitalOcean OUSAMG project is required. Store it in a file on your file system (e.g. $HOME/.digital_ocean/do_creds), formatted as

SPACES_KEY="<key>"
SPACES_SECRET="<secret>"

Directions for creating these credentials are available here.

HGMD

The easiest way to supply HGMD credentials is to append them to the DigitalOcean credentials file, e.g. $HOME/.digital_ocean/do_creds.

HGMD_USER="<user>"
HGMD_PASSWORD="<password>"

NCBI API

An NCBI API token should also be obtained before starting the update. Follow the instructions here and here. Then export ENTREZ_API_KEY, either in your terminal or in your .bashrc file.
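Before starting an update, it can help to confirm that the credentials described above are in place. The following is a minimal sketch, not part of ella-anno or anno-targets: the function name check_update_credentials is hypothetical, and treating all four file entries plus ENTREZ_API_KEY as required is an assumption (HGMD and NCBI credentials may only matter for the corresponding data sources).

```shell
# Hypothetical helper (not provided by ella-anno or anno-targets).
# Reports which of the credentials described above are missing.
# Usage: check_update_credentials [path/to/do_creds]
check_update_credentials() {
    creds_file="${1:-$HOME/.digital_ocean/do_creds}"
    missing=0

    if [ ! -r "$creds_file" ]; then
        echo "cannot read credentials file: $creds_file" >&2
        return 1
    fi

    # Look for KEY= lines with grep instead of sourcing the file,
    # so nothing inside the file is executed.
    for key in SPACES_KEY SPACES_SECRET HGMD_USER HGMD_PASSWORD; do
        if ! grep -q "^${key}=" "$creds_file"; then
            echo "missing ${key} in $creds_file" >&2
            missing=1
        fi
    done

    # ENTREZ_API_KEY is expected in the environment, not in the file.
    if [ -z "${ENTREZ_API_KEY:-}" ]; then
        echo "ENTREZ_API_KEY is not exported" >&2
        missing=1
    fi

    return "$missing"
}
```

Calling check_update_credentials with no argument checks the example path $HOME/.digital_ocean/do_creds; a non-zero exit status means at least one entry was not found.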

Automatic data generation

  1. Clone the relevant repository (i.e., ella-anno or anno-targets, refer to the table above).

  2. Update datasets.json with the version you wish to generate. If required (which is rare), modify the generate commands accordingly.

  3. Run make build-annobuilder.

  4. Run make generate-[amg-]package PKG_NAME=<package_name> (include amg- for anno-targets data sources; check the Makefile if in doubt, or try make help).

    For ClinVar updates, prepend ENTREZ_API_KEY="<key>" to the make command.

  5. Run make upload-[amg-]package PKG_NAME=<package_name> DO_CREDS=$HOME/.digital_ocean/do_creds (include amg- for anno-targets data sources; check the Makefile if in doubt, or try make help).

    For ClinVar updates, prepend ENTREZ_API_KEY="<key>" to the make command.

    For HGMD updates you will need to supply the location of the reference FASTA file as FASTA=/path/to/fasta. If you do not have this file locally, use the following make command to download it from DO:

    make download-anno-package PKG_NAME=fasta DO_CREDS=$HOME/.digital_ocean/do_creds
    
  6. Commit and push the changes to datasets.json in an aptly named branch (refer to a pre-existing issue in the respective repository if applicable) and open a merge request (MR). Use the merge request template data_mr_template, which proposes basic sanity checks for the newly generated data.

  7. Once the MR is approved, merge your branch into dev.

  8. After merging, follow the Release and deploy procedure for anno system.
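The make steps above differ only in the amg- infix and, for ClinVar, the ENTREZ_API_KEY prefix. As a rough sketch, the hypothetical helper below (print_update_commands is not part of either repository) prints the commands for a given repository and package so they can be reviewed before being run by hand. It assumes the ClinVar package is named clinvar, and it leaves out the FASTA argument needed for HGMD updates.

```shell
# Hypothetical helper: prints (does not run) the make commands for one
# package update, following steps 3 to 5 above.
# Usage: print_update_commands <ella-anno|anno-targets> <package_name> [do_creds]
print_update_commands() {
    repo="$1"
    pkg="$2"
    creds="${3:-$HOME/.digital_ocean/do_creds}"

    # anno-targets data sources use the amg- variants of the targets.
    case "$repo" in
        ella-anno)    infix="" ;;
        anno-targets) infix="amg-" ;;
        *) echo "unknown repository: $repo" >&2; return 1 ;;
    esac

    # Assumption: the ClinVar package is called "clinvar".
    # "<key>" is a placeholder to be replaced with the real NCBI API key.
    prefix=""
    if [ "$pkg" = "clinvar" ]; then
        prefix='ENTREZ_API_KEY="<key>" '
    fi

    printf '%s\n' "make build-annobuilder"
    printf '%s\n' "${prefix}make generate-${infix}package PKG_NAME=${pkg}"
    printf '%s\n' "${prefix}make upload-${infix}package PKG_NAME=${pkg} DO_CREDS=${creds}"
}
```

For example, print_update_commands anno-targets wgsdb would list the generate-amg-package and upload-amg-package commands; the package name wgsdb here is illustrative and should be checked against datasets.json.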

Update literature reference database

In ELLA, we aim to keep data for all PubMed references present in either HGMD or ClinVar. The PubMed IDs are written as line-separated text files in the HGMD and ClinVar data directories.

  1. Clone the anno-targets repository
  2. make download-amg-package PKG_NAME=hgmd
  3. make download-package PKG_NAME=clinvar
  4. cat anno-data/variantDBs/*/*_pubmed_ids.txt | sort -n | uniq >pubmed_ids.txt
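To illustrate what the merge in step 4 produces, here is a small sketch that wraps the same pipeline in a function so it can be tried on a scratch directory first. The function name merge_pubmed_ids is hypothetical; the directory layout is taken from the glob in step 4.

```shell
# Hypothetical wrapper around the pipeline from step 4.
# Concatenates every *_pubmed_ids.txt one directory level below the given
# data directory, sorts the IDs numerically and drops duplicates.
# Usage: merge_pubmed_ids [data_dir] > pubmed_ids.txt
merge_pubmed_ids() {
    data_dir="${1:-anno-data/variantDBs}"
    cat "$data_dir"/*/*_pubmed_ids.txt | sort -n | uniq
}
```

An ID that appears in both the HGMD and the ClinVar file ends up only once in the merged list, which is what keeps the later fetch from requesting the same reference twice.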

The next steps are to download reference details for all these PubMed ids:

  1. Preparation. Because some of the operations below use git submodule under the hood, it is recommended to set up your SSH agent in advance, e.g.

    eval $(ssh-agent -s)
    ssh-add
    
  2. Clone the ELLA repository

  3. Copy pubmed_ids.txt into the ELLA directory
  4. make build; make dev; make shell (if you run into permission issues here, run chmod -R a+rwX /storage/ella)
  5. ella-cli references fetch pubmed_ids.txt (this will take some time)
  6. Import the file created in the previous step (references_YYMMDD.txt) into TSD

Finally, deposit the references in the ELLA database:

# TSD
/tsd/p22/data/durable/production/ella/ops/prod-cli.sh
ella-cli deposit references "<path to references_YYMMDD.txt>"