HTS Bioinf - Update of external annotation data
Scope
This document describes the procedures for generating external annotation data sets for Anno.
Data sources
The instructions for generating the annotation data sets are coordinated via the `datasets.json` files of the Anno and Anno-targets repositories. The repository of competence for each data source is shown in the table below, together with the agreed update frequency.
| Data source | Repository | Updates |
|---|---|---|
| AlphaMissense | ELLA Anno | reference-dependent |
| ClinVar (and PubMed db) | ELLA Anno | monthly |
| ClinVar-MT | Anno-targets | monthly |
| FASTA (nuclear/mito) | ELLA Anno | reference-dependent |
| gnomAD, SeqRepo, UTA, RefSeq | ELLA Anno | irregularly |
| gnomAD-MT, gnomAD-SV, SweGen-SV, AnnotSV | Anno-targets | irregularly |
| HGMD (and PubMed db) | Anno-targets | quarterly |
| inDB-Mito | Anno-targets | yearly |
| inDB-WES | Anno-targets | frozen |
| inDB-WGS | Anno-targets | frozen |
| inDB-WGSX | Anno-targets | yearly |
| liftOver | ELLA Anno | reference-dependent |
| mitomapCPM | Anno-targets | yearly |
| REVEL | ELLA Anno | yearly |
| SegDups | ELLA Anno | reference-dependent |
| SpliceAI | ELLA Anno | one-time |
| STRexp | Anno-targets | yearly |
| VEP | ELLA Anno | yearly |
For data sources with irregular updates, we check every quarter for new releases and incorporate any that are suitable and significant.
Update Procedure
Credentials
All credentials required by Anno to manage external annotation data sets are expected to be stored in a credentials file, provided to Anno's Makefile via the environment variable `DB_CREDS`. The common credentials file at `/storage/ops-common/.db_creds` on the Hetzner development server should be kept up to date.
DigitalOcean (DO)
To download and upload data to DigitalOcean's OUSAMG project, a DO access key and its corresponding secret are required. Directions for generating these credentials are available here. Store your key and secret as environment variables in the `DB_CREDS` file (which we assume is set to `/storage/ops-common/.db_creds`) as follows:
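The original example is not reproduced here; the following is a minimal sketch of the relevant `DB_CREDS` entries. The variable names `DO_SPACES_KEY` and `DO_SPACES_SECRET` are assumptions; confirm the exact names against Anno's `Makefile`.

```shell
# Hypothetical variable names; confirm against Anno's Makefile.
export DO_SPACES_KEY="<your-DO-access-key>"
export DO_SPACES_SECRET="<your-DO-secret>"
```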
HGMD
HGMD credentials are required to download HGMD data. Store your HGMD user name and password as environment variables in the `DB_CREDS` file as follows:
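The original example is not reproduced here; a minimal sketch, assuming the variable names `HGMD_USER` and `HGMD_PASSWORD` (confirm against Anno's `Makefile`):

```shell
# Hypothetical variable names; confirm against Anno's Makefile.
export HGMD_USER="<your-HGMD-user-name>"
export HGMD_PASSWORD="<your-HGMD-password>"
```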
NCBI
An ENTREZ API token is necessary to download bulk NCBI data. Follow the instructions here and here to obtain a token and add it to the `DB_CREDS` file as follows:
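The original example is not reproduced here; a minimal sketch, assuming the variable name `ENTREZ_API_KEY` (confirm against Anno's `Makefile`):

```shell
# Hypothetical variable name; confirm against Anno's Makefile.
export ENTREZ_API_KEY="<your-ENTREZ-api-token>"
```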
Automatic data generation
- Run `ssh hetzner` from your local machine and navigate to `/storage/ops-common`.

Tip
If necessary, the update can be run from anywhere using your own credentials, provided you have enough storage space for the data.
- Clone the relevant repository (i.e., `ella-anno` or `anno-targets`; refer to the table above), or make sure your checkout is up to date with the remote.
- Update `datasets.json` with the version you wish to generate. If required (which is rare), modify the `generate` commands accordingly.
- Run `make build-annobuilder`.
- Run `make generate-{amg,anno}-package DB_CREDS=/storage/ops-common/.db_creds PKG_NAME=<package name>` (use `-amg` for `anno-targets` data sources; check the `Makefile` if in doubt, `make help` may help).
- Run `make upload-{amg,anno}-package DB_CREDS=/storage/ops-common/.db_creds PKG_NAME=<package_name>` (use `-amg` for `anno-targets` data sources; check the `Makefile` if in doubt, `make help` may help). For HGMD updates you will need to supply the location of the reference FASTA file as `FASTA=/path/to/fasta`. The following `make` command can be used to download it from DO.
- Commit and push the changes to `datasets.json` in an aptly named branch (refer to a pre-existing issue in the respective repository, if applicable) and file an MR. Use the merge request template data_mr_template, which proposes basic sanity checks for the newly generated data.
- Once the MR is approved, merge your branch into `dev`.
- After merging, follow the Release and deploy procedure for the anno system.
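The generate-and-upload sequence above can be sketched as a small dry-run helper that prints the `make` invocations for a given package. The helper function and the package name `clinvar-mt` are illustrative only; real package names are listed in `datasets.json`.

```shell
#!/bin/sh
# Dry-run sketch of the generate/upload steps; prints the commands only.
# "amg" is for anno-targets data sources, "anno" for ella-anno ones.
update_package() {
  kind="$1"   # amg | anno
  pkg="$2"    # package name as listed in datasets.json
  creds=/storage/ops-common/.db_creds
  echo "make build-annobuilder"
  echo "make generate-${kind}-package DB_CREDS=${creds} PKG_NAME=${pkg}"
  echo "make upload-${kind}-package DB_CREDS=${creds} PKG_NAME=${pkg}"
}

update_package amg clinvar-mt
```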
Update the literature reference database
In ELLA, we aim to keep data for all PubMed references present in either HGMD or ClinVar. The corresponding PubMed ids are generated as line-separated text files in the HGMD and ClinVar data directories.
- Clone the `anno-targets` repository.
- Download the HGMD and ClinVar packages:
  `make download-amg-package PKG_NAME=hgmd DB_CREDS=/storage/ops-common/.db_creds`
  `make download-anno-package PKG_NAME=clinvar DB_CREDS=/storage/ops-common/.db_creds`
Go through the next steps to download reference details for all PubMed ids:

- Preparation: because some of the operations below use `git submodule` under the hood, it is recommended to set up your `ssh` in advance.
- Clone the ELLA repository if you haven't done this before, and change into it.
- Concatenate the PubMed ids (one per line, removing duplicates) into a text file in the ELLA directory, e.g. `sort -un <path-to-anno-data>/variantDBs/*/*_pubmed_ids.txt > pubmed_ids.txt`
- Access `ella-cli` via a Docker container: `docker compose run -e LOGPATH=/tmp -u $(id -u):$(id -g) --no-deps -it --entrypoint /bin/bash --build apiv1`
- Change to the root `/ella` directory and run `ella-cli references fetch pubmed_ids.txt` (this will take some time).
- Exit the container.
- Import the file created in the previous step (`references-YYMMDD.txt`) to TSD (see wiki for tacl usage instructions).
- Deposit the references in the ELLA database:
- Delete the file used to deposit the references.
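The deduplication behaviour of the `sort -un` concatenation step above can be checked on toy data; the file names and the `/tmp/pmdemo` path below are examples only.

```shell
# Toy demonstration of the PubMed-id concatenation step:
# -u keeps each id once, -n sorts numerically.
mkdir -p /tmp/pmdemo
printf '100\n200\n' > /tmp/pmdemo/a_pubmed_ids.txt
printf '200\n50\n' > /tmp/pmdemo/b_pubmed_ids.txt
sort -un /tmp/pmdemo/*_pubmed_ids.txt > /tmp/pmdemo/pubmed_ids.txt
cat /tmp/pmdemo/pubmed_ids.txt
```

The output lists each id exactly once, in numeric order (50, 100, 200).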