HTS Bioinf - Update Databases
Scope
This document describes the procedures for generating external annotation data sets for anno.
Data sources
Data source generation is coordinated via the ella-anno/opts/datasets.json and anno-targets/datasets.json files. The table below shows how the data sources are distributed between these repositories, together with the agreed update frequencies.
Data sources | Repository | Updates |
---|---|---|
ClinVar (and pubmed db) | ella-anno | monthly |
HGMD (and pubmed db) | anno-targets | quarterly |
wgsDB | anno-targets | quarterly |
inDB | anno-targets | yearly |
VEP | ella-anno | yearly |
gnomAD, SeqRepo, UTA, RefSeq | ella-anno | irregularly |
gnomad_sv, swegen_sv, annotsv | anno-targets | irregularly |
gnomad_mt | anno-targets | irregularly |
For the data sources with irregular updates, we check every quarter for new releases and update whenever a suitable and significant release is available.
Update Procedure
Credentials for DO
To perform the steps of this procedure, a "Personal access token" for the DigitalOcean OUSAMG project is required. The credentials should be stored in a file on your disk (e.g. $HOME/.digital_ocean/do_creds), in the format expected by the make targets that consume it.
Directions on creating these credentials are available here.
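The exact contents of this file are defined by the repositories' make targets; a plausible sketch, assuming DigitalOcean Spaces-style key names (not verified against the Makefile), is:

```
SPACES_KEY=<your access key>
SPACES_SECRET=<your secret key>
```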
Credentials for NCBI API
An NCBI API token should also be obtained before starting the update. Follow the instructions here and here.
Then export ENTREZ_API_KEY either in your terminal session or in your .bashrc file.
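For example (the value below is a placeholder, not a real token):

```shell
# Make the key available to the make commands that need it;
# replace the placeholder value with your own NCBI token.
export ENTREZ_API_KEY="<your NCBI API key>"
```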
Credentials for HGMD
The easiest way to supply HGMD credentials is to append them to the credentials file created above (e.g. $HOME/.digital_ocean/do_creds).
Automatic data generation
- Clone the relevant repository (i.e. `ella-anno` or `anno-targets`; refer to the table above).
- Update `datasets.json` with the version you wish to generate. If required (which is rare), modify the generate commands accordingly.
- Run `make build-annobuilder`.
- Run `make generate-(amg-)package PKG_NAME=<package name>` (use the `amg` variant for `anno-targets`; check the `Makefile` if in doubt, i.e. `make help`).
  - For ClinVar updates, prepend `ENTREZ_API_KEY=<key>` to the make command.
- Run `make upload-(amg-)package PKG_NAME=<package_name> DO_CREDS=$HOME/.digital_ocean/do_creds` (use the `amg` variant for `anno-targets`; check the `Makefile` if in doubt).
  - For ClinVar updates, prepend `ENTREZ_API_KEY=<key>` to the make command.
  - For HGMD updates, you will need to supply the location of the reference FASTA file as `FASTA=/path/to/fasta`. If you do not have this file locally, use the corresponding make target to download it from DO.
- Commit and push the changes to `datasets.json` in a properly named branch. For this, it is preferred to create an MR from an already existing issue in the respective repository. Use the merge request template `data_mr_template`, which proposes basic sanity checks for the newly generated data.
- Merge the MR into `dev`.
- After the MR is merged, follow the Release and deploy procedure for the anno system.
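Put together, a ClinVar update in `ella-anno` could look roughly like the following (the branch name, commit message, and placeholder key are illustrative assumptions; check `make help` for the authoritative targets):

```shell
git clone https://gitlab.com/alleles/ella-anno && cd ella-anno
# bump the ClinVar version in datasets.json, then:
make build-annobuilder
ENTREZ_API_KEY=<key> make generate-package PKG_NAME=clinvar
ENTREZ_API_KEY=<key> make upload-package PKG_NAME=clinvar \
    DO_CREDS=$HOME/.digital_ocean/do_creds
# commit the datasets.json change on a named branch and open an MR
git checkout -b update-clinvar
git commit -am "Update ClinVar" && git push -u origin update-clinvar
```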
Update literature reference database
In ELLA, we aim to keep data for all PubMed references present in either HGMD or ClinVar. These pubmed ids are generated as line-separated text files in the HGMD and ClinVar data directories.
- Clone the anno-targets repository: https://gitlab.com/alleles/anno-targets
- `make download-amg-package PKG_NAME=hgmd`
- `make download-package PKG_NAME=clinvar`
- `cat anno-data/variantDBs/*/*_pubmed_ids.txt | sort -n | uniq > pubmed_ids.txt`
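As a self-contained illustration of what the merge step produces, the following sketch runs the same pipeline on two synthetic id files (the directory layout and ids are made up for the example):

```shell
# Two synthetic id files mimicking the HGMD and ClinVar layouts
mkdir -p demo/variantDBs/hgmd demo/variantDBs/clinvar
printf '100\n200\n300\n' > demo/variantDBs/hgmd/hgmd_pubmed_ids.txt
printf '200\n400\n' > demo/variantDBs/clinvar/clinvar_pubmed_ids.txt

# Numeric sort plus uniq collapses ids shared between the two sources
cat demo/variantDBs/*/*_pubmed_ids.txt | sort -n | uniq > pubmed_ids.txt
cat pubmed_ids.txt   # -> 100, 200, 300, 400, one id per line
```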
The next steps download reference details for all these pubmed ids:
- Preparation: because some of the operations below use `git submodule` under the hood, it is recommended to set up your `ssh` configuration in advance.
- Clone the ELLA repository: https://gitlab.com/alleles/ella
- Copy `pubmed_ids.txt` into the ELLA folder.
- Run `make build; make dev; make shell`. If you run into permission issues here, do `chmod -R a+rwX /storage/ella`.
- Run `ella-cli references fetch pubmed_ids.txt` (this will take some time).
- Import the file created in the previous step (`references_YYMMDD.txt`) to TSD.
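The ELLA-side steps above can be condensed into one shell session (the location of `pubmed_ids.txt` is an assumption):

```shell
git clone https://gitlab.com/alleles/ella && cd ella
cp ../pubmed_ids.txt .
make build; make dev; make shell
# inside the container shell:
ella-cli references fetch pubmed_ids.txt   # this will take some time
# the resulting references_YYMMDD.txt is then imported to TSD
```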
Finally, deposit the references in the ELLA database: