Annotating your VCF with VEP¶
The input to the pVACseq pipeline is a VEP-annotated VCF. This will add consequence, transcript, and gene information to your VCF.
Installing VEP¶
To download and install the VEP command line tool follow the VEP installation instructions.
We recommend the use of the VEP cache for your annotation. The VEP cache can be downloaded following these VEP cache installation instructions. Please ensure that the Ensembl cache version matches the reference build and Ensembl version used in other parts of your analysis (e.g. for RNA-seq gene/transcript abundance estimation).
Download the VEP plugins from the GitHub repository by cloning the repository:
git clone https://github.com/Ensembl/VEP_plugins.git
Copy the Wildtype and Frameshift plugins provided with the pVACseq package to the folder with the other VEP plugins by running the following command:
pvacseq install_vep_plugin <VEP plugins directory>
Running VEP¶
Example VEP Command
./vep \
--input_file <input VCF> --output_file <output VCF> \
--format vcf --vcf --symbol --terms SO --tsl --biotype \
--hgvs --fasta <reference build FASTA file location> \
--offline --cache [--dir_cache <VEP cache directory>] \
--plugin Frameshift --plugin Wildtype \
[--dir_plugins <VEP_plugins directory>] [--pick] [--transcript_version]
Required VEP Options¶
--format vcf
--vcf
--symbol
--terms SO
--tsl
--biotype
--hgvs
--fasta <reference build FASTA location>
--offline
--cache
--plugin Frameshift
--plugin Wildtype
The
--format vcf
option specifies that the input file is in VCF format.The
--vcf
option will result in the output being written in VCF format.The
--symbol
option will include gene symbol in the annotation.The
--terms SO
option will result in Sequence Ontology terms being used for the consequences.The
--tsl
option adds transcript support level information to the annotation.The
--biotype
option adds biotype of the transcript or regulatory feature to the annotation.The
--hgvs
option will result in HGVS identifiers being added to the annotation.Using the
--hgvs
option requires the usage of the--fasta
argument to specify the location of the reference genome build FASTA file.The
--offline
option will eliminate all network connections for speed and/or privacy.The
--cache
option will result in the VEP cache being used for annotation.The
--plugin Frameshift
option will run the Frameshift plugin which will apply a frameshift mutation to a transcript sequence to compute the full mutated protein sequence.The
--plugin Wildtype
option will run the Wildtype plugin which will include the transcript protein sequence in the annotation.
Useful VEP Options¶
--dir_cache <VEP cache directory>
--dir_plugins <VEP_plugins directory>
--pick
--transcript_version
The
--dir_cache <VEP cache directory>
option may be needed if the VEP cache was downloaded to a different location than the default. The default location of the VEP cache is at$HOME/.vep
.The
--dir_plugins <VEP_plugins directory>
option may need to be set depending on where the VEP_plugins were installed to.The
--pick
option might be useful to limit the annotation to the “top” transcript for each variant (the one for which the most dramatic consequence is predicted). Otherwise, VEP will annotate each variant with all possible transcripts. pVACseq will provide predictions for all transcripts in the VEP CSQ field. Running VEP without the--pick
option can therefore drastically increase the runtime of pVACseq.The
--transcript_version
option will add the transcript version to the transcript identifiers. This option might be needed if you intend to annotate your VCF with expression information. Particularly if your expression estimation tool uses versioned transcript identifiers (e.g. ENST00000256474.2).
Additional VEP options that might be desired can be found here.