Annotating your VCF with VEP
The input to the pVACseq pipeline is a VEP-annotated VCF. VEP annotation adds consequence, transcript, and gene information to your VCF.
To download and install the VEP command line tool, follow the VEP installation instructions.
We recommend the use of the VEP cache for your annotation. The VEP cache can be downloaded following these VEP cache installation instructions. Please ensure that the Ensembl cache version matches the reference build and Ensembl version used in other parts of your analysis (e.g. for RNA-seq gene/transcript abundance estimation).
Download the VEP plugins by cloning their GitHub repository:
git clone https://github.com/Ensembl/VEP_plugins.git
Copy the Wildtype and Frameshift plugins provided with the pVACseq package to the folder with the other VEP plugins by running the following command:
pvacseq install_vep_plugin <VEP plugins directory>
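To confirm that the copy succeeded, a quick check along the following lines can help. This is a minimal sketch and not part of pVACseq: the function name check_pvacseq_plugins is hypothetical, and it assumes the two plugins are installed as Wildtype.pm and Frameshift.pm directly inside the plugins directory.

```shell
# Hypothetical helper (not part of pVACseq or VEP).
# Assumption: the plugins are stored as Wildtype.pm and Frameshift.pm
# directly inside the given VEP plugins directory.
check_pvacseq_plugins() {
    plugins_dir="$1"
    missing=0
    for plugin in Wildtype Frameshift; do
        if [ ! -f "$plugins_dir/$plugin.pm" ]; then
            # Report each missing plugin file on stderr
            echo "missing: $plugins_dir/$plugin.pm" >&2
            missing=1
        fi
    done
    # Exit status 0 if both files exist, 1 otherwise
    return "$missing"
}
```

Calling check_pvacseq_plugins with your VEP plugins directory returns a non-zero status and names the missing file if either plugin is absent.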
Example VEP Command
./vep \
--input_file <input VCF> --output_file <output VCF> \
--format vcf --vcf --symbol --terms SO --tsl --biotype \
--hgvs --fasta <reference build FASTA file location> \
--offline --cache [--dir_cache <VEP cache directory>] \
--plugin Frameshift --plugin Wildtype \
[--dir_plugins <VEP_plugins directory>] [--pick] [--transcript_version]
Required VEP Options
--format vcf --vcf --symbol --terms SO --tsl --biotype --hgvs --fasta <reference build FASTA location> --offline --cache --plugin Frameshift --plugin Wildtype
The --format vcf option specifies that the input file is in VCF format.
The --vcf option will result in the output being written in VCF format.
The --symbol option will include the gene symbol in the annotation.
The --terms SO option will result in Sequence Ontology terms being used for the consequences.
The --tsl option adds transcript support level information to the annotation.
The --biotype option adds the biotype of the transcript or regulatory feature to the annotation.
The --hgvs option will result in HGVS identifiers being added to the annotation. The --hgvs option requires the --fasta argument to specify the location of the reference genome build FASTA file.
The --offline option will eliminate all network connections for speed and/or privacy.
The --cache option will result in the VEP cache being used for annotation.
The --plugin Frameshift option will run the Frameshift plugin, which applies a frameshift mutation to a transcript sequence to compute the full mutated protein sequence.
The --plugin Wildtype option will run the Wildtype plugin, which includes the transcript protein sequence in the annotation.
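After running VEP with these options, it can be worth checking which annotation fields ended up in the output. The sketch below lists the CSQ field names declared in the VCF header, which VEP writes after "Format:" in the CSQ INFO description. The function name csq_fields is hypothetical. We would expect the two plugins to show up as fields named FrameshiftSequence and WildtypeProtein, but treat those names as an assumption and compare against your own output.

```shell
# Hypothetical helper: print the CSQ field names from a VEP-annotated VCF,
# one per line. Reads the first "##INFO=<ID=CSQ" header line and splits the
# pipe-separated list that follows "Format: " in its description.
csq_fields() {
    grep -m1 '^##INFO=<ID=CSQ' "$1" \
        | sed -e 's/.*Format: //' -e 's/".*//' \
        | tr '|' '\n'
}
```

For example, piping the output of csq_fields for your annotated VCF into grep -x WildtypeProtein shows whether that field is declared (again assuming that field name).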
Useful VEP Options
--dir_cache <VEP cache directory> --dir_plugins <VEP_plugins directory> --pick --transcript_version
The --dir_cache <VEP cache directory> option may be needed if the VEP cache was downloaded to a different location than the default. The default location of the VEP cache is at
The --dir_plugins <VEP_plugins directory> option may need to be set depending on where the VEP plugins were installed.
The --pick option might be useful to limit the annotation to the “top” transcript for each variant (the one for which the most dramatic consequence is predicted). Otherwise, VEP will annotate each variant with all possible transcripts. pVACseq will provide predictions for all transcripts in the VEP CSQ field. Running VEP without the --pick option can therefore drastically increase the runtime of pVACseq.
The --transcript_version option will add the transcript version to the transcript identifiers. This option might be needed if you intend to annotate your VCF with expression information, particularly if your expression estimation tool uses versioned transcript identifiers (e.g. ENST00000256474.2).
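To see what --pick and --transcript_version change in practice, the annotated VCF can be inspected directly. The following is a rough sketch with hypothetical function names. It relies on the layout of VEP's VCF output: the CSQ INFO entry holds one comma-separated annotation per transcript, each annotation is pipe-separated, and the header lists the field names after "Format:". The first function reports how many transcript annotations each variant carries. The second lists the distinct identifiers in the Feature field so you can check whether they are versioned.

```shell
# Hypothetical helper: print CHROM, POS and the number of CSQ annotations
# (one per transcript) for every variant in a VEP-annotated VCF.
csq_entry_counts() {
    awk -F'\t' '
        /^#/ { next }                      # skip header lines
        {
            count = 0
            n = split($8, info, ";")       # INFO is the 8th VCF column
            for (i = 1; i <= n; i++) {
                if (info[i] ~ /^CSQ=/) {
                    count = split(info[i], entries, ",")
                }
            }
            print $1 "\t" $2 "\t" count
        }
    ' "$1"
}

# Hypothetical helper: list the distinct values of the CSQ "Feature" field
# (the transcript identifier for transcript annotations).
csq_feature_ids() {
    awk -F'\t' '
        /^##INFO=<ID=CSQ/ {
            # Find the position of "Feature" in the header field list
            fmt = $0
            sub(/.*Format: /, "", fmt)
            sub(/".*/, "", fmt)
            n = split(fmt, names, "|")
            for (i = 1; i <= n; i++) if (names[i] == "Feature") col = i
            next
        }
        /^#/ { next }
        col {
            m = split($8, info, ";")
            for (i = 1; i <= m; i++) {
                if (info[i] ~ /^CSQ=/) {
                    sub(/^CSQ=/, "", info[i])
                    k = split(info[i], entries, ",")
                    for (j = 1; j <= k; j++) {
                        split(entries[j], fields, "|")
                        if (fields[col] != "") print fields[col]
                    }
                }
            }
        }
    ' "$1" | sort -u
}
```

With --pick, every variant reported by csq_entry_counts should show a count of 1. With --transcript_version, the identifiers printed by csq_feature_ids should carry a version suffix such as .2.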
Additional VEP options that might be desired can be found here.