Output Files¶

The pVACseq pipeline will write its results in separate folders depending on which prediction algorithms were chosen:

MHC_Class_I: for MHC class I prediction algorithms
MHC_Class_II: for MHC class II prediction algorithms
combined: If both MHC class I and MHC class II prediction algorithms were run, this folder combines the neoepitope predictions from both

Each folder will contain the same list of output files (listed in the order created):

Filters applied to the filtered.tsv file¶

The filtered.tsv file is the all_epitopes file with the following filters applied (in order):

Binding Filter
Coverage Filter
Transcript Filter
Top Score Filter

Please see the Standalone Filter Commands documentation for more information on each individual filter. The standalone filter commands may be useful to reproduce the filtering or to choose different filtering thresholds.

Prediction Algorithms Supporting Presentation Scores¶

MHCflurryEL (Presentation and Processing)
NetMHCpanEL
NetMHCIIpanEL
BigMHC_EL

Prediction Algorithms Supporting Immunogenicity Scores¶

BigMHC_IM
DeepImmuno

Please note that when running pVACseq with only presentation or immunogenicity algorithms, no aggregate report and pVACview files are created.

Prediction Algorithms Supporting Percentile Information¶

pVACseq outputs percentile rank information when provided by a chosen binding affinity, presentation, or immunogenicity prediction algorithm. The following prediction algorithms calculate a percentile rank:

MHCflurry
MHCflurryEL (Presentation)
MHCnuggets
NetMHC
NetMHCcons
NetMHCpan
NetMHCpanEL
NetMHCIIpan
NetMHCIIpanEL
NNalign
PickPocket
SMM
SMMPMBEC
SMMalign

all_epitopes.tsv and filtered.tsv Report Columns¶

Column Name	Description
`Chromosome`	The chromosome of this variant
`Start`	The start position of this variant in the zero-based, half-open coordinate system
`Stop`	The stop position of this variant in the zero-based, half-open coordinate system
`Reference`	The reference allele
`Variant`	The alt allele
`Transcript`	The Ensembl ID of the affected transcript
`Transcript Support Level`	The transcript support level (TSL) of the affected transcript. `Not Supported` if the VCF entry doesn’t contain TSL information.
`Transcript Length`	The protein sequence length of the affected transcript
`Biotype`	The biotype of the affected transcript
`Ensembl Gene ID`	The Ensembl ID of the affected gene
`Variant Type`	The type of variant. `missense` for missense mutations, `inframe_ins` for inframe insertions, `inframe_del` for inframe deletions, and `FS` for frameshift variants
`Mutation`	The amino acid change of this mutation
`Protein Position`	The protein position of the mutation
`Gene Name`	The Ensembl gene name of the affected gene
`HGVSc`	The HGVS coding sequence variant name
`HGVSp`	The HGVS protein sequence variant name
`HLA Allele`	The HLA allele for this prediction
`Peptide Length`	The peptide length of the epitope
`Sub-peptide Position`	The one-based position of the epitope within the protein sequence used to make the prediction
`Mutation Position`	A comma-separated list of all amino acid positions in the `MT Epitope Seq` that are different from the `WT Epitope Seq`. `NA` if the `WT Epitope Seq` is `NA`.
`MT Epitope Seq`	The mutant epitope sequence
`WT Epitope Seq`	The wildtype (reference) epitope sequence at the same position in the full protein sequence. `NA` if there is no wildtype sequence at this position or if more than half of the amino acids of the mutant epitope are mutated
`Best MT IC50 Score Method`	Prediction algorithm with the lowest mutant ic50 binding affinity for this epitope
`Best MT IC50 Score`	Lowest ic50 binding affinity of all prediction algorithms used
`Corresponding WT IC50 Score`	ic50 binding affinity of the wildtype epitope. `NA` if there is no `WT Epitope Seq`.
`Corresponding Fold Change`	`Corresponding WT IC50 Score` / `Best MT IC50 Score`. `NA` if there is no `WT Epitope Seq`.
`Best MT Percentile Method`	Prediction algorithm with the lowest binding affinity percentile rank for this epitope
`Best MT Percentile`	Lowest percentile rank of this epitope’s ic50 binding affinity of all prediction algorithms used (those that provide percentile output)
`Corresponding WT Percentile`	binding affinity percentile rank of the wildtype epitope. `NA` if there is no `WT Epitope Seq`.
`Tumor DNA Depth`	Tumor DNA depth at this position. `NA` if VCF entry does not contain tumor DNA readcount annotation.
`Tumor DNA VAF`	Tumor DNA variant allele frequency (VAF) at this position. `NA` if VCF entry does not contain tumor DNA readcount annotation.
`Tumor RNA Depth`	Tumor RNA depth at this position. `NA` if VCF entry does not contain tumor RNA readcount annotation.
`Tumor RNA VAF`	Tumor RNA variant allele frequency (VAF) at this position. `NA` if VCF entry does not contain tumor RNA readcount annotation.
`Normal Depth`	Normal DNA depth at this position. `NA` if VCF entry does not contain normal DNA readcount annotation.
`Normal VAF`	Normal DNA variant allele frequency (VAF) at this position. `NA` if VCF entry does not contain normal DNA readcount annotation.
`Gene Expression`	Gene expression value for the annotated gene containing the variant. `NA` if VCF entry does not contain gene expression annotation.
`Transcript Expression`	Transcript expression value for the annotated transcript containing the variant. `NA` if VCF entry does not contain transcript expression annotation.
`Median MT IC50 Score`	Median ic50 binding affinity of the mutant epitope across all prediction algorithms used
`Median WT IC50 Score`	Median ic50 binding affinity of the wildtype epitope across all prediction algorithms used. `NA` if there is no `WT Epitope Seq`.
`Median Fold Change`	`Median WT IC50 Score` / `Median MT IC50 Score`. `NA` if there is no `WT Epitope Seq`.
`Median MT Percentile`	Median binding affinity percentile rank of the mutant epitope across all prediction algorithms (those that provide percentile output)
`Median WT Percentile`	Median binding affinity percentile rank of the wildtype epitope across all prediction algorithms used (those that provide percentile output). `NA` if there is no `WT Epitope Seq`.
`Individual Prediction Algorithm WT and MT IC50 Scores and Percentiles` (multiple)	ic50 binding affinity and percentile ranks for the `MT Epitope Seq` and `WT Epitope Seq` for the individual prediction algorithms used
`MHCflurryEL WT and MT Processing Score and Presentation Score and Percentile` (optional)	MHCflurry processing score and presentation score and percentiles for the `MT Epitope Seq` and `WT Epitope Seq` if the run included MHCflurryEL as one of the prediction algorithms
`Index`	A unique identifier for this variant-transcript combination
`Problematic Positions` (optional)	A list of positions in the `MT Epitope Seq` that match the problematic amino acids defined by the `--problematic-amino-acids` parameter
`Gene of Interest` (T/F)	Is the `Gene Name` found in the genes of interest list?
`cterm_7mer_gravy_score`	Mean hydropathy of last 7 residues on the C-terminus of the peptide
`max_7mer_gravy_score`	Max GRAVY score of any kmer in the amino acid sequence. Used to determine if there are any extremely hydrophobic regions within a longer amino acid sequence.
`difficult_n_terminal_residue` (T/F)	Is the N-terminal amino acid a Glutamine, Glutamic acid, or Cysteine?
`c_terminal_cysteine` (T/F)	Is the C-terminal amino acid a Cysteine?
`c_terminal_proline` (T/F)	Is the C-terminal amino acid a Proline?
`cysteine_count`	Number of Cysteines in the amino acid sequence. Problematic because they can form disulfide bonds across distant parts of the peptide
`n_terminal_asparagine` (T/F)	Is the N-terminal amino acid an Asparagine?
`asparagine_proline_bond_count`	Number of Asparagine-Proline bonds. Problematic because they can spontaneously cleave the peptide
`Best Cleavage Position` (optional)	Position of the highest predicted cleavage score
`Best Cleavage Score` (optional)	Highest predicted cleavage score
`Cleavage Sites` (optional)	List of all cleavage positions and their cleavage score
`Predicted Stability` (optional)	Stability of the pMHC-I complex
`Half Life` (optional)	Half-life of the pMHC-I complex
`Stability Rank` (optional)	The % rank stability of the pMHC-I complex
`NetMHCstab allele` (optional)	Nearest neighbor to the `HLA Allele`. Used for NetMHCstab prediction
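
As a rough illustration of how the `Corresponding Fold Change` column relates to the two IC50 columns, the sketch below reads a tab-separated report with Python's standard `csv` module and recomputes the ratio. The helper name `fold_change`, the in-memory sample rows, and the peptide sequences are made up for this example; only the column names and the `NA` convention come from the table above.

```python
import csv
import io


def fold_change(wt_score, mt_score):
    """Return WT / MT as a float, or "NA" when either score is missing."""
    if wt_score == "NA" or mt_score == "NA":
        return "NA"
    return float(wt_score) / float(mt_score)


# Tiny in-memory stand-in for an all_epitopes.tsv file (values are invented).
SAMPLE_TSV = (
    "MT Epitope Seq\tBest MT IC50 Score\tCorresponding WT IC50 Score\n"
    "KLPEPCPST\t50.0\t1000.0\n"
    "QYDPVAALF\t120.0\tNA\n"
)

rows = list(csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t"))
fold_changes = [
    fold_change(row["Corresponding WT IC50 Score"], row["Best MT IC50 Score"])
    for row in rows
]
print(fold_changes)
```

To run this against a real report, the `io.StringIO` object could be replaced with a file handle opened on the TSV. The same pattern applies to `Median Fold Change` using the median columns.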

all_epitopes.aggregated.tsv Report Columns¶

The all_epitopes.aggregated.tsv file is an aggregated version of the all_epitopes TSV. It shows the best-scoring epitope for each variant, and outputs additional binding affinity, expression, and coverage information for that epitope. It also gives information about the total number of well-scoring epitopes for each variant, the number of transcripts covered by those epitopes, as well as the HLA alleles that those epitopes are well-binding to. Lastly, the report will bin variants into tiers that offer suggestions as to the suitability of variants for use in vaccines.

Additionally, a metrics.json file gets created, containing metadata about the Best Peptide as well as alternate neoantigen candidates for each variant. This file can be loaded into pVACview in conjunction with the aggregated report in order to visualize the candidates. In order to limit the size of the metrics.json file, only a limited number of neoantigen candidates are included in this file. Only neoantigen candidates meeting the --aggregate-inclusion-binding-threshold are included in this file (default: 5000). If the number of unique epitopes for a mutation meeting this threshold exceeds the --aggregate-inclusion-count-limit, only the top n epitopes up to this limit are included (default: 15). The method for selecting the top n epitopes is analogous to the one used to determine the best-scoring epitope. For each epitope of a mutation, all result entries (i.e. for different HLA alleles and transcripts) meeting the --aggregate-inclusion-binding-threshold are considered and the best entry is selected. The best entries selected for each epitope are then sorted by the transcript biotype, the transcript support level, whether or not the anchor criteria were passed, the MT IC50 score, the transcript length, and the MT percentile. From this sorted list the top n entries are selected up to the --aggregate-inclusion-count-limit.
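
The threshold-then-limit idea can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the pVACseq implementation: it ranks only by the MT IC50 score and leaves out the other sort keys listed above (biotype, transcript support level, anchor criteria, transcript length, percentile). The function name `select_included`, the dictionary keys, the example entries, and the use of a strict `<` comparison against the threshold are all assumptions made for this example.

```python
def select_included(entries, binding_threshold=5000, count_limit=15):
    """Pick the best entry per epitope under the threshold, then keep the top n.

    Simplification: "best" and the final ranking use the IC50 value only.
    """
    best_per_epitope = {}
    for entry in entries:
        # Assumption: an entry "meets" the threshold when its IC50 is below it.
        if entry["ic50"] >= binding_threshold:
            continue
        current = best_per_epitope.get(entry["epitope"])
        if current is None or entry["ic50"] < current["ic50"]:
            best_per_epitope[entry["epitope"]] = entry
    ranked = sorted(best_per_epitope.values(), key=lambda e: e["ic50"])
    return ranked[:count_limit]


# Invented entries: the same epitope can appear for several HLA alleles.
example_entries = [
    {"epitope": "KLPEPCPST", "allele": "HLA-A*02:01", "ic50": 100.0},
    {"epitope": "KLPEPCPST", "allele": "HLA-B*07:02", "ic50": 50.0},
    {"epitope": "QYDPVAALF", "allele": "HLA-A*02:01", "ic50": 6000.0},
    {"epitope": "LPEPCPSTV", "allele": "HLA-A*02:01", "ic50": 300.0},
    {"epitope": "PEPCPSTVK", "allele": "HLA-A*02:01", "ic50": 4000.0},
]

included = select_included(example_entries, count_limit=2)
print([(e["epitope"], e["allele"]) for e in included])
```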

If the Best Peptide does not meet the aggregate inclusion criteria, it will still be included in the metrics.json file and counted in the Num Included Peptides.

Whether the median or the lowest binding affinity metrics are used for determining the included epitopes, selecting the best-scoring epitope, and which values are output in the IC50 MT, IC50 WT, %ile MT, and %ile WT columns is controlled by the --top-score-metric parameter.

Column Name	Description
`ID`	A unique identifier for the variant
`Index`	A unique identifier for the variant and Best Transcript
HLA Alleles (multiple)	For each HLA allele in the run, the number of this variant’s epitopes that bound well to the HLA allele (with median/lowest mutant binding affinity < binding_threshold)
`Gene`	The Ensembl gene name of the affected gene
`AA Change`	The amino acid change for the mutation
`Num Passing Transcripts`	The number of transcripts for this mutation that resulted in at least one well-binding peptide (median/lowest mutant binding affinity < 500).
`Best Peptide`	The best-binding mutant epitope sequence (see Best Peptide Criteria below for more details on how this is determined)
`Best Transcript`	The best transcript of all transcripts coding for the Best Peptide (see Best Peptide Criteria below for more details on how this is determined)
`MANE Select` (True/False/Not Run)	Whether or not the Best Transcript is the MANE Select transcript. `Not Run` if VCF was VEP-annotated without the `--mane_select` flag.
`Canonical` (True/False/Not Run)	Whether or not the Best Transcript is the Canonical transcript. `Not Run` if VCF was VEP-annotated without the `--canonical` flag.
`TSL`	The Transcript Support Level of the Best Transcript. `Not Supported` if the reference is GRCh37 or older.
`Allele`	The Allele that the Best Peptide is binding to
`Pos`	A comma-separated list of all amino acid positions in the `MT Epitope Seq` that are different from the `WT Epitope Seq`. `NA` if the `WT Epitope Seq` is `NA`.
`Prob Pos`	A list of positions in the Best Peptide that are problematic. `None` if none of the Best Peptide amino acids are problematic or if the `--problematic-amino-acids` parameter was not set during the pVACseq run.
`Num Included Peptides`	The number of included peptides according to the `--aggregate-inclusion-binding-threshold` and `--aggregate-inclusion-count-limit`
`Num Passing Peptides`	The number of included peptides for this mutation that are well-binding.
`IC50 MT`	Median or lowest ic50 binding affinity of the best-binding mutant epitope across all prediction algorithms used
`IC50 WT`	Median or lowest ic50 binding affinity of the corresponding wildtype epitope across all prediction algorithms used.
`%ile MT`	Median or lowest binding affinity percentile rank of the best-binding mutant epitope across all prediction algorithms used (those that provide percentile output)
`%ile WT`	Median or lowest binding affinity percentile rank of the corresponding wildtype epitope across all prediction algorithms used (those that provide percentile output)
`RNA Expr`	Gene expression value for the annotated gene containing the variant.
`RNA VAF`	Tumor RNA variant allele frequency (VAF) at this position.
`Allele Expr`	RNA Expr * RNA VAF
`RNA Depth`	Tumor RNA depth at this position.
`DNA VAF`	Tumor DNA variant allele frequency (VAF) at this position.
`Tier`	A tier suggesting the suitability of variants for use in vaccines.
`Ref Match` (True/False/Not Run)	Whether or not there is a match of the mutated peptide sequence to the reference proteome. `Not Run` if the `--run-reference-proteome-similarity` flag was not set during the pVACseq run.
`Evaluation`	Column to store the evaluation of each variant when evaluating the run in pVACview. Either `Accept`, `Reject`, or `Review`.

<sample_name>_predict_pvacview.tsv Report Columns¶

The <sample_name>_predict_pvacview.tsv file is generated when using the add_ml_predictions tool or when running pVACseq with both MHC Class I and Class II predictions and the --run-ml-predictions flag enabled. This file contains all columns from the Class I aggregated file (all_epitopes.aggregated.tsv) with one additional ML prediction column added.

The file is written to the ml_predict subdirectory within the output directory.

All columns from all_epitopes.aggregated.tsv

All columns described in the all_epitopes.aggregated.tsv Report Columns section above are included in this file.

Evaluation

Populated with ML-predicted evaluation status for each candidate. Values: Accept for variants with prediction probability >= threshold_accept (default: 0.55), Reject for variants with prediction probability <= threshold_reject (default: 0.30), and Pending for variants with prediction probability between threshold_reject and threshold_accept or when the ML model cannot make a prediction due to missing data.

ML Prediction (score)

ML-based prediction evaluation with probability score. Format: "<Evaluation> (<probability_score>)" (e.g., "Accept (0.72)", "Reject (0.15)", "Review (0.48)"). Shows "NA" when the ML model cannot make a prediction due to missing data (e.g., when Class I and Class II aggregated files have different numbers of rows).
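
The threshold logic for the Evaluation and ML Prediction (score) columns can be sketched as follows. The function names are hypothetical, and formatting the probability with two decimals is an assumption based on the examples above. The status for probabilities between the two thresholds is written as `Pending` here, following the Evaluation description; the format example above shows `Review` for the same situation, so the exact label should be checked against real output.

```python
def ml_evaluation(probability, threshold_accept=0.55, threshold_reject=0.30):
    """Map a prediction probability to an evaluation status.

    `probability` is None when the model could not make a prediction.
    """
    if probability is None:
        return "Pending"
    if probability >= threshold_accept:
        return "Accept"
    if probability <= threshold_reject:
        return "Reject"
    return "Pending"


def ml_prediction_label(probability):
    """Build the "<Evaluation> (<probability_score>)" string, or "NA"."""
    if probability is None:
        return "NA"
    return f"{ml_evaluation(probability)} ({probability:.2f})"


for p in (0.72, 0.15, 0.48, None):
    print(p, ml_evaluation(p), ml_prediction_label(p))
```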

Best Peptide Criteria¶

To determine the Best Peptide, all peptides meeting the --aggregate-inclusion-binding-threshold and --aggregate-inclusion-count-limit (see above) for a variant are evaluated as follows:

If the --allow-incomplete-transcripts flag is set, pick the entries that have no Transcript CDS Flags set.
Of the remaining entries, pick the entries where the Biotype is protein_coding.
Of the remaining entries, pick the entries that pass at least one of the transcript criteria selected in the --transcript-prioritization-strategy taking into consideration the --maximum-transcript-support-level if tsl is one of the selected criteria.
Of the remaining entries, pick the entries with no Problematic Positions.
Of the remaining entries, pick the ones passing the Anchor Criteria (see Criteria Details section below)
Sort the remaining entries by lowest Median|Best MT IC50 Score|Percentile (depending on the selected --top-score-metric and --top-score-metric2), MANE Select (True), Canonical (True), Transcript Support Level, Transcript Length, and Transcript Expression. Select the highest sorted entry.

The pVACseq Aggregate Report Tiers¶

Tiering Parameters¶

To tier the Best Peptide, several cutoffs can be adjusted using arguments provided to the pVACseq run:

Parameter	Description	Default
`--binding-threshold`	The threshold used for filtering epitopes on the IC50 MT binding affinity.	500
`--allele-specific-binding-thresholds`	Instead of the hard cutoff set by the `--binding-threshold`, use allele-specific binding thresholds. For alleles where no allele-specific binding threshold is available, use the `--binding-threshold` as a fallback. To print a list of alleles that have specific binding thresholds and the value of those thresholds, run `pvacseq allele_specific_cutoffs`.	False
`--percentile-threshold`	When set, use this threshold to filter epitopes on the %ile MT score in addition to having to meet the binding threshold.	None
`--percentile-threshold-strategy`	Specify the candidate inclusion strategy. The `conservative` option requires a candidate to pass BOTH the binding threshold and percentile threshold (if set). The `exploratory` option requires a candidate to pass EITHER the binding threshold or the percentile threshold.	conservative
`--tumor-purity`	Value between 0 and 1 indicating the fraction of tumor cells in the tumor sample. Information is used for a simple estimation of whether variants are subclonal or clonal based on VAF. If not provided, purity is estimated directly from the VAFs.	None
`--trna-vaf`	Tumor RNA VAF Cutoff. Used to calculate the allele expression cutoff for tiering.	0.25
`--trna-cov`	Tumor RNA Coverage Cutoff. Used as a cutoff for tiering.	10
`--expn-val`	Gene and Transcript Expression cutoff. Used to calculate the allele expression cutoff for tiering.	1.0
`--transcript-prioritization-strategy`	Which transcript-specific criteria to consider to pass a transcript.	[‘mane_select’, ‘canonical’, ‘tsl’]
`--maximum-transcript-support-level`	The threshold to evaluate an epitope’s best transcript on the Ensembl transcript support level (TSL). Transcript support level needs to be <= this cutoff to be included in most tiers when tsl is included as a transcript prioritization strategy.	1
`--allele-specific-anchors`	Use allele-specific anchor positions when tiering epitopes in the aggregate report. This option is available for 8, 9, 10, and 11mers and only for HLA-A, B, and C alleles. If this option is not enabled or as a fallback for unsupported lengths and alleles, the default positions of [1, 2, epitope length - 1, and epitope length] are used. Please see https://doi.org/10.1101/2020.12.08.416271 for more details.	False
`--anchor-contribution-threshold`	For determining allele-specific anchors, each position is assigned a score based on how binding is influenced by mutations. From these scores, the relative contribution of each position to the overall binding is calculated. Starting with the highest relative contribution, positions whose score together account for the selected contribution threshold are assigned as anchor locations. As a result, a higher threshold leads to the inclusion of more positions to be considered anchors.	0.8
`--run-reference-proteome-similarity`	Set this flag in order to run reference proteome similarity analysis and enable `RefMatch` tiering. Use `--blastp-path`, `--blastp-db`, and `--peptide-fasta` parameters to configure your run.	False
`--problematic-amino-acids`	Configure this parameter in order to define amino acids problematic for the desired therapy delivery platform and enable `ProbPos` tiering.	None
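
The cumulative selection described for `--anchor-contribution-threshold` can be illustrated with a short Python sketch. The function name `anchor_positions`, the per-position scores, and the exact stopping rule (stop once the accumulated relative contribution reaches the threshold) are assumptions for this example; the real allele-specific scores come from the publication linked above.

```python
def anchor_positions(scores, contribution_threshold=0.8):
    """Return the positions whose relative contributions reach the threshold.

    `scores` maps a 1-based peptide position to a hypothetical anchor score.
    """
    total = sum(scores.values())
    # Highest-scoring positions first.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    anchors = []
    cumulative = 0.0
    for position, score in ranked:
        if cumulative >= contribution_threshold:
            break
        anchors.append(position)
        cumulative += score / total
    return sorted(anchors)


# Invented scores for a 9-mer where positions 2 and 9 dominate binding.
example_scores = {1: 4, 2: 45, 3: 3, 4: 2, 5: 2, 6: 1, 7: 1, 8: 2, 9: 40}

print(anchor_positions(example_scores, 0.8))
print(anchor_positions(example_scores, 0.9))
```

With these scores, raising the threshold from 0.8 to 0.9 adds positions 1 and 3 to the anchor set, which matches the statement that a higher threshold leads to more positions being treated as anchors.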

Tiers¶

Given the thresholds provided above, the Best Peptide is evaluated and binned into a tier as follows:

Tier	Criteria
`Pass`	Best Peptide passes the binding, reference match, expression, transcript, clonal, problematic position, and anchor criteria
`PoorBinder`	Best Peptide fails the binding criteria but passes the reference match, expression, transcript, clonal, problematic position, and anchor criteria
`RefMatch`	Best Peptide fails the reference match criteria but passes the binding, expression, transcript, clonal, problematic position, and anchor criteria
`PoorTranscript`	Best Peptide fails the transcript criteria but passes the binding, reference match, expression, clonal, problematic position, and anchor criteria
`LowExpr`	Best Peptide meets the low expression criteria and passes the binding, reference match, transcript, clonal, problematic position, and anchor criteria
`Anchor`	Best Peptide fails the anchor criteria but passes the binding, reference match, expression, transcript, clonal, and problematic position criteria
`Subclonal`	Best Peptide fails the clonal criteria but passes the binding, reference match, expression, transcript, problematic position, and anchor criteria
`ProbPos`	Best Peptide fails the problematic position criteria but passes the binding, reference match, expression, transcript, clonal, and anchor criteria
`Poor`	Best Peptide doesn’t fit in any of the above tiers, usually if it fails two or more criteria
`NoExpr`	Best Peptide is not expressed (RNA Expr == 0 or RNA VAF == 0)

Criteria Details¶

Criteria	Description	Evaluation Logic
Binding Criteria	Pass if Best Peptide is a strong binder	Binding score criteria: `IC50 MT < binding_threshold`. Percentile score criteria (if the `--percentile-threshold` parameter is set): `%ile MT < percentile_threshold`. With the `conservative` `--percentile-threshold-strategy`, the Best Peptide needs to pass BOTH the binding score criteria AND the percentile score criteria. With the `exploratory` `--percentile-threshold-strategy`, it needs to pass EITHER the binding score criteria OR the percentile score criteria.
Expression Criteria	Pass if Best Transcript is expressed	`Allele Expr > trna_vaf * expn_val`
Reference Match Criteria	Pass if there are no reference proteome matches	`Ref Match == False`
Transcript Criteria	Pass if Best Transcript matches any of the user-specified `--transcript-prioritization-strategy` criteria	`TSL <= maximum_transcript_support_level` (if `--transcript-prioritization-strategy` includes `tsl`); `MANE Select == True` (if `--transcript-prioritization-strategy` includes `mane_select`); `Canonical == True` (if `--transcript-prioritization-strategy` includes `canonical`)
Low Expression Criteria	Best Peptide has low expression, or no expression but sufficient RNA VAF and coverage	`(0 < Allele Expr < trna_vaf * expn_val) OR (RNA Expr == 0 AND RNA Depth > trna_cov AND RNA VAF > trna_vaf)`
Anchor Criteria	Fail if there are <= 2 mutated amino acids and all mutated amino acids of the Best Peptide (`Pos`) are at an anchor position and the WT peptide has good binding `(IC50 WT < binding_threshold)`
Clonal Criteria	Best Peptide is likely in the founding clone of the tumor	`DNA VAF > tumor_purity / 4`
Problematic Position Criteria	Best Peptide does not contain a problematic amino acid as defined by the `--problematic-amino-acids` parameters	`Prob Pos == None`
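
Several of the evaluation-logic expressions in the table above translate directly into small predicates. The Python sketch below covers the binding, expression, low expression, and clonal criteria. The function names and argument names are hypothetical, the defaults mirror the parameter table, and the way a missing percentile is handled is an assumption. It does not attempt to reproduce the full tier assignment.

```python
def passes_binding(ic50_mt, binding_threshold=500, percentile_mt=None,
                   percentile_threshold=None, strategy="conservative"):
    """Binding criteria, including the optional percentile threshold."""
    binding_ok = ic50_mt < binding_threshold
    if percentile_threshold is None:
        return binding_ok
    # Assumption: a missing percentile counts as failing the percentile check.
    percentile_ok = (percentile_mt is not None
                     and percentile_mt < percentile_threshold)
    if strategy == "conservative":
        return binding_ok and percentile_ok
    return binding_ok or percentile_ok  # "exploratory"


def passes_expression(allele_expr, trna_vaf=0.25, expn_val=1.0):
    """Expression criteria: Allele Expr > trna_vaf * expn_val."""
    return allele_expr > trna_vaf * expn_val


def meets_low_expression(allele_expr, rna_expr, rna_depth, rna_vaf,
                         trna_vaf=0.25, expn_val=1.0, trna_cov=10):
    """Low expression criteria as written in the table above."""
    low_but_expressed = 0 < allele_expr < trna_vaf * expn_val
    unexpressed_with_support = (rna_expr == 0 and rna_depth > trna_cov
                                and rna_vaf > trna_vaf)
    return low_but_expressed or unexpressed_with_support


def passes_clonal(dna_vaf, tumor_purity):
    """Clonal criteria: DNA VAF > tumor_purity / 4."""
    return dna_vaf > tumor_purity / 4


print(passes_binding(120.0))
print(passes_binding(120.0, percentile_mt=3.0, percentile_threshold=2.0))
print(passes_binding(120.0, percentile_mt=3.0, percentile_threshold=2.0,
                     strategy="exploratory"))
```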

The pVACseq Aggregate Report Sorting¶

The aggregate report is sorted as follows:

Sort Criteria	Sort Order
`Tier` column	“Pass”, “PoorBinder”, “RefMatch”, “PoorTranscript”, “LowExpr”, “Anchor”, “Subclonal”, “ProbPos”, “Poor”, “NoExpr”
Ascending rank of `Allele Expr` column + ascending rank of either `IC50 MT` column (if `--top-score-metric` is `ic50`) or `%ile MT` column (if `--top-score-metric` is `percentile`)	Ascending sum rank
`Gene` column	Alphabetical
`AA Change` column	Alphabetical

aggregated.tsv.reference_matches Report Columns¶

This file is only generated when the --run-reference-proteome-similarity option is chosen.

Column Name	Description (BLAST)	Description (reference fasta)
`Chromosome`	The chromosome of this variant
`Start`	The start position of this variant in the zero-based, half-open coordinate system
`Stop`	The stop position of this variant in the zero-based, half-open coordinate system
`Reference`	The reference allele
`Variant`	The alt allele
`Transcript`	The Ensembl ID of the affected transcript
`MT Epitope Seq`	The mutant peptide sequence for the epitope candidate
`Peptide`	The peptide sequence submitted to BLAST	The peptide sequence to search for in the reference proteome
`Hit ID`	The BLAST alignment hit ID (reference proteome sequence ID)	The FASTA header ID of the entry where the match was made
`Hit Definition`	The BLAST alignment hit definition (reference proteome sequence name)	The FASTA header description of the entry where the match was made
`Match Window`	The substring of the `Peptide` that was found in the `Match Sequence`
`Match Sequence`	The BLAST match sequence	The FASTA sequence of the entry where the match was made
`Match Start`	The match start position of the `Match Window` in the `Match Sequence`
`Match Stop`	The match stop position of the `Match Window` in the `Match Sequence`
